Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:

 In a separate thread, I've detailed how an optimize is taking > 2x disk
 space. We don't use solr distribution/snapshooter.  We are using the default
 deletion policy = 1. We can't optimize a 192G index in 400GB of space.

 This thread in lucene/java-user

 http://www.gossamer-threads.com/lists/lucene/java-user/43475

 suggests that an optimize should not take > 2x unless perhaps an IndexReader
 is holding on to segments. This could be our problem since when optimization
 runs out of space, if we stop tomcat, a number of files go away and space is
 recovered.

 But we are not searching the index so how could a Searcher/IndexReader have
 any segments open?

 I notice in the logs that as part of routine commits or as part of optimize
 a Searcher is registered and autowarmed from a previous searcher (of course
 there's nothing in the caches -- this is just a build machine).

 INFO: registering core:
 Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
 INFO: [] Registered new searcher searc...@2e097617 main

 Does this mean that there's always a lucene IndexReader holding segment
 files open so they can't be deleted during an optimize so we run out of disk
 space > 2x?

Yes.
A feature could probably now be developed that avoids opening a
reader until it's requested.
That wasn't really possible in the past - due to many issues such as
Lucene autocommit.

-Yonik
http://www.lucidimagination.com


Re: How much disk space does optimize really take

2009-10-07 Thread Jason Rutherglen
It would be good to be able to commit without opening a new
reader; however, with Lucene 2.9 the segment readers for all
available segments are already created and available via
getReader, which manages the reference counting internally.

Using reopen redundantly creates SRs (SegmentReaders) that are
already held internally in the IW (IndexWriter).
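A minimal sketch of the difference Jason describes, assuming Lucene 2.9 era APIs (the directory and analyzer setup are illustrative, not from the thread):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtReaderSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // NRT path: reuses the SegmentReaders the writer already holds
        // open internally; reference counting is managed by the writer.
        IndexReader nrtReader = writer.getReader();

        // Redundant path: opening/reopening an external reader creates
        // separate SegmentReaders for segments the writer already has
        // readers for.
        // IndexReader external = IndexReader.open(dir, true);

        nrtReader.close();
        writer.close();
    }
}
```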

On Wed, Oct 7, 2009 at 9:59 AM, Yonik Seeley yo...@lucidimagination.com wrote:
 On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:

 In a separate thread, I've detailed how an optimize is taking > 2x disk
 space. We don't use solr distribution/snapshooter.  We are using the default
 deletion policy = 1. We can't optimize a 192G index in 400GB of space.

 This thread in lucene/java-user

 http://www.gossamer-threads.com/lists/lucene/java-user/43475

 suggests that an optimize should not take > 2x unless perhaps an IndexReader
 is holding on to segments. This could be our problem since when optimization
 runs out of space, if we stop tomcat, a number of files go away and space is
 recovered.

 But we are not searching the index so how could a Searcher/IndexReader have
 any segments open?

 I notice in the logs that as part of routine commits or as part of optimize
 a Searcher is registered and autowarmed from a previous searcher (of course
 there's nothing in the caches -- this is just a build machine).

 INFO: registering core:
 Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
 INFO: [] Registered new searcher searc...@2e097617 main

 Does this mean that there's always a lucene IndexReader holding segment
 files open so they can't be deleted during an optimize so we run out of disk
 space > 2x?

 Yes.
 A feature could probably now be developed that avoids opening a
 reader until it's requested.
 That wasn't really possible in the past - due to many issues such as
 Lucene autocommit.

 -Yonik
 http://www.lucidimagination.com



Re: How much disk space does optimize really take

2009-10-07 Thread Shalin Shekhar Mangar
On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 It would be good to be able to commit without opening a new
 reader; however, with Lucene 2.9 the segment readers for all
 available segments are already created and available via
 getReader, which manages the reference counting internally.

 Using reopen redundantly creates SRs that are already held
 internally in IW.


Jason, I think this is something we should consider changing. A user who is
not using NRT features should not pay the price of keeping readers open.
We are also interested in opening a searcher just-in-time for SOLR-1293. We
have use-cases where a SolrCore is loaded only for indexing and then
unloaded.

-- 
Regards,
Shalin Shekhar Mangar.


Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
I think that argument requires auto commit to be on and opening readers
after the optimize starts? Otherwise, the optimized version is not put
into place until a commit is called, and a Reader won't see the newly
merged segments until then - so the original index is kept around in
either case - having a Reader open on it shouldn't affect the space
requirements?

Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:
   
 In a separate thread, I've detailed how an optimize is taking > 2x disk
 space. We don't use solr distribution/snapshooter.  We are using the default
 deletion policy = 1. We can't optimize a 192G index in 400GB of space.

 This thread in lucene/java-user

 http://www.gossamer-threads.com/lists/lucene/java-user/43475

 suggests that an optimize should not take > 2x unless perhaps an IndexReader
 is holding on to segments. This could be our problem since when optimization
 runs out of space, if we stop tomcat, a number of files go away and space is
 recovered.

 But we are not searching the index so how could a Searcher/IndexReader have
 any segments open?

 I notice in the logs that as part of routine commits or as part of optimize
 a Searcher is registered and autowarmed from a previous searcher (of course
 there's nothing in the caches -- this is just a build machine).

 INFO: registering core:
 Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
 INFO: [] Registered new searcher searc...@2e097617 main

 Does this mean that there's always a lucene IndexReader holding segment
 files open so they can't be deleted during an optimize so we run out of disk
 space > 2x?
 

 Yes.
 A feature could probably now be developed that avoids opening a
 reader until it's requested.
 That wasn't really possible in the past - due to many issues such as
 Lucene autocommit.

 -Yonik
 http://www.lucidimagination.com
   


-- 
- Mark

http://www.lucidimagination.com





Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber



Yonik Seeley wrote:



Does this mean that there's always a lucene IndexReader holding segment
files open so they can't be deleted during an optimize so we run out of disk
space > 2x?


Yes.
A feature could probably now be developed that avoids opening a
reader until it's requested.
That wasn't really possible in the past - due to many issues such as
Lucene autocommit.



So this implies that for a normal optimize, in every case, due to the
Searcher holding open the existing segments prior to optimize, we'd
always need 3x even in the normal case.


This seems wrong, since it is repeatedly stated that in the normal case
only 2x is needed, and I have successfully optimized a similar sized 192G
index on identical hardware with a 400G capacity.


Yonik, I'm uncertain then about what you're saying about the required disk
space for optimize.  Could you clarify?





-Yonik
http://www.lucidimagination.com


Re: How much disk space does optimize really take

2009-10-07 Thread Jason Rutherglen
To be clear, the SRs created by merges don't have the term index
loaded, which is the main cost.  One would need to use
IndexReaderWarmer to load the term index before the new SR becomes
part of SegmentInfos.
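The warmer hook Jason mentions can be sketched roughly like this, assuming the Lucene 2.9 era IndexWriter.IndexReaderWarmer API (the writer setup is illustrative, not from the thread):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class WarmerSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Warm each newly merged segment before it is published, so the
        // first reader over it doesn't pay the term-index loading cost.
        writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
            public void warm(IndexReader reader) throws IOException {
                reader.terms().close(); // touch the term index
            }
        });
        writer.close();
    }
}
```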

On Wed, Oct 7, 2009 at 10:34 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 It would be good to be able to commit without opening a new
 reader; however, with Lucene 2.9 the segment readers for all
 available segments are already created and available via
 getReader, which manages the reference counting internally.

 Using reopen redundantly creates SRs that are already held
 internally in IW.


 Jason, I think this is something we should consider changing. A user who is
 not using NRT features should not pay the price of keeping readers open.
 We are also interested in opening a searcher just-in-time for SOLR-1293. We
 have use-cases where a SolrCore is loaded only for indexing and then
 unloaded.

 --
 Regards,
 Shalin Shekhar Mangar.



Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote:
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segment prior to optimize that we'd
 always need 3x even in the normal case.

 This seems wrong since it is repeatedly stated that in the normal case only 2x
 is needed and I have successfully optimized a similar sized 192G index on
 identical hardware with a 400G capacity.

2x for the IndexWriter only.
Having an open index reader can increase that somewhat... 3x is the
absolute worst case, I think, and that can currently be avoided by first
calling commit and then calling optimize.  This way the open
reader will only be holding references to segments that wouldn't be
deleted until the optimize is complete anyway.


-Yonik
http://www.lucidimagination.com
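The commit-then-optimize ordering Yonik describes can be sketched as follows, assuming Lucene 2.9 era APIs (the path and analyzer are illustrative, not from the thread):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CommitThenOptimize {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // ... add documents ...

        // Commit first, so an open reader only references segments that
        // belong to the commit point the optimize starts from...
        writer.commit();

        // ...then optimize: the reader now only pins segments that must
        // stay on disk until the optimize completes anyway, keeping the
        // worst case near 2x rather than 3x.
        writer.optimize();
        writer.close();
    }
}
```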


Re: How much disk space does optimize really take

2009-10-07 Thread Michael McCandless
On Wed, Oct 7, 2009 at 1:34 PM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 It would be good to be able to commit without opening a new
 reader; however, with Lucene 2.9 the segment readers for all
 available segments are already created and available via
 getReader, which manages the reference counting internally.

 Using reopen redundantly creates SRs that are already held
 internally in IW.


 Jason, I think this is something we should consider changing. A user who is
 not using NRT features should not pay the price of keeping readers open.
 We are also interested in opening a searcher just-in-time for SOLR-1293. We
 have use-cases where a SolrCore is loaded only for indexing and then
 unloaded.

This is already true today.

If you don't use NRT then the readers are not held open by Lucene.

Mike


Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber
Wow, this is weird.  I commit before I optimize.  In fact, I bounce
tomcat before I optimize just in case. It makes sense, as you say, that
the open reader can then only be holding references to segments that
wouldn't be deleted until the optimize is complete anyway.


But we're still exceeding 2x. And after the optimize fails, if we then 
do a commit or bounce tomcat, a bunch of segments disappear. I am stumped.


Yonik Seeley wrote:

On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote:

So this implies that for a normal optimize, in every case, due to the
Searcher holding open the existing segment prior to optimize that we'd
always need 3x even in the normal case.

This seems wrong since it is repeatedly stated that in the normal case only 2x
is needed and I have successfully optimized a similar sized 192G index on
identical hardware with a 400G capacity.


2x for the IndexWriter only.
Having an open index reader can increase that somewhat... 3x is the
absolute worst case, I think, and that can currently be avoided by first
calling commit and then calling optimize.  This way the open
reader will only be holding references to segments that wouldn't be
deleted until the optimize is complete anyway.


-Yonik
http://www.lucidimagination.com


Re: How much disk space does optimize really take

2009-10-07 Thread Lance Norskog
Oops, sent before finished.  Partial optimize (aka maxSegments) is a
recent Solr 1.4/Lucene 2.9 feature.

As to 2x vs. 3x, the general wisdom is that an optimize on a simple
index takes at most 2x disk space, and on a compound index takes at
most 3x. Simple is the default (*). At Divvio we had the same
problem and it never took up more than 2x.

If your index disks are really bursting at the seams, you could try
creating an empty index on a separate disk and merging your large
index into that index. The resulting index will be mostly optimized.

Lance Norskog

* in solrconfig.xml:
<useCompoundFile>false</useCompoundFile>
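Lance's merge-into-a-fresh-index suggestion can be sketched like this, assuming Lucene 2.9 era APIs (both paths are illustrative, not from the thread):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeToFreshIndex {
    public static void main(String[] args) throws Exception {
        Directory source = FSDirectory.open(new File("/disk1/big-index"));
        Directory target = FSDirectory.open(new File("/disk2/new-index"));

        // New empty index on the second disk.
        IndexWriter writer = new IndexWriter(target,
                new StandardAnalyzer(Version.LUCENE_29),
                true, // create a fresh index
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Merge the existing index in; the temporary space is consumed
        // on the target disk, not the nearly full source disk.
        writer.addIndexesNoOptimize(new Directory[] { source });

        writer.close(); // resulting index is mostly optimized
    }
}
```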

On 10/7/09, Phillip Farber pfar...@umich.edu wrote:
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce
 tomcat before I optimize just in case. It makes sense, as you say, that
 then the open reader can only be holding references to segments that
 wouldn't be deleted until the optimize is complete anyway.

 But we're still exceeding 2x. And after the optimize fails, if we then
 do a commit or bounce tomcat, a bunch of segments disappear. I am stumped.

 Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote:
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segment prior to optimize that we'd
 always need 3x even in the normal case.

 This seems wrong since it is repeatedly stated that in the normal case only
 2x
 is needed and I have successfully optimized a similar sized 192G index on
 identical hardware with a 400G capacity.

 2x for the IndexWriter only.
 Having an open index reader can increase that somewhat... 3x is the
 absolute worst case, I think, and that can currently be avoided by first
 calling commit and then calling optimize.  This way the open
 reader will only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.


 -Yonik
 http://www.lucidimagination.com



-- 
Lance Norskog
goks...@gmail.com


Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 3:16 PM, Phillip Farber pfar...@umich.edu wrote:
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce tomcat
 before I optimize just in case. It makes sense, as you say, that then the
 open reader can only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.

 But we're still exceeding 2x.

How much over 2x?
It is possible (though relatively rare) for an optimized index to be
larger than a non-optimized index.

-Yonik
http://www.lucidimagination.com


Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
I can't tell why calling a commit or restarting is going to help
anything - or why you need more than 2x in any case. The only reason I
can see this being an issue is if you have turned on auto-commit.
Otherwise the Reader is *always* only referencing what would have to be
around anyway.

You're likely just too close to the edge. There are fragmentation
issues and whatnot when you're dealing with such large files and so little
space above what you need.

Phillip Farber wrote:
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce
 tomcat before I optimize just in case. It makes sense, as you say,
 that then the open reader can only be holding references to segments
 that wouldn't be deleted until the optimize is complete anyway.

 But we're still exceeding 2x. And after the optimize fails, if we then
 do a commit or bounce tomcat, a bunch of segments disappear. I am
 stumped.

 Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu
 wrote:
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segment prior to optimize that we'd
 always need 3x even in the normal case.

 This seems wrong since it is repeatedly stated that in the normal case
 only 2x
 is needed and I have successfully optimized a similar sized 192G
 index on
 identical hardware with a 400G capacity.

 2x for the IndexWriter only.
 Having an open index reader can increase that somewhat... 3x is the
 absolute worst case, I think, and that can currently be avoided by first
 calling commit and then calling optimize.  This way the open
 reader will only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.


 -Yonik
 http://www.lucidimagination.com


-- 
- Mark

http://www.lucidimagination.com





Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
Okay - I think I've got you - you're talking about the case of adding a
bunch of docs, not calling commit, and then trying to optimize. I keep
coming at it from a cold optimize. Making sense to me now.

Mark Miller wrote:
 I can't tell why calling a commit or restarting is going to help
 anything - or why you need more than 2x in any case. The only reason I
 can see this being an issue is if you have turned on auto-commit.
 Otherwise the Reader is *always* only referencing what would have to be
 around anyway.

 You're likely just too close to the edge. There are fragmentation
 issues and whatnot when you're dealing with such large files and so little
 space above what you need.

 Phillip Farber wrote:
   
 Wow, this is weird.  I commit before I optimize.  In fact, I bounce
 tomcat before I optimize just in case. It makes sense, as you say,
 that then the open reader can only be holding references to segments
 that wouldn't be deleted until the optimize is complete anyway.

 But we're still exceeding 2x. And after the optimize fails, if we then
 do a commit or bounce tomcat, a bunch of segments disappear. I am
 stumped.

 Yonik Seeley wrote:
 
 On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu
 wrote:
   
 So this implies that for a normal optimize, in every case, due to the
 Searcher holding open the existing segment prior to optimize that we'd
 always need 3x even in the normal case.

 This seems wrong since it is repeatedly stated that in the normal case
 only 2x
 is needed and I have successfully optimized a similar sized 192G
 index on
 identical hardware with a 400G capacity.
 
 2x for the IndexWriter only.
 Having an open index reader can increase that somewhat... 3x is the
 absolute worst case, I think, and that can currently be avoided by first
 calling commit and then calling optimize.  This way the open
 reader will only be holding references to segments that wouldn't be
 deleted until the optimize is complete anyway.


 -Yonik
 http://www.lucidimagination.com
   


   


-- 
- Mark

http://www.lucidimagination.com





Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 3:31 PM, Mark Miller markrmil...@gmail.com wrote:
 I can't tell why calling a commit or restarting is going to help
 anything

Depends on what scenarios you consider, and what you are taking 2x of.

1) Open reader on index.
2) Open writer and add two documents... the first causes a large
merge, and the second is just to make it a non-optimized index.
  At this point you're already at 2x of your original index size.
3) Call optimize()... this will make a 3rd copy before deleting the 2nd.

-Yonik
http://www.lucidimagination.com
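Yonik's three steps, as a hedged sketch in Lucene 2.9 era terms (the path, document contents, and the claim that one add triggers a large merge are all illustrative, following his hypothetical):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WorstCaseSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));

        // 1) A reader pins the segments of the current commit point.
        IndexReader reader = IndexReader.open(dir, true);

        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // 2) Suppose the first add triggers a large merge: the reader
        //    still pins the pre-merge segments, so we're at ~2x already;
        //    a second add leaves the index non-optimized again.
        Document doc = new Document();
        doc.add(new Field("body", "text", Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.addDocument(doc);

        // 3) optimize() writes a third copy before the second can be
        //    deleted -- the ~3x worst case.
        writer.optimize();

        writer.close();
        reader.close();
    }
}
```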


Re: How much disk space does optimize really take

2009-10-07 Thread Mark Miller
Yonik Seeley wrote:
 On Wed, Oct 7, 2009 at 3:31 PM, Mark Miller markrmil...@gmail.com wrote:
   
 I can't tell why calling a commit or restarting is going to help
 anything
 

 Depends on what scenarios you consider, and what you are taking 2x of.

 1) Open reader on index.
 2) Open writer and add two documents... the first causes a large
 merge, and the second is just to make it a non-optimized index.
   At this point you're already at 2x of your original index size.
 3) Call optimize()... this will make a 3rd copy before deleting the 2nd.

 -Yonik
 http://www.lucidimagination.com
   
Yup - finally hit me what you were talking about. Wasn't considering the
case of adding docs to an existing index, not committing, and then
trying to optimize.

I like trying to take an opposing side from you anyway - it means I know
where I will end up - but you're usually so darn terse, I never know how
long till I end up there.

Anyway, so all you generally *need* is 2x; you just have to make sure
you're not adding docs first without committing them - which I was taking
for granted. That means your comment about calling commit makes perfect sense.

I guess you can't guarantee 2x though, as if you have queries coming in
that take a while, a commit opening a new Reader will not guarantee the
old Reader is quite ready to go away. Might want to wait a short bit
after the commit.

-- 
- Mark

http://www.lucidimagination.com





Re: How much disk space does optimize really take

2009-10-07 Thread Yonik Seeley
On Wed, Oct 7, 2009 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote:
 I guess you can't guarantee 2x though, as if you have queries coming in
 that take a while, a commit opening a new Reader will not guarantee the
 old Reader is quite ready to go away. Might want to wait a short bit
 after the commit.

Right - and in a complete system, there are other things that can also
hold commit points open longer, like index replication.

-Yonik
http://www.lucidimagination.com