Re: How much disk space does optimize really take
On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:
> In a separate thread, I've detailed how an optimize is taking 2x disk
> space. We don't use Solr distribution/snapshooter. We are using the
> default deletion policy = 1. We can't optimize a 192G index in 400GB of
> space.
>
> This thread in lucene/java-user
> http://www.gossamer-threads.com/lists/lucene/java-user/43475 suggests
> that an optimize should not take 2x unless perhaps an IndexReader is
> holding on to segments. This could be our problem, since when the
> optimize runs out of space, if we stop Tomcat, a number of files go
> away and space is recovered. But we are not searching the index, so how
> could a Searcher/IndexReader have any segments open?
>
> I notice in the logs that, as part of routine commits or as part of
> optimize, a Searcher is registered and autowarmed from a previous
> searcher (of course there's nothing in the caches -- this is just a
> build machine).
>
> INFO: registering core:
> Oct 6, 2009 2:16:20 PM org.apache.solr.core.SolrCore registerSearcher
> INFO: [] Registered new searcher searc...@2e097617 main
>
> Does this mean that there's always a Lucene IndexReader holding segment
> files open, so they can't be deleted during an optimize, and we run out
> of disk space at 2x?

Yes. A feature could probably be developed now that avoids opening a reader until it's requested. That wasn't really possible in the past, due to issues such as Lucene autocommit.

-Yonik
http://www.lucidimagination.com
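To make the failure mode concrete, here is a minimal sketch (Lucene 2.9-era API; the index path is a placeholder) of how an open IndexReader pins segment files so their disk space can't be reclaimed, even after the writer has merged them away:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PinnedSegments {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/path/to/index"));

            // This reader holds the current segment files open.
            IndexReader reader = IndexReader.open(dir, true); // read-only

            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_29),
                    false, IndexWriter.MaxFieldLength.UNLIMITED);
            writer.optimize(); // rewrites the whole index into one segment
            writer.close();    // old segments are deleted by the writer...

            // ...but their space is not reclaimed until the reader lets
            // go of the files. This is the "2x that won't go away".
            reader.close();
        }
    }

This mirrors what stopping Tomcat does for Phillip: the embedded reader is closed and the deleted files' space comes back.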
Re: How much disk space does optimize really take
It would be good to be able to commit without opening a new reader. However, with Lucene 2.9 the segment readers (SRs) for all available segments are already created and available via getReader, which manages the reference counting internally. Using reopen redundantly creates SRs that are already held internally in the IndexWriter (IW).

On Wed, Oct 7, 2009 at 9:59 AM, Yonik Seeley yo...@lucidimagination.com wrote:
> On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:
>> [...]
>> Does this mean that there's always a Lucene IndexReader holding
>> segment files open, so they can't be deleted during an optimize, and
>> we run out of disk space at 2x?
>
> Yes. A feature could probably be developed now that avoids opening a
> reader until it's requested. That wasn't really possible in the past,
> due to issues such as Lucene autocommit.
>
> -Yonik
> http://www.lucidimagination.com
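For readers following along, a rough self-contained sketch of the two paths Jason contrasts, assuming Lucene 2.9's NRT API:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    class ReaderPaths {
        // NRT path: the SegmentReaders behind this reader are the ones
        // the IndexWriter already holds and reference-counts internally.
        static IndexReader nrt(IndexWriter writer) throws IOException {
            return writer.getReader();
        }

        // Non-NRT path: reopen() builds fresh SegmentReaders for any new
        // segments -- redundant work if IW already holds equivalents.
        static IndexReader refresh(IndexReader old) throws IOException {
            IndexReader reopened = old.reopen();
            if (reopened != old) {
                old.close(); // release the files the old reader pinned
            }
            return reopened;
        }
    }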
Re: How much disk space does optimize really take
On Wed, Oct 7, 2009 at 10:45 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:
> It would be good to be able to commit without opening a new reader.
> However, with Lucene 2.9 the segment readers for all available segments
> are already created and available via getReader, which manages the
> reference counting internally. Using reopen redundantly creates SRs
> that are already held internally in IW.

Jason, I think this is something we should consider changing. A user who is not using NRT features should not pay the price of keeping readers open. We are also interested in opening a searcher just-in-time for SOLR-1293. We have use-cases where a SolrCore is loaded only for indexing and then unloaded.

--
Regards,
Shalin Shekhar Mangar.
Re: How much disk space does optimize really take
I think that argument requires autocommit to be on, and readers being opened after the optimize starts? Otherwise, the optimized version is not put into place until a commit is called, and a Reader won't see the newly merged segments until then -- so the original index is kept around in either case; having a Reader open on it shouldn't affect the space requirements?

Yonik Seeley wrote:
> On Wed, Oct 7, 2009 at 12:51 PM, Phillip Farber pfar...@umich.edu wrote:
>> [...]
>> Does this mean that there's always a Lucene IndexReader holding
>> segment files open, so they can't be deleted during an optimize, and
>> we run out of disk space at 2x?
>
> Yes. A feature could probably be developed now that avoids opening a
> reader until it's requested. [...]
>
> -Yonik
> http://www.lucidimagination.com

--
- Mark
http://www.lucidimagination.com
Re: How much disk space does optimize really take
Yonik Seeley wrote:
>> Does this mean that there's always a Lucene IndexReader holding
>> segment files open, so they can't be deleted during an optimize, and
>> we run out of disk space at 2x?
>
> Yes. A feature could probably be developed now that avoids opening a
> reader until it's requested. That wasn't really possible in the past,
> due to issues such as Lucene autocommit.
>
> -Yonik
> http://www.lucidimagination.com

So this implies that for a normal optimize, in every case, due to the Searcher holding open the existing segments prior to optimize, we'd always need 3x, even in the normal case. This seems wrong, since it is repeatedly stated that in the normal case only 2x is needed, and I have successfully optimized a similar-sized 192G index on identical hardware with 400G capacity. Yonik, I'm uncertain then about what you're saying about the required disk space for optimize. Could you clarify?
Re: How much disk space does optimize really take
To be clear, the SRs created by merges don't have the term index loaded, which is the main cost. One would need to use IndexReaderWarmer to load the term index before the new SR becomes a part of SegmentInfos.

On Wed, Oct 7, 2009 at 10:34 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
> [...]
> Jason, I think this is something we should consider changing. A user
> who is not using NRT features should not pay the price of keeping
> readers open. We are also interested in opening a searcher just-in-time
> for SOLR-1293. We have use-cases where a SolrCore is loaded only for
> indexing and then unloaded.
>
> --
> Regards,
> Shalin Shekhar Mangar.
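A sketch of the warming hook Jason refers to (Lucene 2.9; the warming action here is illustrative): load the term index on a freshly merged segment's reader before it is published, so the first reader over it doesn't pay that cost at search time:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    class WarmOnMerge {
        static void install(IndexWriter writer) {
            writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
                public void warm(IndexReader reader) throws IOException {
                    // Touching the terms forces the term index to load;
                    // a representative warm-up query would work here too.
                    reader.terms().next();
                }
            });
        }
    }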
Re: How much disk space does optimize really take
On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote:
> So this implies that for a normal optimize, in every case, due to the
> Searcher holding open the existing segments prior to optimize, we'd
> always need 3x, even in the normal case. This seems wrong, since it is
> repeatedly stated that in the normal case only 2x is needed, and I have
> successfully optimized a similar-sized 192G index on identical hardware
> with 400G capacity.

2x is for the IndexWriter only. Having an open index reader can increase that somewhat... 3x is the absolute worst case, I think, and that can currently be avoided by first calling commit and then calling optimize. This way the open reader will only be holding references to segments that wouldn't be deleted until the optimize is complete anyway.

-Yonik
http://www.lucidimagination.com
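From the Solr client side, the ordering Yonik suggests is just two calls; a minimal sketch with SolrJ 1.4-era classes (the URL is a placeholder):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class CommitThenOptimize {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.commit();   // new searcher now references only segments
                               // that must survive the optimize anyway
            server.optimize(); // worst case stays near 2x instead of 3x
        }
    }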
Re: How much disk space does optimize really take
On Wed, Oct 7, 2009 at 1:34 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
> [...]
> Jason, I think this is something we should consider changing. A user
> who is not using NRT features should not pay the price of keeping
> readers open.

This is already true today. If you don't use NRT, then the readers are not held open by Lucene.

Mike
Re: How much disk space does optimize really take
Wow, this is weird. I commit before I optimize. In fact, I bounce Tomcat before I optimize, just in case. It makes sense, as you say, that the open reader can then only be holding references to segments that wouldn't be deleted until the optimize is complete anyway. But we're still exceeding 2x. And after the optimize fails, if we then do a commit or bounce Tomcat, a bunch of segments disappear. I am stumped.

Yonik Seeley wrote:
> [...]
> 2x is for the IndexWriter only. Having an open index reader can
> increase that somewhat... 3x is the absolute worst case, I think, and
> that can currently be avoided by first calling commit and then calling
> optimize. This way the open reader will only be holding references to
> segments that wouldn't be deleted until the optimize is complete
> anyway.
>
> -Yonik
> http://www.lucidimagination.com
Re: How much disk space does optimize really take
Oops, sent before finished. Partial optimize (aka maxSegments) is a recent Solr 1.4/Lucene 2.9 feature.

As to 2x vs. 3x, the general wisdom is that an optimize on a simple index takes at most 2x disk space, and on a compound index takes at most 3x. Simple is the default (*). At Divvio we had the same problem, and it never took up more than 2x.

If your index disks are really bursting at the seams, you could try creating an empty index on a separate disk and merging your large index into that index. The resulting index will be mostly optimized.

Lance Norskog

* in solrconfig.xml: <useCompoundFile>false</useCompoundFile>

On 10/7/09, Phillip Farber pfar...@umich.edu wrote:
> Wow, this is weird. I commit before I optimize. In fact, I bounce
> Tomcat before I optimize, just in case. [...] But we're still exceeding
> 2x. And after the optimize fails, if we then do a commit or bounce
> Tomcat, a bunch of segments disappear. I am stumped.

--
Lance Norskog
goks...@gmail.com
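Lance's merge-to-a-spare-disk idea can be sketched directly against Lucene (2.9-era API; both paths are hypothetical): create an empty index on the second disk and merge the big index into it.

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeToSpareDisk {
        public static void main(String[] args) throws Exception {
            Directory source = FSDirectory.open(new File("/index/big"));
            Directory target = FSDirectory.open(new File("/spare/merged"));

            IndexWriter writer = new IndexWriter(target,
                    new StandardAnalyzer(Version.LUCENE_29),
                    true, IndexWriter.MaxFieldLength.UNLIMITED); // create
            writer.addIndexesNoOptimize(new Directory[] { source });
            writer.close(); // target ends up mostly optimized
        }
    }

The write load lands entirely on the spare disk, so the original index's volume never needs the extra headroom.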
Re: How much disk space does optimize really take
On Wed, Oct 7, 2009 at 3:16 PM, Phillip Farber pfar...@umich.edu wrote:
> Wow, this is weird. I commit before I optimize. In fact, I bounce
> Tomcat before I optimize, just in case. It makes sense, as you say,
> that the open reader can then only be holding references to segments
> that wouldn't be deleted until the optimize is complete anyway. But
> we're still exceeding 2x.

How much over 2x? It is possible (though relatively rare) for an optimized index to be larger than a non-optimized index.

-Yonik
http://www.lucidimagination.com
Re: How much disk space does optimize really take
I can't tell why calling a commit or restarting is going to help anything -- or why you need more than 2x in any case. The only reason I can see for this is if you have turned on autocommit. Otherwise the Reader is *always* only referencing what would have to be around anyway. You're likely just too close to the edge. There are fragmentation issues and whatnot when you're dealing with such large files and so little space above what you need.

Phillip Farber wrote:
> Wow, this is weird. I commit before I optimize. In fact, I bounce
> Tomcat before I optimize, just in case. [...] But we're still exceeding
> 2x. And after the optimize fails, if we then do a commit or bounce
> Tomcat, a bunch of segments disappear. I am stumped.

--
- Mark
http://www.lucidimagination.com
Re: How much disk space does optimize really take
Okay -- I think I've got you: you're talking about the case of adding a bunch of docs, not calling commit, and then trying to optimize. I keep coming at it from a cold optimize. Making sense to me now.

Mark Miller wrote:
> I can't tell why calling a commit or restarting is going to help
> anything -- or why you need more than 2x in any case. [...] You're
> likely just too close to the edge. There are fragmentation issues and
> whatnot when you're dealing with such large files and so little space
> above what you need.

--
- Mark
http://www.lucidimagination.com
Re: How much disk space does optimize really take
On Wed, Oct 7, 2009 at 3:31 PM, Mark Miller markrmil...@gmail.com wrote:
> I can't tell why calling a commit or restarting is going to help
> anything

Depends on what scenarios you consider, and what you are taking 2x of.

1) Open a reader on the index.
2) Open a writer and add two documents... the first causes a large merge, and the second is just to make it a non-optimized index. At this point you're already at 2x of your original index size.
3) Call optimize()... this will make a 3rd copy before deleting the 2nd.

-Yonik
http://www.lucidimagination.com
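Plugging in Phillip's numbers as an illustration (assuming the merge in step 2 rewrites roughly the whole 192G index): after step 2 the open reader still pins the original files, so the directory already holds about 2 x 192G = 384G; step 3 then writes a third, optimized copy before the second can be deleted, heading toward 3 x 192G = 576G -- well past a 400G disk.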
Re: How much disk space does optimize really take
Yonik Seeley wrote:
> Depends on what scenarios you consider, and what you are taking 2x of.
>
> 1) Open a reader on the index.
> 2) Open a writer and add two documents... the first causes a large
> merge, and the second is just to make it a non-optimized index. At this
> point you're already at 2x of your original index size.
> 3) Call optimize()... this will make a 3rd copy before deleting the
> 2nd.
>
> -Yonik
> http://www.lucidimagination.com

Yup -- it finally hit me what you were talking about. I wasn't considering the case of adding docs to an existing index, not committing, and then trying to optimize. I like trying to take an opposing side from you anyway -- it means I know where I will end up -- but you're usually so darn terse, I never know how long it will take me to get there.

Anyway, so all you generally *need* is 2x; you just have to make sure you're not adding docs first without committing them -- which I was taking for granted. That means your comment about calling commit makes perfect sense.

I guess you can't guarantee 2x, though: if you have queries coming in that take a while, a commit opening a new Reader does not guarantee the old Reader is quite ready to go away. Might want to wait a short bit after the commit.

--
- Mark
http://www.lucidimagination.com
Re: How much disk space does optimize really take
On Wed, Oct 7, 2009 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote:
> I guess you can't guarantee 2x, though: if you have queries coming in
> that take a while, a commit opening a new Reader does not guarantee the
> old Reader is quite ready to go away. Might want to wait a short bit
> after the commit.

Right -- and in a complete system, there are other things that can also hold commit points open longer, like index replication.

-Yonik
http://www.lucidimagination.com