Re: MergePolicy Thresholds
Thanks Tom!

Sounds like great fun working with such massive data sets :)

Mike

http://blog.mikemccandless.com

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
RE: MergePolicy Thresholds
Hi Mike and Shai,

I was able to index a few documents with the TieredMergePolicy, but I was hoping to build a large test index of about 700,000 documents to compare the performance against our previous runs. I was hoping I would be able to report on my results in time for the Lucene Revolution conference.

Unfortunately there was a power outage at our data center last week, which resulted in a node failure in one of our storage nodes. Node rebalancing for a cluster of 500 terabytes takes quite a while and totally messes up performance measurements. (Our 6-8 terabytes of large scale search indexes share storage with the repository that holds the 480+ terabytes of page images and metadata for the 8 million+ books.)

Hopefully I will be able to run the tests when I get back.

Tom

From: Burton-West, Tom [mailto:tburt...@umich.edu]
Sent: Monday, May 09, 2011 4:10 PM
To: dev@lucene.apache.org
Subject: RE: MergePolicy Thresholds

> Thanks again Shai and Mike.
>
> Am in the process of downloading and building r108. Should be able to build a test index sometime this week. I'll make some guesses on what parameters to use based on our previous tests.
>
> Tom
Re: MergePolicy Thresholds
Hey Tom,

Mike back-ported the changes to 3x, so you can try it out.

FYI,
Shai
Re: MergePolicy Thresholds
Hi

I looked into porting it to 3x, and prepared the attached patch. It only contains the new TieredMP and Test, as well as the necessary changes to LuceneTestCase and IndexWriter. I guess you can start with it (even just the MP and IW changes) to test it on your indexes.

Mike, I saw that there were many more changes, as part of LUCENE-1076, done to the code. In particular, this MP is now the default (on trunk), so I guess many changes (to tests) were needed because of that. Do you remember, if apart from the changes I've included in the patch, other important changes w.r.t. this code?

As we won't change the default MP on 3x, I'm guessing I don't need to port all the changes to 3x.

Shai

On Mon, May 2, 2011 at 9:41 PM, Burton-West, Tom tburt...@umich.edu wrote:
> Hi Shai and Mike,
>
> Testing the TieredMP on our large indexes has been on my todo list since I read Mike's blog post http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html .
>
> If you port it to the 3.x branch Shai, I'll be more than happy to test it with our very large (300GB+) indexes. Besides being able to set the max merged segment size, I'm especially interested in using the maxSegmentsPerTier parameter.
>
> From Mike's blog post:
> "...maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be."
>
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, May 02, 2011 2:19 PM
> To: dev@lucene.apache.org
> Subject: Re: MergePolicy Thresholds
>
> I think it should be an easy port...
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote:
>> Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x?
>>
>> Shai

Attachment: tieredmp.patch (binary data)
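To make the "staircase" description quoted from Mike's blog post more concrete, here is a minimal sketch of how a tiered policy could budget the number of segments an index is allowed to hold. It is a simplified model written for illustration, not the Lucene implementation: the class name, the method name, and the `floorSegmentMB` and `maxMergeAtOnce` parameters are assumptions that do not come from this thread. Each stair holds up to `segmentsPerTier` segments, and each stair's segments are `maxMergeAtOnce` times larger than those of the stair below.

```java
public class TierBudgetSketch {

    /**
     * Hypothetical model: how many segments an index of the given total size
     * may hold before a merge is considered necessary.
     */
    static int allowedSegmentCount(double totalIndexMB, double floorSegmentMB,
                                   int segmentsPerTier, int maxMergeAtOnce) {
        double levelSizeMB = floorSegmentMB; // segment size on the current stair
        double remainingMB = totalIndexMB;   // index bytes not yet assigned to a stair
        int allowed = 0;
        while (true) {
            double segmentsAtLevel = remainingMB / levelSizeMB;
            if (segmentsAtLevel < segmentsPerTier) {
                // The last, partially filled stair.
                allowed += (int) Math.ceil(segmentsAtLevel);
                break;
            }
            // A full stair: its width is segmentsPerTier segments.
            allowed += segmentsPerTier;
            remainingMB -= segmentsPerTier * levelSizeMB;
            // Segments on the next stair are maxMergeAtOnce times larger.
            levelSizeMB *= maxMergeAtOnce;
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Same index size and merge width, two different stair widths.
        System.out.println(allowedSegmentCount(1000, 2, 10, 10));
        System.out.println(allowedSegmentCount(1000, 2, 5, 10));
    }
}
```

In this model, narrowing the stair width lowers the segment budget without touching how many segments are merged at once, which is the decoupling the blog post describes.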
Re: MergePolicy Thresholds
Looks good Shai! Comments below too:

On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
> Hi
>
> I looked into porting it to 3x, and prepared the attached patch. It only contains the new TieredMP and Test, as well as the necessary changes to LuceneTestCase and IndexWriter. I guess you can start with it (even just the MP and IW changes) to test it on your indexes.
>
> Mike, I saw that there were many more changes, as part of LUCENE-1076, done to the code. In particular, this MP is now the default (on trunk), so I guess many changes (to tests) were needed because of that. Do you remember, if apart from the changes I've included in the patch, other important changes w.r.t. this code?

The only other changes I can think of were some verbosity improvements to IndexWriter, to support the python script that can make a merge movie from an infoStream output; but that can wait for when I back-port to 3.x...

> As we won't change the default MP on 3x, I'm guessing I don't need to port all the changes to 3x.

Right, I think.

Mike
Re: MergePolicy Thresholds
Mike, if you want, I can back-port it, as I've already started this when preparing the patch. I noticed that you added a throws IOE to IW.setInfoStream -- is it ok on 3x too? It'll be a backwards-incompatible change.

Maybe we should iterate on the issue? I can reopen.

Shai
Re: MergePolicy Thresholds
That'd be great, thanks :)

Yes, let's iterate on the issue! But: it should still be open, I hope (I didn't mean to close it yet, since it's not back ported)...

Mike

http://blog.mikemccandless.com
Re: MergePolicy Thresholds
Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote:
> I uploaded a patch to LUCENE-1076. Tom, apparently the patch I've attached before cannot be used, because there are dependencies (in earlier commits on LUCENE-1076) that need to be back-ported as well.
>
> So stay tuned on LUCENE-1076 for when it is safe to use this new MP.
>
> Shai
RE: MergePolicy Thresholds
Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom
Re: MergePolicy Thresholds
Have you checked BalancedSegmentMergePolicy? It has some more knobs :)

On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
> Hi
>
> Today, LogMP allows you to set different thresholds for segment sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor).
>
> So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
>
> However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold.
>
> So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor?
>
> Second, should this be an addition to LogMP, or a different type of MP? One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor.
>
> I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things.
>
> What do you think of this? Am I trying to optimize too much? :)
>
> Shai

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
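The threshold arithmetic in Shai's original message can be sketched in a few lines. This is a deliberately simplified model of the behaviour he describes, not LogMP's actual code: the class and method names are hypothetical, and the rule "a segment is a merge candidate only if it is no larger than the threshold" is an assumption taken from his description.

```java
public class LogMergeThresholdSketch {

    /** Approximate largest segment the index will hold: threshold * mergeFactor. */
    static double approxMaxSegmentMB(double maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    /** Simplified rule: segments above the threshold are never picked for merging. */
    static boolean isMergeCandidate(double segmentMB, double maxMergeMB) {
        return segmentMB <= maxMergeMB;
    }

    public static void main(String[] args) {
        double maxMergeMB = 2 * 1024; // the 2GB threshold from the example
        int mergeFactor = 10;

        // Roughly 20GB: ten segments of at most 2GB merged together.
        System.out.println(approxMaxSegmentMB(maxMergeMB, mergeFactor) / 1024 + " GB");

        // The 5GB and 7GB segments exceed the threshold, so under this model
        // they are left alone even though 5 + 7 = 12GB is well under 20GB.
        System.out.println(isMergeCandidate(5 * 1024, maxMergeMB));
        System.out.println(isMergeCandidate(7 * 1024, maxMergeMB));
    }
}
```

The gap Shai points at is visible here: the threshold limits the inputs of a merge, while the size he actually cares about is that of the output.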
Re: MergePolicy Thresholds
I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what the right combination is.

Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan.

Shai
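The two-threshold idea (a cap on the result segment size plus mergeFactor) could look roughly like the sketch below. It is a greedy approximation written for illustration only: it is not BalancedMP's findBalancedMerges logic and not a proposed patch, and the class and method names are hypothetical. Only `maxResultSegmentSizeMB` and `mergeFactor` are names taken from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SizeCappedMergeSketch {

    /**
     * Greedily picks segments (largest first, not in sequential order) so that
     * the merged result stays within maxResultSegmentSizeMB and at most
     * mergeFactor segments are merged. Returns an empty list if fewer than
     * two segments fit, since there is then nothing to merge.
     */
    static List<Long> pickMerge(List<Long> segmentSizesMB,
                                long maxResultSegmentSizeMB, int mergeFactor) {
        List<Long> sorted = new ArrayList<>(segmentSizesMB);
        sorted.sort(Collections.reverseOrder());

        List<Long> picked = new ArrayList<>();
        long totalMB = 0;
        for (long sizeMB : sorted) {
            if (picked.size() == mergeFactor) {
                break; // never merge more than mergeFactor segments at once
            }
            if (totalMB + sizeMB <= maxResultSegmentSizeMB) {
                picked.add(sizeMB);
                totalMB += sizeMB;
            }
        }
        return picked.size() >= 2 ? picked : new ArrayList<>();
    }

    public static void main(String[] args) {
        long capMB = 20L * 1024; // the 20GB target from the thread

        // The 5GB and 7GB segments are merged here, because only the result size is capped.
        System.out.println(pickMerge(Arrays.asList(5L * 1024, 7L * 1024), capMB, 10));

        // A segment already at the cap is left to "rest peacefully".
        System.out.println(pickMerge(Arrays.asList(20L * 1024, 5L * 1024), capMB, 10));
    }
}
```

A greedy pass does not guarantee the largest possible result under the cap; an exact choice is a knapsack-style problem, so a real policy would need to decide how much effort that is worth.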
Re: MergePolicy Thresholds
Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require recalculating thresholds when my expected index size changes.

The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
Re: MergePolicy Thresholds
> The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/

I agree. I wonder though if the knobs we give on LogMP are intuitive enough.

> It neatly avoids uber-merges

I didn't see that I can define what an uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create a 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore. Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)?

Shai

On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination. Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy?
It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP? One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much?
:) Shai
Re: MergePolicy Thresholds
Actually the new TieredMergePolicy (only on trunk currently but I plan to backport for 3.2) lets you set the max merged segment size (maxMergedSegmentMB). It's only an estimate, but if it's set, it tries to pick a merge reaching around that target size. Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP? One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP.
Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai
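The selection rule proposed in the quoted message (maximize the merged size without exceeding a cap, and take no more than mergeFactor segments) could look roughly like the greedy sketch below. It is an illustration under stated assumptions only: the names are invented, the greedy largest-first strategy is one possible choice rather than an optimal one, and it is not how TieredMergePolicy actually selects or scores merges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Lucene code: greedy, largest-first selection under a size cap.
final class SizeCappedMergeSketch {

    /**
     * Picks up to mergeFactor segments whose combined size stays at or below
     * maxMergedSegmentMB. Returns an empty list when fewer than two segments fit,
     * since "merging" a single segment gains nothing.
     */
    static List<Long> pickMerge(List<Long> segmentSizesMB, long maxMergedSegmentMB, int mergeFactor) {
        List<Long> sorted = new ArrayList<>(segmentSizesMB);
        Collections.sort(sorted, Collections.reverseOrder()); // largest first, not index order
        List<Long> picked = new ArrayList<>();
        long totalMB = 0;
        for (long sizeMB : sorted) {
            if (picked.size() == mergeFactor) {
                break;
            }
            if (totalMB + sizeMB <= maxMergedSegmentMB) {
                picked.add(sizeMB);
                totalMB += sizeMB;
            }
        }
        return picked.size() >= 2 ? picked : Collections.<Long>emptyList();
    }

    public static void main(String[] args) {
        long capMB = 20 * 1024; // 20GB target from the thread
        // The 5GB and 7GB segments are merged here, since 12GB is still under the cap
        System.out.println(pickMerge(Arrays.asList(5120L, 7168L), capMB, 10)); // [7168, 5120]
        // 12GB + 7GB fit, 5GB would overshoot, 1GB fills the cap exactly
        System.out.println(pickMerge(Arrays.asList(5120L, 7168L, 1024L, 12288L), capMB, 10)); // [12288, 7168, 1024]
    }
}
```

A segment that already sits at the cap never fits together with another one, so it is left alone, which matches the "it can rest peacefully" behaviour asked for in the original message.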
Re: MergePolicy Thresholds
Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai On Mon, May 2, 2011 at 6:34 PM, Michael McCandless luc...@mikemccandless.com wrote: Actually the new TieredMergePolicy (only on trunk currently but I plan to backport for 3.2) lets you set the max merged segment size (maxMergedSegmentMB). It's only an estimate, but if it's set, it tries to pick a merge reaching around that target size. Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP? One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor.
I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai
Re: MergePolicy Thresholds
I think it should be an easy port... Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote: Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai On Mon, May 2, 2011 at 6:34 PM, Michael McCandless luc...@mikemccandless.com wrote: Actually the new TieredMergePolicy (only on trunk currently but I plan to backport for 3.2) lets you set the max merged segment size (maxMergedSegmentMB). It's only an estimate, but if it's set, it tries to pick a merge reaching around that target size. Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP? One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges).
It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai
RE: MergePolicy Thresholds
Hi Shai and Mike, Testing the TieredMP on our large indexes has been on my todo list since I read Mike's blog post http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html. If you port it to the 3.x branch, Shai, I'll be more than happy to test it with our very large (300GB+) indexes. Besides being able to set the max merged segment size, I'm especially interested in using the maxSegmentsPerTier parameter. From Mike's blog post: "...maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be." Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, May 02, 2011 2:19 PM To: dev@lucene.apache.org Subject: Re: MergePolicy Thresholds I think it should be an easy port... Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote: Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any way, or do you think it can easily be ported to 3x? Shai
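To make the "staircase" wording from the blog excerpt concrete, here is a rough sketch of how a segment-count budget might be derived when the stair width (segments per tier) is set separately from the number of segments merged at once. It is only an illustration: the class, the method, the floorSegmentMB and maxMergeAtOnce parameters and the example numbers are assumptions made for this sketch, and the loop is not taken from TieredMergePolicy's source.

```java
// Hypothetical sketch, not Lucene code: counts how many segments a staircase-shaped
// index may hold before a merge is needed.
final class TierBudgetSketch {

    /**
     * Walks up the staircase: each stair holds up to segmentsPerTier segments, and the
     * segment size on the next stair is maxMergeAtOnce times larger than on the current one.
     */
    static int allowedSegmentCount(long totalIndexMB, long floorSegmentMB,
                                   int maxMergeAtOnce, int segmentsPerTier) {
        if (floorSegmentMB <= 0 || maxMergeAtOnce < 2 || segmentsPerTier < 1) {
            throw new IllegalArgumentException("sizes and factors must be positive");
        }
        long levelSizeMB = floorSegmentMB;
        long remainingMB = totalIndexMB;
        int allowed = 0;
        while (true) {
            double segmentsAtLevel = remainingMB / (double) levelSizeMB;
            if (segmentsAtLevel < segmentsPerTier) {
                // Top stair: only partially filled
                return allowed + (int) Math.ceil(segmentsAtLevel);
            }
            allowed += segmentsPerTier;
            remainingMB -= (long) segmentsPerTier * levelSizeMB;
            levelSizeMB *= maxMergeAtOnce;
        }
    }

    public static void main(String[] args) {
        // Same merge width (10 at a time), different stair widths
        System.out.println(allowedSegmentCount(1000, 10, 10, 5));  // 11
        System.out.println(allowedSegmentCount(1000, 10, 10, 10)); // 19
    }
}
```

With the merge width held at 10, narrowing the stairs from 10 to 5 lowers the budget from 19 to 11 segments in this toy model, which is the decoupling the excerpt describes.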
Re: MergePolicy Thresholds
>> The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/

> I agree. I wonder though if the knobs we give on LogMP are intuitive enough.

>> It neatly avoids uber-merges

> I didn't see that I can define what an uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create a 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore.

No, you can't. But you can tell it to have exactly (not 'at most') N top-tier segments and try to keep their sizes close with merges. Whatever that size may be. And this is exactly what I want. And defining a max cap on segment size is not what I want. So the same set of knobs can be intuitive and meaningful for one person, and useless for another. And you can't pick the best one.

> Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)? Shai

On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination.
Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than the current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depends on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP? One that adheres to only those two factors (perhaps the segSize threshold should be allowed to be set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor.
I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai