>> The problem is - each person needs his own set of knobs (or thinks he
>> needs them) for MergePolicy, and I can't call any of these sets
>> superior to others :/
>
> I agree. I wonder though if the knobs we give on LogMP are intuitive enough.
>
>> It neatly avoids uber-merges
>
> I didn't see that I can define what "uber-merge" is, right? Can I tell it to
> stop merging segments of some size? E.g., if my index grew to 100 segments,
> 40GB each, I don't think that merging 10 40GB segments (to create a 400GB
> segment) is going to speed up my search, for instance. A 40GB segment
> (probably much less) is already big enough to not be touched anymore.

No, you can't. But you can tell it to have exactly (not 'at most') N
top-tier segments and try to keep their sizes close with merges,
whatever that size may be. And this is exactly what I want. And
defining a max cap on segment size is not what I want.
So the same set of knobs can be intuitive and meaningful for one person,
and useless for another. And you can't pick the "best" one.

> Will BalancedMP stop merging such segments (if all segments are of that
> order of magnitude)?
>
> Shai
>
> On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot <ear...@gmail.com> wrote:
>>
>> Dunno, I'm quite happy with numLargeSegments (you critically
>> misspelled it). It neatly avoids uber-merges, keeps the number of
>> segments at bay, and does not require recalculating thresholds when
>> my expected index size changes.
>>
>> The problem is - each person needs his own set of knobs (or thinks he
>> needs them) for MergePolicy, and I can't call any of these sets
>> superior to others :/
>>
>> 2011/5/2 Shai Erera <ser...@gmail.com>:
>> > I did look at it, but I didn't find that it answers this particular need
>> > (ending with a segment no bigger than X). Perhaps by tweaking several
>> > parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can
>> > achieve something, but it's not very clear what is the right combination.
>> >
>> > Which is related to one of the points -- is it not more intuitive for an
>> > app to set this threshold (if it needs any thresholds), than tweaking all
>> > of those parameters? If so, then we only need two thresholds (size +
>> > mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
>> > (perhaps w/ some adaptations) to derive a merge plan.
>> >
>> > Shai
>> >
>> > On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot <ear...@gmail.com>
>> > wrote:
>> >>
>> >> Have you checked BalancedSegmentMergePolicy?
>> >> It has some more knobs :)
>> >>
>> >> On Mon, May 2, 2011 at 17:03, Shai Erera <ser...@gmail.com> wrote:
>> >> > Hi
>> >> >
>> >> > Today, LogMP allows you to set different thresholds for segment
>> >> > sizes, thereby allowing you to control the largest segment that will
>> >> > be considered for merge + the largest segment your index will hold
>> >> > (=~ threshold * mergeFactor).
>> >> >
>> >> > So, if you want to end up w/ say 20GB segments, you can set
>> >> > maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
>> >> >
>> >> > However, this often does not achieve your desired goal -- if the
>> >> > index contains 5 and 7 GB segments, they will never be merged b/c
>> >> > they are bigger than the threshold. I am willing to spend the CPU
>> >> > and IO resources to end up w/ 20 GB segments, whether I'm merging 10
>> >> > segments together or only 2. After I reach a 20GB segment, it can
>> >> > rest peacefully, at least until I increase the threshold.
>> >> >
>> >> > So I wonder, first, if this threshold (i.e., largest segment size you
>> >> > would like to end up with) is more natural to set than the current
>> >> > thresholds, from the application level? I.e., wouldn't it be a
>> >> > simpler threshold to set instead of doing weird calculus that
>> >> > depends on maxMergeMB(ForOptimize) and mergeFactor?
>> >> >
>> >> > Second, should this be an addition to LogMP, or a different type of
>> >> > MP, one that adheres to only those two factors (perhaps the segSize
>> >> > threshold should be allowed to be set differently for optimize and
>> >> > regular merges)? It can pick segments for merge such that it
>> >> > maximizes the result segment size (i.e., don't necessarily merge in
>> >> > sequential order), but not more than mergeFactor.
>> >> >
>> >> > I guess, if we think that maxResultSegmentSizeMB is more intuitive
>> >> > than the current thresholds, application-wise, then this change
>> >> > should go into LogMP. Otherwise, it feels like a different MP is
>> >> > needed, because LogMP is already complicated and another threshold
>> >> > would confuse things.
>> >> >
>> >> > What do you think of this? Am I trying to optimize too much? :)
>> >> >
>> >> > Shai
>> >>
>> >> --
>> >> Kirill Zakharenko/Кирилл Захаренко
>> >> E-Mail/Jabber: ear...@gmail.com
>> >> Phone: +7 (495) 683-567-4
>> >> ICQ: 104465785
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
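
For illustration only, the maxResultSegmentSizeMB idea from the original message (a cap on the resulting segment size plus mergeFactor) could be sketched roughly as below. This is a self-contained toy, not Lucene code: the class MaxResultSizeSketch, the method pickMerge, and the greedy largest-first strategy are all assumptions made up for this example, and segments are represented as plain sizes in MB. A real policy would likely need a smarter selection than this greedy pass.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names exist in Lucene.
class MaxResultSizeSketch {

    /**
     * Picks between 2 and mergeFactor segments whose combined size does not
     * exceed maxResultSegmentSizeMB, preferring larger segments first.
     * Segments already at or above the cap are left alone.
     * Returns an empty list when no such merge exists.
     */
    static List<Long> pickMerge(List<Long> segmentSizesMB,
                                long maxResultSegmentSizeMB,
                                int mergeFactor) {
        // Segments that reached the cap "rest peacefully" and are skipped.
        List<Long> candidates = new ArrayList<>();
        for (long size : segmentSizesMB) {
            if (size < maxResultSegmentSizeMB) candidates.add(size);
        }
        candidates.sort(Collections.reverseOrder()); // largest first

        // Greedy heuristic (an assumption, not an optimal packing): try each
        // candidate as the starting segment and add smaller ones that fit.
        for (int start = 0; start < candidates.size(); start++) {
            List<Long> picked = new ArrayList<>();
            long total = 0;
            for (int i = start; i < candidates.size() && picked.size() < mergeFactor; i++) {
                long size = candidates.get(i);
                if (total + size <= maxResultSegmentSizeMB) {
                    picked.add(size);
                    total += size;
                }
            }
            if (picked.size() >= 2) return picked; // a merge needs at least two segments
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // A finished 20 GB segment, the 5 GB and 7 GB segments from the thread,
        // and one small 512 MB segment; cap 20 GB, mergeFactor 10.
        List<Long> sizes = List.of(20_480L, 7_168L, 5_120L, 512L);
        System.out.println(pickMerge(sizes, 20_480L, 10)); // prints [7168, 5120, 512]
    }
}
```

With a 20 GB cap, the 5 GB and 7 GB segments are picked for merging even though both exceed a 2 GB maxMergeMB-style threshold, while the 20 GB segment is not touched. That is the behaviour the original message asks for; whether it belongs in LogMP or in a separate policy is the open question of the thread.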