[
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932266#action_12932266
]
Earwin Burrfoot commented on LUCENE-2755:
-----------------------------------------
bq. But then you accumulate too many tiny merges, while waiting for the big one
to finish?
You say this as if it were something terribly wrong. :)
Big merges aren't heffalumps; they don't usually stalk IW in droves. A big merge
ends sooner or later, and the tiny ones go out in a flash.
bq. Maybe we should move BSMP to core and make it the default?
Dunno. The index you end up with is larger than with LogWhateverMP.
But you get a nice benefit of having roughly equal-sized big segments, which is
cool for running collection in parallel.
Everyone has his own requirements.
bq. But I don't fully understand how it chooses merges. EG does it pick
lopsided merges (where the segments differ substantially in size), as long as
they are "small" segments?
The docs say small segments are treated the same way as with LogByteSizeMP.
Another thought I had while looking through the code: we have a seriously
inefficient "merge conflict" resolution algorithm on our hands.
We just damn drop all new merges that have segments in common with the merges
already queued (but not yet running!!).
What does that mean?
Imagine we're producing a slew of mini-segments with decent speed and our
MergeScheduler is lagging behind:
* new seg1
* new seg2
* queue merge seg1+seg2
* start merge seg1+seg2
* new seg3
* new seg4
* queue merge seg3+seg4
* new seg5
* FAIL queue merge seg3+seg4+seg5
* new seg6
* FAIL queue merge seg3+seg4+seg5+seg6
* finish merge seg1+seg2
* start merge seg3+seg4
By that point we should really be merging all four of the last segments (maybe
together with the result of seg1+seg2).
But in reality we'll merge seg3+seg4, then seg5+seg6, and then all three merge
results together (provided no new mini-segments are added).
If we throw large merges into the mix (whether pausable or not), the situation
is amplified.
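To make the problem concrete, here is a minimal sketch of that conflict check,
with made-up types (a plain list of segment names instead of the real
OneMerge/SegmentInfos) - it is not the actual IndexWriter code, just the behavior
described above: a suggested merge is dropped wholesale as soon as it shares a
single segment with any already-registered merge, even one that has not started
running.
{code}
import java.util.*;

// Hypothetical, simplified model of merge registration; the real Lucene
// classes (IndexWriter, MergePolicy.OneMerge) look different.
class MergeQueue {
  static class OneMerge {
    final List<String> segments;                   // names of the segments to merge
    OneMerge(String... segs) { segments = Arrays.asList(segs); }
  }

  final Set<String> segmentsInFlight = new HashSet<String>();   // queued or running
  final Deque<OneMerge> pending = new ArrayDeque<OneMerge>();   // queued, not yet running

  /** Reject any suggested merge that overlaps an already-registered merge,
   *  even one that has not started yet - the "FAIL queue merge ..." lines above. */
  boolean register(OneMerge merge) {
    for (String seg : merge.segments) {
      if (segmentsInFlight.contains(seg)) {
        return false;
      }
    }
    segmentsInFlight.addAll(merge.segments);
    pending.add(merge);
    return true;
  }
}
{code}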
Ugly solution - when the MP suggests a merge that is a strict superset of a
queued, but not yet running, merge - drop the old one and use the new one.
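Continuing the same made-up MergeQueue sketch (hypothetical code, not a patch),
the superset trick could look roughly like this:
{code}
  /** "Ugly" variant: a suggestion that strictly contains one or more queued
   *  (not yet running) merges evicts them instead of being dropped itself. */
  boolean registerOrUpgrade(OneMerge merge) {
    // Find queued merges that the new suggestion strictly subsumes.
    List<OneMerge> subsumed = new ArrayList<OneMerge>();
    Set<String> covered = new HashSet<String>();
    for (OneMerge queued : pending) {
      if (merge.segments.containsAll(queued.segments)
          && merge.segments.size() > queued.segments.size()) {
        subsumed.add(queued);
        covered.addAll(queued.segments);
      }
    }
    // Any other overlap (e.g. with a merge that is already running) still
    // means the suggestion has to be rejected, as before.
    for (String seg : merge.segments) {
      if (segmentsInFlight.contains(seg) && !covered.contains(seg)) {
        return false;
      }
    }
    for (OneMerge old : subsumed) {        // drop the stale, smaller merges
      pending.remove(old);
      segmentsInFlight.removeAll(old.segments);
    }
    return register(merge);                // now guaranteed to succeed
  }
{code}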
Better solution - instead of asking the MP for all the merges it deems reasonable
on the current index, we ask it only for the "most important" one, and we do so
each time the MS has an open slot for execution. This way each merge that happens
is the best merge possible at that moment.
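A rough sketch of that pull-style approach, still on the same hypothetical
MergeQueue (findMostImportantMerge, startMergeThread and friends are invented
names, not existing Lucene API): whenever a merge-thread slot frees up, the
scheduler asks the policy for the single best merge on the index as it stands
right now.
{code}
  /** Invented collaborator: returns the single best merge on the current index,
   *  avoiding in-flight segments, or null if nothing is worth merging right now. */
  interface PullMergePolicy {
    OneMerge findMostImportantMerge(List<String> currentSegments, Set<String> inFlight);
  }

  /** Called whenever a merge thread finishes or new segments get flushed. */
  void maybeStartMerges(PullMergePolicy policy, List<String> currentSegments,
                        int maxThreadCount, int runningMerges) {
    while (runningMerges < maxThreadCount) {
      OneMerge best = policy.findMostImportantMerge(currentSegments, segmentsInFlight);
      if (best == null || !register(best)) {
        break;                        // nothing (more) worth merging at the moment
      }
      startMergeThread(best);         // each started merge is the best choice right now
      runningMerges++;
    }
  }

  void startMergeThread(OneMerge merge) {
    // Placeholder - actually executing the merge is out of scope for this sketch.
  }
{code}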
Please correct my wrongs, if any.
> Some improvements to CMS
> ------------------------
>
> Key: LUCENE-2755
> URL: https://issues.apache.org/jira/browse/LUCENE-2755
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Shai Erera
> Assignee: Shai Erera
> Priority: Minor
> Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got
> me to read CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the
> MergeThreads taking merges from the IndexWriter until they are exhausted, and
> only then will that blocked merge run. I think it's unnecessary for that merge
> to be blocked.
> * CMS sorts merges by segment size, doc-based and not bytes-based. Since the
> default MP is LogByteSizeMP, and I hardly believe people care about doc-based
> segment sizes anymore, I think we should switch the default impl. There are
> two ways to make it extensible, if we want:
> ** Have an overridable member/method in CMS that you can extend and override
> - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by
> bytes, docs, calibrate deletes etc.). Better, but will need to tap into
> several places in the code, so more risky and complicated.
> Along the way, I'd like to add some documentation to CMS - it's not very easy
> to read and follow.
> I'll work on a patch.