[jira] Commented: (LUCENE-2755) Some improvements to CMS

Earwin Burrfoot (JIRA) Tue, 16 Nov 2010 09:55:41 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932561#action_12932561
 ]


Earwin Burrfoot commented on LUCENE-2755:
-----------------------------------------

Shai:
bq. The thing is - the second round of merges may only be detected by MP after 
most if not all of the merges from the previous round ended. Therefore, if we 
dim that functionality important (and I do), we must have the MergeThreads 
query MP as well.
Don't turn things upside down. MergeThreads are best hidden from view inside 
Executor, we just feed it chunks of work in the order we want it to be done. 
It's quite enough to have a single place that queries MP and feeds Executor.

bq. About mergeInit, I took a look too and discovered a 140 lines method, so I 
doubt it does only "new segment name registration". But perhaps I'm wrong and 
those are redundant 140 lines ...
These are not redundant. But only a small number of them really need to be 
executed when a merge is added to CMS, everything else can wait until merge 
actually starts.

bq. I think that having a MS entity (or MergeExecutor) is important - it still 
gives an app the ability to override things if it wants to.
Tentatively agree on this.

bq. Also, putting the MS code inside IW will only add code to it, 
But this is not true, with all the cruft thrown away, remaining code size is 
likely on par with current that invokes MS from within IW.

bq. We define MP as the one responsible for returning the merges in the order 
it wants, and provide the necessary support by making OneMerge comparable
You contradict yourself here. If we make OneMerge comparable, we define order 
in its compareTo() method. MP can no longer return the merges in the order it 
wants. I suggest there's no Comparable, and MP does the sort itself and returns 
an ordered list of merges. Better yet - it returns only the first one of them, 
on each request.



Mike:
bq. In fact there is something wrong: without us explicitly scheduling the 
running merges (ie setting thread priorities, stopping the big merges when 
there are too many small ones), CMS will pause the incoming threads.
Why?? Indexing threads can drop merges into a queue, and forget about them. The 
blocking happens only if you explicitly want to use the same thread for 
merging, or if merging threads are lagging - in such a case you're in for 
troubles anyway.
All further analogies with OS scheduling are broken because we're not running 
"ls", we're running background jobs, and don't really care which of them blocks 
which.

{quote}
bq.  Docs say small-sized segments are treated as with LogByteSizeMP.
Hmm... but then how does this differ from setting a maxMergeMB/Docs?
{quote}
BSMP doesn't keep your segments under certain size like maxMergeMB does. It 
ensures you have exactly N 'large' segments. How large they are - depends on 
total size of the index.
It also limits the maximum count of 'small' segments, the size distribution for 
them is the same as with LBSMP.
I think the [docs|http://code.google.com/p/zoie/wiki/ZoieMergePolicy] are 
pretty good.

{quote}
This seems dangerous: what's an "important" merge?

How about, instead, we let MP return all eligible merges (like it does
today) but then we replace all previously buffered but not yet running
merges w/ the new merges it returned? Hmm but this would probably
require giving it access to the buffered-but-not-yet-running merge
set...
{quote}
We already decided that merges should be sorted? And MP gives us a bunch of 
them. Sort the bunch, pick the first one - that's your "important" merge.
By calling this MP.giveMeNextImportantMerge() method repeatedly we free 
ourselves from tons of bookkeeping you just mentioned.

Ho-ho-ho! In fact, we don't need any queues, any buffering, no hard link 
between indexing and merging - no nothing at all :)
Here's my proposal cleaned up and reiterated:

* MP has a single method - getNextMerge() (I'm not taking optimize and friends 
into account now)
** For current policies this method works as described before - MP decides on a 
list of eligible merges, sorts them by some criteria (i.e. smaller merges come 
first) and returns the first.
* MS repeatedly polls MP for merges (in one or more threads) and executes them.
** If getNextMerge() returns null, MS goes to sleep - null signifies that from 
MP's perspective index is already in ideal state.
** When some index-changing events occur, eg - indexing thread adds some docs 
and creates a new segment, MS is woken up and resumes polling/merging.
** rinse, repeat
* In CMS-like scenario, indexing and merging threads are completely decoupled 
and even lack queues that could overflow. So if some lesser merges have to wait 
for the big one - that's not going to bite you at all.
* In SMS-like scenario, polling happens within indexing thread, and everything 
works as it does now.

{quote}
Still I think the other improvements we've talked about here would be great 
steps forward.
It's just that we still need to explicitly schedule the merge threads.
{quote}
As I just described - we don't have to :)
And if you really want to preserve that monster of pausing and priority control 
(largely broken for java anyway), forget about Executors - they don't support 
this.

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got 
> me to read CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
> MergeThreads taking merges from the IndexWriter until they are exhausted, and 
> only then that blocked merge will run. I think it's unnecessary that that 
> merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
> default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
> size segments anymore, I think we should switch the default impl. There are 
> two ways to make it extensible, if we want:
> ** Have an overridable member/method in CMS that you can extend and override 
> - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
> bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
> several places in the code, so more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to 
> read and follow.
> I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2755) Some improvements to CMS

Reply via email to