[jira] Commented: (LUCENE-2755) Some improvements to CMS

Shai Erera (JIRA) Mon, 15 Nov 2010 23:01:41 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932356#action_12932356
 ]


Shai Erera commented on LUCENE-2755:
------------------------------------

I have to say I totally agree w/ Earwin - in my mind, the MP should be the one 
deciding what merges to run, and in what order, the MS should be the one 
executing them. Hack, we should really call MS a MergeExecutor, since it 
doesn't really schedule anything, just does what it's told, based on the 
execution policy (parallel or blocking).

In my app, it's the MP which decides which merges to run, based on their sizes, 
time of the day and allotted time to run. I expect MS to faithfully do what 
it's told and don't play tricks on me (like pausing merges I've asked it to 
run), 'cause otherwise I'll need to write a MS too.

Another question that was brought up here is who should register the merges. 
Today there are two entities - one is the MS which repeatedly calls 
IW.getNextMerge() and in the CMS case, MergeThread does so too. The 
disadvantage of that is that IW.getNextMerge() (and mergeInit()) is called from 
two places, but the advantage is that it allows executing all of MP merges, 
even if they come in several rounds. Example, you have 4 segs of level 0 and 3 
segs of level 1 (mergeFactor=4). MP returns a single merge (4 segs level 0), 
and IW.getNextMerge() returns null 'cause there are no merges left). However, 
after completing this merge, if asked, MP will return a follow on merge of 4 
segs level 1. That will be picked up by the MergeThread that calls 
IW.getNextMerge().

The thing is - the second round of merges may only be detected by MP after most 
if not all of the merges from the previous round ended. Therefore, if we dim 
that functionality important (and I do), we must have the MergeThreads query MP 
as well.

About mergeInit, I took a look too and discovered a 140 lines method, so I 
doubt it does only "new segment name registration". But perhaps I'm wrong and 
those are redundant 140 lines ...

I think that having a MS entity (or MergeExecutor) is important - it still 
gives an app the ability to override things if it wants to. Also, putting the 
MS code inside IW will only add code to it, and I think that we should really 
start refactoring it down to smaller, more readable and focused, pieces. So I'm 
against adding more logic to it. For 3x we can choose to improve things 
internally, or leave them as they are. For 4.0 I'd suggest we do the following:
* Create a new MergeExecutor entity receives an Executor(Service?) to run with, 
but also defaults to one (like CMS is today). That replaces CMS.
** You can control the number of threads it runs with (1 for almost-SMS-like 
behavior and more for CMS).
* We create a BlockingMergeExecutor, which regardless of how many threads you 
allow it to run with, blocks until all merges finish.
** That is an improved SMS - I've always thought that blocking until merges 
finish is not related to how many threads you'd like to execute merges with. 
E.g., if you set CMS's # threads to 1, you get a sort of an SMS behavior, only 
the call is non-blocking.
* We define MP as the one responsible for returning the merges in the order it 
wants, and provide the necessary support by making OneMerge comparable (Earwin, 
MP can still sort by using a custom Comparator, we only provide a default 
comparison method for merges). That definition will be mostly in javadocs.

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got 
> me to read CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
> MergeThreads taking merges from the IndexWriter until they are exhausted, and 
> only then that blocked merge will run. I think it's unnecessary that that 
> merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
> default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
> size segments anymore, I think we should switch the default impl. There are 
> two ways to make it extensible, if we want:
> ** Have an overridable member/method in CMS that you can extend and override 
> - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
> bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
> several places in the code, so more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to 
> read and follow.
> I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2755) Some improvements to CMS

Reply via email to