[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935327#action_12935327
 ] 

Shai Erera commented on LUCENE-2755:
------------------------------------

There are several points addressed by this issue:
* Refactor IW, MS and MP so that MS pulls merges directly from MP, instead from 
IW.
* Rewrite CMS to take advantage of ThreadPoolExecutor instead of managing the 
threads on its own, in addition to using a blocking queue instead of us coding 
the blocking directly.
* One should be able to reuse CMS across several IndexWriters, which is not 
possible today, to e.g., allow one controlling the total # of merges happening 
in the JVM.
* Merges should be sorted by their size in bytes and not by their # of docs -- 
or actually, merges should be sorted by a criteria the MP determines.

All the while maintaining the following requirements, in no particular order:
* MergeThreads' priority needs to be controllable - this is the current 
behavior of CMS that we'd like to keep.
* When there are too many merges to execute, small ones should be preferred to 
large ones, and we need the ability to pause large merges in favor of small 
ones.
* The user needs to be able to control:
*# The max number of running merges
*# The max number of merges, above which scheduling more merges should be 
blocked.
* We should keep the sync() API, which lets the user wait for all scheduled 
merges to complete.
* The MP needs to be aware of the type of merges that are requested (regular, 
optimize, expunge).
* The user should be able to fast-close the index, aborting all merges (running 
and pending).
* If there are cascading merges (i.e., a result of several other merges), they 
should all be executed following the call to MS.merge() -- that is, it could be 
that CMS itself, or its MergeThreads will encounter merges not returned by MP 
at first, but as a subsequent round due to changes done to the index.

After investigating the code and going over the proposed plan, I feel that we 
cannot accomplish all that we'd like to do, given the above requirements. And 
just to be clear, those are not *any* application requirements, but the default 
ones we'd like Lucene to offer OOtB. One can still write a MS/MP which doesn't 
guarantee all that.

Using ThreadPoolExecutor looks like will only complicate CMS instead of 
simplifying it:
# Because of all the requirements I've listed above, we'd need to trick TPE 
into starting more threads than we intend to run, while we pause some of them.
# In order to set threads' priorities, we need to write our own ThreadFactory 
to pass to TPE, and inside it keep track of the allocated Threads and those 
that still run, so that we can control their priority.
# There's no trivial way to impl sync(), as TPE does not provide API we can 
rely on (e.g. checking when all threads are done). There are ways to impl that, 
using Semaphore (see next bullet).
# In order to block the app on too many merges being scheduled, we'd need to 
use a Semaphore, because even if we use a BlockingQueue w/ TPE, the submit() 
call won't block, but simply reject the item.
# In order to execute cascading merges as well, we'd need the TPE threads (or 
the Runnable we submit) to poll getNextMerge() until null is returned. This 
breaks the concepts of Executors, where a task is submitted, done and that's 
it. I wouldn't mind if we did it still though.

All of these make me think that TPE, at this point at least, is not suitable 
for CMS. While it's doable, it's not going to make CMS code any simpler. And it 
looks more as if we'll enforce TPE in CMS, than it is really useful.

Refactoring IW, MS and MP seems to not simplify anything, really:
# We'd still need IW telling MP which merges are needed (optimize, regular or 
expunge), so the three findMergesForXYZ will need to remain.
# The proposal will add a getNextMerge() to MP, instead of IW, which IMO will 
only complicate matters for MP implementers. E.g., what should MP do if 
findRegularMerges was called, then getNext() was called and then 
findOptimizeMerges is called? It's not a critical decision we leave in the MP 
developers, but IMO it's unnecessary. Today MP is a stateless object - it 
receives SegmentInfos and returns a MergeSpec. It doesn't need to 'remember' 
anything. But if we move the getNextMerge() to it, we make it stateful, for no 
good reasons
# We don't really take IW outside the loop really - it would still need to 
instruct MP which merges to 'prepare', so that MS can take.
# We'd need to introduce an abort/cancel() API on MS, which adds another 
responsibility for MS, but doesn't remove much from IW.

All in all, I don't think this refactoring simplifies IW-MS-MP communication a 
lot or allows custom MS and MPs have more control over what's going on. IW, as 
the mediator, is nothing but the mediator, which happens to know (as it should) 
which merges finish and if the state of the index changed, ask MP for more 
merges. That that it keeps track of pending and running merges, to me, is not a 
big deal. In fact, due to IW.waitForMerges() it either must continue to keep 
track of pending/running merges, or we add a sync() API to every MS.

Fixing CMS to allow sharing across multiple IndexWriter instances is important 
IMO, so I'll look into fixing it. To allow for MP dependent sort, I suggest we 
add to MP a getMergesComparator and use it in CMS. The default (to not break 
back-compat) can be ByDocsComparator and we override it in the existing MPs.

I must say that the more I went over the details, the more I was convinced that 
the proposals made will change the current API for no great benefits. But I may 
have looked too deeply into the impl, that I lost the ability to think about it 
'from above' - so I'd appreciate if someone can go over what I wrote and offer 
comments :-).

> Some improvements to CMS
> ------------------------
>
>                 Key: LUCENE-2755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2755
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> While running optimize on a large index, I've noticed several things that got 
> me to read CMS code more carefully, and find these issues:
> * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
> MergeThreads taking merges from the IndexWriter until they are exhausted, and 
> only then that blocked merge will run. I think it's unnecessary that that 
> merge will be blocked.
> * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
> default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
> size segments anymore, I think we should switch the default impl. There are 
> two ways to make it extensible, if we want:
> ** Have an overridable member/method in CMS that you can extend and override 
> - easy.
> ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
> bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
> several places in the code, so more risky and complicated.
> On the go, I'd like to add some documentation to CMS - it's not very easy to 
> read and follow.
> I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to