[jira] Commented: (LUCENE-2755) Some improvements to CMS

2011-01-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983359#action_12983359
 ] 

Robert Muir commented on LUCENE-2755:
-

Mike, fyi it looks like we are hung again in hudson:
https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/3866/

Not sure if its the same deadlock you found.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2755.patch


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2011-01-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982564#action_12982564
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

bq. if you still want to work on it, the I can keep the issue open and mark it 
3.2 (unless you want to give it a try in 3.1). 
I'll start another later, so please, go on.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2755.patch


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-12-02 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12966029#action_12966029
 ] 

Shai Erera commented on LUCENE-2755:


Getting rid of the IndexWriter member in CMS is not trivial w/o API change. The 
IndexWriter member is used for verbosing purposes, which is accessed by some 
public/protected API, like sync() and doMerge(). So on 3x it'd mean to 
deprecate methods, which IMO does not justify it. On trunk it's easier.

On the other hand, we can consider two things:
# Make IndexWriter ThreadLocal -- the thread who calls merge() will own its 
ThreadLocal, and if different threads index to different indexes, then it 
should work. But it won't help in case the indexing threads are taken from a 
pool.
# Think whether we want CMS to be shareable across several IndexWriters at all. 
I haven't heard that requirement coming up on the list, and definitely if 
someone attempted to do it, things would break, so I guess no one really does 
it. Therefore maybe we should leave it to the users to develop something like 
that on their own, and maybe even contribute back. A MS which might even be 
simplified by not implementing all of CMS functionality today (controlling 
threads' priority, pause merge threads etc.).

bq, The proper route is to take a handful of dirt and sticks and slap together 
some working code to illustrate my point. And that's what I'm gonna do.

It'd be great if you will do that ! Sometimes it's indeed easier to fight over 
a concrete example then theoretical can and can't work arguments.

bq. MP will recieve SegmentInfos and return OneMerge.

From whom will it receive it? In the case of cascading merges, the merge 
threads need to continuously pull MP for getNextMerge(MergeType), but they 
don't have the global picture IW holds about the existing segments 
(SegmentInfos). Also, IW keeps track of the segments that existed when you 
first called optimize() and doesn't allow the cascading merges to include 
segments that didn't exist at the time. Who will do that accounting now?

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-24 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935327#action_12935327
 ] 

Shai Erera commented on LUCENE-2755:


There are several points addressed by this issue:
* Refactor IW, MS and MP so that MS pulls merges directly from MP, instead from 
IW.
* Rewrite CMS to take advantage of ThreadPoolExecutor instead of managing the 
threads on its own, in addition to using a blocking queue instead of us coding 
the blocking directly.
* One should be able to reuse CMS across several IndexWriters, which is not 
possible today, to e.g., allow one controlling the total # of merges happening 
in the JVM.
* Merges should be sorted by their size in bytes and not by their # of docs -- 
or actually, merges should be sorted by a criteria the MP determines.

All the while maintaining the following requirements, in no particular order:
* MergeThreads' priority needs to be controllable - this is the current 
behavior of CMS that we'd like to keep.
* When there are too many merges to execute, small ones should be preferred to 
large ones, and we need the ability to pause large merges in favor of small 
ones.
* The user needs to be able to control:
*# The max number of running merges
*# The max number of merges, above which scheduling more merges should be 
blocked.
* We should keep the sync() API, which lets the user wait for all scheduled 
merges to complete.
* The MP needs to be aware of the type of merges that are requested (regular, 
optimize, expunge).
* The user should be able to fast-close the index, aborting all merges (running 
and pending).
* If there are cascading merges (i.e., a result of several other merges), they 
should all be executed following the call to MS.merge() -- that is, it could be 
that CMS itself, or its MergeThreads will encounter merges not returned by MP 
at first, but as a subsequent round due to changes done to the index.

After investigating the code and going over the proposed plan, I feel that we 
cannot accomplish all that we'd like to do, given the above requirements. And 
just to be clear, those are not *any* application requirements, but the default 
ones we'd like Lucene to offer OOtB. One can still write a MS/MP which doesn't 
guarantee all that.

Using ThreadPoolExecutor looks like will only complicate CMS instead of 
simplifying it:
# Because of all the requirements I've listed above, we'd need to trick TPE 
into starting more threads than we intend to run, while we pause some of them.
# In order to set threads' priorities, we need to write our own ThreadFactory 
to pass to TPE, and inside it keep track of the allocated Threads and those 
that still run, so that we can control their priority.
# There's no trivial way to impl sync(), as TPE does not provide API we can 
rely on (e.g. checking when all threads are done). There are ways to impl that, 
using Semaphore (see next bullet).
# In order to block the app on too many merges being scheduled, we'd need to 
use a Semaphore, because even if we use a BlockingQueue w/ TPE, the submit() 
call won't block, but simply reject the item.
# In order to execute cascading merges as well, we'd need the TPE threads (or 
the Runnable we submit) to poll getNextMerge() until null is returned. This 
breaks the concepts of Executors, where a task is submitted, done and that's 
it. I wouldn't mind if we did it still though.

All of these make me think that TPE, at this point at least, is not suitable 
for CMS. While it's doable, it's not going to make CMS code any simpler. And it 
looks more as if we'll enforce TPE in CMS, than it is really useful.

Refactoring IW, MS and MP seems to not simplify anything, really:
# We'd still need IW telling MP which merges are needed (optimize, regular or 
expunge), so the three findMergesForXYZ will need to remain.
# The proposal will add a getNextMerge() to MP, instead of IW, which IMO will 
only complicate matters for MP implementers. E.g., what should MP do if 
findRegularMerges was called, then getNext() was called and then 
findOptimizeMerges is called? It's not a critical decision we leave in the MP 
developers, but IMO it's unnecessary. Today MP is a stateless object - it 
receives SegmentInfos and returns a MergeSpec. It doesn't need to 'remember' 
anything. But if we move the getNextMerge() to it, we make it stateful, for no 
good reasons
# We don't really take IW outside the loop really - it would still need to 
instruct MP which merges to 'prepare', so that MS can take.
# We'd need to introduce an abort/cancel() API on MS, which adds another 
responsibility for MS, but doesn't remove much from IW.

All in all, I don't think this refactoring simplifies IW-MS-MP communication a 
lot or allows custom MS and MPs have more control over what's going on. IW, as 
the mediator, is nothing but the mediator, which happens to know (as it should) 
which merges finish and if the state 

[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-24 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935400#action_12935400
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

bq. Refactor IW, MS and MP so that MS pulls merges directly from MP, instead 
from IW.
Directly or through IW - this is not important. Important point is pulling 
merges one-by-one, when you have the resources to execute them.

bq. Rewrite CMS to take advantage of ThreadPoolExecutor instead of managing the 
threads on its own, in addition to using a blocking queue instead of us coding 
the blocking directly.
bq. Using ThreadPoolExecutor looks like will only complicate CMS instead of 
simplifying it:
I ended up with same conclusion, while taking my first stabs. But for different 
reasons.
The philosphy of Executors is that you schedule (push) a number of tasks, and 
then some magic black box runs them for you, resolving threading issues itself.
My suggestion requires pulling tasks when computing resources become available, 
and that doesn't map on scheduling model at all.
All priority/pausing/breaking issues are largely irrelevant.

bq. MergeThreads' priority needs to be controllable, and we need the ability to 
pause large merges in favor of small ones
These, and the likes - are not requirements.
These are but one of the possible solutions to our real requirements, which 
look like
* don't run out of file handles on fast indexation
* don't degrade search performance and NRT turnaround
* don't kill the disk with too much random IOs.

bq. If there are cascading merges (i.e., a result of several other merges), 
they should all be executed following the call to MS.merge() - that is, it 
could be that CMS itself, or its MergeThreads will encounter merges not 
returned by MP at first, but as a subsequent round due to changes done to the 
index.
This is trivially solved with my pulling model. We pull until nothing is left. 
Period. Instead of getting batches of merges from MP and then reconciling them 
with reality we do the same operation over and over again, until MP is 
satisfied - very simple.

bq. The proposal will add a getNextMerge() to MP, instead of IW, which IMO will 
only complicate matters for MP implementers. E.g., what should MP do if 
findRegularMerges was called, then getNext() was called and then 
findOptimizeMerges is called? It's not a critical decision we leave in the MP 
developers, but IMO it's unnecessary. Today MP is a stateless object - it 
receives SegmentInfos and returns a MergeSpec. It doesn't need to 'remember' 
anything. But if we move the getNextMerge() to it, we make it stateful, for no 
good reasons
bq. We don't really take IW outside the loop really - it would still need to 
instruct MP which merges to 'prepare', so that MS can take.
There will be, most probably, getNext(Normal/Optimize/Expunge)Merge() methods. 
findWhatever methods will be removed, noone needs to call them, so - no state, 
no 'preparations'.
MP will recieve SegmentInfos and return OneMerge.

bq. To allow for MP dependent sort, I suggest we add to MP a 
getMergesComparator and use it in CMS.
MP should return merges sorted, that's all. Why do you need to expose its 
Comparator or whatever it uses for sorting?


Whatever I didn't mention from your post - I either missed, or agree with :)
I think I'll stop trying to explain it in Jira comments. It took great time 
discussing everything with Mike over IRC, and here it'll take ages.
The proper route is to take a handful of dirt and sticks and slap together some 
working code to illustrate my point. And that's what I'm gonna do.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 

[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932864#action_12932864
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

{quote}
If we proceed w/ your proposal, that is basically the MS/ME polling MP, and not 
IW doing so, how would IW know about the running merges and pending ones? Today 
IW tracks those two lists so that if you need to abort merges, it knows which 
ones to abort.

We can workaround aborting the running merges by introducing a MS.abort()-like 
method. But what about MP? Now the lists are divided between too entities (MP 
and MS), and aborting a MP does not make sense (doable, but I don't think it 
belongs there). 
{quote}
There are no lists at all with my approach. At least no pending list, that 
one gets recalculated each time we poll MP and it never gets out, neither gets 
stored inside.
There's a kind of implicit in flight list - MS has the knowledge of its 
threads that are currently doing things. And if you want to go around aborting 
things, MS is probably the right place to do this.

bq. Maybe we can have MS.abort() poll MP for next merges until it returns null, 
and throwing all the returned ones away - that can be done.
So, just I said - that's not needed. MP is empty, it has no state.

bq. Should we, in the scope of this issue, make IW a required settable 
parameter on MS, like we do w/ MP?
For the love of God, no. I'd like to see it removed from MP too.
It's only natural to pass the same instance of Policy or Scheduler to different 
Writers, so they have the same behaviour and share Scheduler resources 
(insanely important if you have fifteen indexes like I do and don't want them 
to rape hardware with fifteen simultaneous merges).
It is against the nature to pass Writer to Policy. Does the Policy need to 
write anything on its own, when it decides to? No. It should advice, not act.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-17 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932982#action_12932982
 ] 

Shai Erera commented on LUCENE-2755:


Earwin, the way CMS currently handles the writer instance makes it entirely not 
thread-safe. If you e.g. pass different writers to merge(), the class member 
changes, and MTs will start merging other segments, and in the worse case 
attempt to merge segments of a different writer.

I myself thinks it's ok to have a MP and MS per writer, but I don't have too 
strong feelings for/against it - so if we want to allow this, we should fix CMS.

As for the other comments, I'll need to check more closely what IW does w/ 
those merges - as it checks all sorts of things (e.g. whether it's an optimize 
merge or not, see one of the latest bugs Mike resolved). So getting it entirely 
outside of IndexWriter and into MP/MS is risky - at least, I don't understand 
the code well enough (yet) to say whether it's doable at all and if we don't 
miss something.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932479#action_12932479
 ] 

Michael McCandless commented on LUCENE-2755:


{quote}
bq. But then you accumulate too many tiny merges, while waiting for the big one 
to finish?

You say this, as if it was something terribly wrong. 
Big merges aren't heffalumps, they don't usually stalk IW in droves. Big merge 
ends sooner or later, and tiny ones go out in a flash.
{quote}

In fact there is something wrong: without us explicitly scheduling the
running merges (ie setting thread priorities, stopping the big merges
when there are too many small ones), CMS will pause the incoming
threads.

I first saw this happen when testing our NRT reopen perf (which is
merge intensive).  Normally the turnaround is very fast (eg 5 msec)
but if you're at your max merge count then CMS will stall you and the
turnaround time easily becomes seconds, which is awful.

It's like an OS that refuses to schedule your ls command because
there's still some long running process...

CMS's explicit thread scheduling fixes that problem -- a big merge no
longer causes seconds of delay in opening a new NRT reader.  That is,
as long as net/net you've allocated enough CPUs (maxThreadCount) to do
the merging.  If merging is too slow vs indexing rate + reopen rate
then there's no hope: at some point reopen must be blocked (it's
a zero sum game).

In the ideal world the OS/JRE would do a better job scheduling, ie
realize that there are looong running threads and down prioritize them
(like the OS does to processes), but in practice it's nowhere close to
doing this right for our use case, so I don't see a choice here.  If
we want to keep our fast NRT reopen time, we have to manually schedule
our merge threads.

{quote}
bq. Maybe we should move BSMP to core and make it the default?

Dunno. The index you end up with is larger than with LogWhateverMP.
But you get a nice benefit of having roughly equal-sized big segments, which is 
cool for running collection in parallel.
Everyone has his own requirements.
{quote}

Fair enough...

{quote}
bq. But I don't fully understand how it chooses merges. EG does it pick 
lopsided merges (where the segments differ substantially in size), as long as 
they are small segments?

Docs say small-sized segments are treated as with LogByteSizeMP.
{quote}

Hmm... but then how does this differ from setting a maxMergeMB/Docs?

bq. We have seriously inefficient merge conflict resolution algorithm on our 
hands.

You're right!  Though, I think this typically isn't a problem for
LogMP since it doesn't normally pick future merges that conflict with
past ones.

Also the problem is bounded by how long the merge takes to finish.

But I agree we should try to fix this... other MPs could conceivably
do this.

{quote}
Ugly solution - when MP suggests a merge that is a strict superset of a queued, 
but not yet running merge - drop the old one, use the new.
Better solution - instead of asking MP for all the merges it deems reasonable 
on current index, we only ask it for most important one.
And we do it each time MS has an open slot for execution. This way each merge 
happening is the best merge possible at that moment.
{quote}

This seems dangerous: what's an important merge?

How about, instead, we let MP return all eligible merges (like it does
today) but then we replace all previously buffered but not yet running
merges w/ the new merges it returned?  Hmm but this would probably
require giving it access to the buffered-but-not-yet-running merge
set...


bq. About mergeInit, I took a look too and discovered a 140 lines method, so I 
doubt it does only new segment name registration. But perhaps I'm wrong and 
those are redundant 140 lines ...

Right -- it's doing lots of stuff.  But the question was whether CMS
really must be calling it itself vs leaving IW.merge to call it as SMS
does.  Ideally only IW.merge should call it (and it becomes private),
which if we take the name assignment out earlier seems feasible.


 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since 

[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-16 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932486#action_12932486
 ] 

Shai Erera commented on LUCENE-2755:


bq. Ideally only IW.merge should call it (and it becomes private),

I wouldn't make it private. If I remember correctly, the Parallel Index 
overrode that method to synchronize merges across all parallels.

bq. but if you're at your max merge count then CMS will stall you and the 
turnaround time easily becomes seconds, which is awful.

But Mike, if you hit your maxMergeCount with large merges, then you won't run 
tiny merges at all. It's only if you have room to run any merges, that this 
'pausing' actually helps. I trust you when you say you've observed that not 
pausing those merges hurt performance, but I wonder in real life, how often 
does that happen, and whether we should incorporate that in our code. If it's a 
rare case, then perhaps apps that hit it should use another MS which pauses its 
threads?

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932539#action_12932539
 ] 

Michael McCandless commented on LUCENE-2755:



{quote}
bq. Ideally only IW.merge should call it (and it becomes private),

I wouldn't make it private. If I remember correctly, the Parallel Index 
overrode that method to synchronize merges across all parallels.
{quote}

Ahh OK.

bq. But Mike, if you hit your maxMergeCount with large merges, then you won't 
run tiny merges at all.

Sure, but that's uncommon.  Ie, large merges don't happen very
frequently.

bq. It's only if you have room to run any merges, that this 'pausing' actually 
helps. I trust you when you say you've observed that not pausing those merges 
hurt performance, but I wonder in real life, how often does that happen, and 
whether we should incorporate that in our code. If it's a rare case, then 
perhaps apps that hit it should use another MS which pauses its threads?

Remember it's not just pausing.  We also set thread priorities so that
smaller merges run with higher priority, and, all merges run with
higher priority than the indexing threads (by default).

I don't think this is rare because eventually (assuming your index is
big enough) you'll hit a large merge and then you can fairly easily
see the merges stack up.  I've seen merges stack up in the non-NRT
case too.  Without this explicit thread scheduling we do, that large
merge can easily kill your NRT reopens, ie take many seconds to get a
new reader.  This is non-graceful degradation because at first NRT
reopen time looks great but then as your index grows and you hit a
large merge, suddenly it's many seconds.

If your app has costly merges (eg you store fields, term vectors, and
you use dynamic fields which means the stores cannot be bulk merged),
and you're not on an SSD, and your OS is memory starved so it can't do
as much readahead as it should be doing, your merges become far more
costly.  Worse, the default merge thread count (3) may in fact be too high
for most machines even with 4 or more cores.  There are many variables...

The scheduling can only do so much, of course.  Ie it enables us to
soak up the spare CPU cycles in between medium, little merges to let
the bit merge make progress.  But if those spare cycles aren't enough
then inevitably the best scheduling will still have to eventually
pause your reopens.

Still I think the other improvements we've talked about here would be
great steps forward.  It's just that we still need to explicitly
schedule the merge threads.


 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-16 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932561#action_12932561
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

Shai:
bq. The thing is - the second round of merges may only be detected by MP after 
most if not all of the merges from the previous round ended. Therefore, if we 
dim that functionality important (and I do), we must have the MergeThreads 
query MP as well.
Don't turn things upside down. MergeThreads are best hidden from view inside 
Executor, we just feed it chunks of work in the order we want it to be done. 
It's quite enough to have a single place that queries MP and feeds Executor.

bq. About mergeInit, I took a look too and discovered a 140 lines method, so I 
doubt it does only new segment name registration. But perhaps I'm wrong and 
those are redundant 140 lines ...
These are not redundant. But only a small number of them really need to be 
executed when a merge is added to CMS, everything else can wait until merge 
actually starts.

bq. I think that having a MS entity (or MergeExecutor) is important - it still 
gives an app the ability to override things if it wants to.
Tentatively agree on this.

bq. Also, putting the MS code inside IW will only add code to it, 
But this is not true, with all the cruft thrown away, remaining code size is 
likely on par with current that invokes MS from within IW.

bq. We define MP as the one responsible for returning the merges in the order 
it wants, and provide the necessary support by making OneMerge comparable
You contradict yourself here. If we make OneMerge comparable, we define order 
in its compareTo() method. MP can no longer return the merges in the order it 
wants. I suggest there's no Comparable, and MP does the sort itself and returns 
an ordered list of merges. Better yet - it returns only the first one of them, 
on each request.



Mike:
bq. In fact there is something wrong: without us explicitly scheduling the 
running merges (ie setting thread priorities, stopping the big merges when 
there are too many small ones), CMS will pause the incoming threads.
Why?? Indexing threads can drop merges into a queue, and forget about them. The 
blocking happens only if you explicitly want to use the same thread for 
merging, or if merging threads are lagging - in such a case you're in for 
troubles anyway.
All further analogies with OS scheduling are broken because we're not running 
ls, we're running background jobs, and don't really care which of them blocks 
which.

{quote}
bq.  Docs say small-sized segments are treated as with LogByteSizeMP.
Hmm... but then how does this differ from setting a maxMergeMB/Docs?
{quote}
BSMP doesn't keep your segments under certain size like maxMergeMB does. It 
ensures you have exactly N 'large' segments. How large they are - depends on 
total size of the index.
It also limits the maximum count of 'small' segments, the size distribution for 
them is the same as with LBSMP.
I think the [docs|http://code.google.com/p/zoie/wiki/ZoieMergePolicy] are 
pretty good.

{quote}
This seems dangerous: what's an important merge?

How about, instead, we let MP return all eligible merges (like it does
today) but then we replace all previously buffered but not yet running
merges w/ the new merges it returned? Hmm but this would probably
require giving it access to the buffered-but-not-yet-running merge
set...
{quote}
We already decided that merges should be sorted? And MP gives us a bunch of 
them. Sort the bunch, pick the first one - that's your important merge.
By calling this MP.giveMeNextImportantMerge() method repeatedly we free 
ourselves from tons of bookkeeping you just mentioned.

Ho-ho-ho! In fact, we don't need any queues, any buffering, no hard link 
between indexing and merging - no nothing at all :)
Here's my proposal cleaned up and reiterated:

* MP has a single method - getNextMerge() (I'm not taking optimize and friends 
into account now)
** For current policies this method works as described before - MP decides on a 
list of eligible merges, sorts them by some criteria (i.e. smaller merges come 
first) and returns the first.
* MS repeatedly polls MP for merges (in one or more threads) and executes them.
** If getNextMerge() returns null, MS goes to sleep - null signifies that from 
MP's perspective index is already in ideal state.
** When some index-changing events occur, eg - indexing thread adds some docs 
and creates a new segment, MS is woken up and resumes polling/merging.
** rinse, repeat
* In CMS-like scenario, indexing and merging threads are completely decoupled 
and even lack queues that could overflow. So if some lesser merges have to wait 
for the big one - that's not going to bite you at all.
* In SMS-like scenario, polling happens within indexing thread, and everything 
works as it does now.

{quote}
Still I think the other improvements we've talked 

[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932599#action_12932599
 ] 

Jason Rutherglen commented on LUCENE-2755:
--

bq. large merge can easily kill your NRT reopens

When RT is implemented, these small segments (that require merging) go
away because we'll be flushing relatively medium sized segments the size
of the RAM buffer.

I'd prefer to have more control of large merges from an external process
so that they may be scheduled according to application demand, ie, during
non-peak hours. This is actually what I've implemented in production using
things like optimize num segments = 5 and/or expunge deletes during the
early morning hours. However the external control is doable today using an
existing merge policy such as LogByteSizeMergePolicy, where during the day
for example, the maximum segment size could be lower, and in the early
morning, it'd be set to something much higher, or nullified altogether. 

Also the problems associated with merge interleaving only affects systems
that are not using replication because the merging is occurring on an
index-only server. The newly merged files are transferred over as is, and
that's something that can easily be interleaved with other IO processes.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-16 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932807#action_12932807
 ] 

Shai Erera commented on LUCENE-2755:


bq. You contradict yourself here. If we make OneMerge comparable, we define 
order in its compareTo() method. 

I think it's convenient to have OneMerge comparable somehow. But we can have MP 
sort them using its own Comparator. By making them Comparable I intended to say 
'this is the default order' - but we can have a DefaultComparator instead.

If we proceed w/ your proposal, that is basically the MS/ME polling MP, and not 
IW doing so, how would IW know about the running merges and pending ones? Today 
IW tracks those two lists so that if you need to abort merges, it knows which 
ones to abort.

We can workaround aborting the running merges by introducing a MS.abort()-like 
method. But what about MP? Now the lists are divided between too entities (MP 
and MS), and aborting a MP does not make sense (doable, but I don't think it 
belongs there). Maybe we can have MS.abort() poll MP for next merges until it 
returns null, and throwing all the returned ones away - that can be done. 
Aborting an Executor is easy, and I think can be faster than our current way of 
doing so.

I would still love to see the merge code (as much as possible) going away from 
IW. This may not be doable now, but could be in the future, if we factor out a 
SegmentsMerger/IndexMerger entity which encapsulates the merge execution and 
policy inside. But this is for another day.

BTW, MS.merge() takes an IW, as if you could call merge() w/ two IW instances 
and things will work ok. It does in SMS but doesn't in CMS. Should we, in the 
scope of this issue, make IW a required settable parameter on MS, like we do w/ 
MP?

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932072#action_12932072
 ] 

Shai Erera commented on LUCENE-2755:


The problem with ThreadPoolExecutor is that its submit() doesn't block on the 
queue, even if you pass a bounded ArrayBlockingQueue (which is really silly 
IMO). I was hoping we can super simplify CMS logic by letting a BlockingQueue 
throttle the number of merges we 'register' before CMS itself waits, and the 
ExecutorService instead of the MergeThreads and their management.

Unfortunately this does not look to be the case. Here is an alternative 
solution which looks a nice workaround: 
http://stackoverflow.com/questions/2001086/how-to-make-threadpoolexecutors-submit-method-block-if-it-is-saturated.

The idea is to block the call to ExecutorService.execute() through a Semaphore. 
In that case, I think it's safe to not use a blocking queue at all, because the 
throttling will be handled by the Semaphore.

Another alternative is to use a CallerRunsPolicy as the rejection policy, which 
has many disadvantages (such as potentially starving the other threads if the 
caller gets to execute the heavy task, or risking running the tasks by N+1 
threads etc.).

Earwin - if we make OneMerge comparable, we give any MP the freedom to decide 
the order merges will run. In my case it is important because I'm getting a 
certain time frame to run index optimization, and prefer to reduce as many 
segments as possible, therefore I choose to run the smaller merges first. I 
think it's a reasonable decision anyway as a default, because even if you call 
close(false) (not waiting for merges), then it's better if some merges have 
already finished and committed, thereby you're making forward progress all the 
time, vs. if you run merges in arbitrary order you mind not finish any merge.

I agree though that in some situations apps won't care, in which case sorting 
by merge size will be as good as random ordering.

Pausing large merges is something that I consider less important though, but I 
don't want to break back-compat behavior. IMO if a merge started - let it 
finish. You don't know how much work it has completed, how much work is left, 
and how much work does the 'smaller' merge has (what if say it's smaller by 1 
byte/doc?). In different situations the best decision might be different, 
therefore IMO we shouldn't pause threads - rather let the MP decide up front 
the order of the merges (if it wants to) and then execute them in that order.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932124#action_12932124
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

Whatever solution for block-on-add you employ, I think it is important to 
implement it as an Executor. I think, people can benefit from threading policy 
being pluggable.

I'm not against sorting merges, it's so simple, even if useless. Though maybe 
it's better to use Comparator, so you can redefine the order? Pausing large 
merges is another issue - that's a freakload of complexity for zero gain.

Another issue to ponder - what about slightly uncluttering IW - MS 
interaction?
We drop IW.getNextMerge, MS.merge(IW), and replace them with 
MS.scheduleMerge(MP.OM), so instead of IW asking MS to pull all merges from 
itself, it simply pushes them.
Also, let's kill this weeeird IW.mergeInit that is called from CMS, but not SMS 
:)


But oh, well. With introduction of executors, and SMS being folded as a special 
case of CMS, we might as well drop MS completely and move what little code is 
left straight to IW, which will now accept an executor.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932162#action_12932162
 ] 

Michael McCandless commented on LUCENE-2755:


bq. We drop IW.getNextMerge, MS.merge(IW), and replace them with 
MS.scheduleMerge(MP.OM), so instead of IW asking MS to pull all merges from 
itself, it simply pushes them.

That sounds like a great simplification!

bq. I'm not against sorting merges, it's so simple, even if useless. Though 
maybe it's better to use Comparator, so you can redefine the order? Pausing 
large merges is another issue - that's a freakload of complexity for zero gain.

Pausing large merges is (unfortunately) important for full use of available 
concurrency.  Otherwise, when a laaarge merge is taking place, it causes to to 
fully stop your indexing threads unnecessarily.  Turn on infoStream when 
building a large index and you'll see...

An OS CPU scheduler will lower the priority of long-running CPU hogging 
processes, for the same reason (so that newly started CPU hog processes that 
are short running get nearly 100% of the CPU so they finish fast).  It's just 
that we don't have the freedom to allow an unbounded number of merges that we 
must approximate this by explicitly pausing the long running merges.

bq. Also, let's kill this weeeird IW.mergeInit that is called from CMS, but not 
SMS

There was some reason why this needed to be called by CMS but not SMS but I 
can't remember why.  (It's re-called by IW.merge in case the MS didn't already 
call it).  But it'd be great to not call it from CMS if it's not necessary... I 
can't remember the reason.

bq. With introduction of executors, and SMS being folded as a special case of 
CMS, we might as well drop MS completely and move what little code is left 
straight to IW, which will now accept an executor.

That's tempting... but people use MSs eg to schedule big merges at different 
times.  I don't think we should outright drop MS.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932182#action_12932182
 ] 

Jason Rutherglen commented on LUCENE-2755:
--

A good feature for Solr could be the ability to via an HTTP call kick-off 
pending large merges.  They could then be scheduled via a cron job and based on 
other factors, such as whether or not other indexing tasks are running.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932200#action_12932200
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

bq. There was some reason why this needed to be called by CMS but not SMS but I 
can't remember why.
That has something to do with assigning new segment names, if you believe the 
comments.
But IW.mergeInit does a freakload of other stuff! I think assigning names can 
happen in a separate place, before OneMerge is submitted to MS.

bq. Otherwise, when a laaarge merge is taking place, it causes to to fully stop 
your indexing threads unnecessarily
I still think this can be mitigated in more appropriate ways. Like allocating 
big enough pending merges queue to wait until the long one finishes.
Indexing threads push merges into the queue (with CMS) and don't block.
Plus to that, you can use nice policies like BalancedSegmentMergePolicy, that 
prevent UBER-merges from occuring at all.

bq. That's tempting... but people use MSs eg to schedule big merges at 
different times. I don't think we should outright drop MS.
That exact use case is totally wrong. MergePolicy decides which merges should 
run NOW, MergeScheduler executes them.
If a certain big merge should run only within some specific timeframe, 
MergePolicy should not return it when asked for eligible merges.

In your sample, when decision-making is smeared across classes, the merges 
created by MP and deferred by MS are stale when their
time comes. If asked now, MP would include some additional segments in the 
merge that MS stalled around for ages.

For a glance of things done relatively right, take a look at BSMP - it has 
setPartialExpunge method, that alters its
behaviour to include some expensive housecleaning. It is supposed you do 
setPartialExpunge(true) at the beginning of your quiet period, and
setPartialExpunge(false) when it ends.

bq. A good feature for Solr could be the ability to via an HTTP call kick-off 
pending large merges.  They could then be scheduled via a cron job and based on 
other factors, such as whether or not other indexing tasks are running.
Same argument here. The place for such decisions is MergePolicy.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932234#action_12932234
 ] 

Michael McCandless commented on LUCENE-2755:


{quote}
That has something to do with assigning new segment names, if you believe the 
comments.
But IW.mergeInit does a freakload of other stuff! I think assigning names can 
happen in a separate place, before OneMerge is submitted to MS.
{quote}

If indeed that's all then I agree, let's just assign the name up front and then 
CMS need not call mergeInit.

{quote}
bq. Otherwise, when a laaarge merge is taking place, it causes to to fully stop 
your indexing threads unnecessarily

I still think this can be mitigated in more appropriate ways. Like allocating 
big enough pending merges queue to wait until the long one finishes.
Indexing threads push merges into the queue (with CMS) and don't block.
{quote}

But then you accumulate too many tiny merges, while waiting for the big one to 
finish?

bq. Plus to that, you can use nice policies like BalancedSegmentMergePolicy, 
that prevent UBER-merges from occuring at all.

Maybe we should move BSMP to core and make it the default?

But I don't fully understand how it chooses merges.  EG does it pick lopsided 
merges (where the segments differ substantially in size), as long as they are 
small segments?

{quote}
MergePolicy decides which merges should run NOW, MergeScheduler executes them.
If a certain big merge should run only within some specific timeframe, 
MergePolicy should not return it when asked for eligible merges.
{quote}

I agree there is ambiguity here, which is not good.  It is tempting to nuke 
MergeScheduler (absorb CMS into IW, w/ SMS a special case) and define 
MergePolicy to only return merges which should run right now... that would be a 
nice simplification.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932266#action_12932266
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

bq. But then you accumulate too many tiny merges, while waiting for the big one 
to finish?
You say this, as if it was something terribly wrong. :)
Big merges aren't heffalumps, they don't usually stalk IW in droves. Big merge 
ends sooner or later, and tiny ones go out in a flash.

bq. Maybe we should move BSMP to core and make it the default?
Dunno. The index you end up with is larger than with LogWhateverMP.
But you get a nice benefit of having roughly equal-sized big segments, which is 
cool for running collection in parallel.
Everyone has his own requirements.

bq. But I don't fully understand how it chooses merges. EG does it pick 
lopsided merges (where the segments differ substantially in size), as long as 
they are small segments?
Docs say small-sized segments are treated as with LogByteSizeMP.



Another thought I had looking through the code. We have seriously inefficient 
merge conflict resolution algorithm on our hands.
We just damn drop all new merges that have segments in common with the merges 
already queued (but not yet running!!).
What does that mean?

Imagine we're producing a slew of mini-segments with decent speed and our 
MergeScheduler is lagging behind:
* new seg1
* new seg2
* queue merge seg1+seg2
* start merge seg1+seg2
* new seg3
* new seg4
* queue merge seg3+seg4
* new seg5
* FAIL queue merge seg3+seg4+seg5
* new seg6
* FAIL queue merge seg3+seg4+seg5+seg6
* finish merge seg1+seg2
* start merge seg3+seg4

By that point we should really start merging of all four last segments (maybe 
together with the result of seg1+seg2).
But in reality we'll merge seg3+seg4, than seg5+seg6 and then all of three 
merge results together (provided no new mini-segments are added).

If we throw large merges into the loop (whether pausable or not) the situation 
is amplified.

Ugly solution - when MP suggests a merge that is a strict superset of a queued, 
but not yet running merge - drop the old one, use the new.
Better solution - instead of asking MP for all the merges it deems reasonable 
on current index, we only ask it for most important one.
And we do it each time MS has an open slot for execution. This way each merge 
happening is the best merge possible at that moment.

Please, correct my wrongs, if any.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932356#action_12932356
 ] 

Shai Erera commented on LUCENE-2755:


I have to say I totally agree w/ Earwin - in my mind, the MP should be the one 
deciding what merges to run, and in what order, the MS should be the one 
executing them. Hack, we should really call MS a MergeExecutor, since it 
doesn't really schedule anything, just does what it's told, based on the 
execution policy (parallel or blocking).

In my app, it's the MP which decides which merges to run, based on their sizes, 
time of the day and allotted time to run. I expect MS to faithfully do what 
it's told and don't play tricks on me (like pausing merges I've asked it to 
run), 'cause otherwise I'll need to write a MS too.

Another question that was brought up here is who should register the merges. 
Today there are two entities - one is the MS which repeatedly calls 
IW.getNextMerge() and in the CMS case, MergeThread does so too. The 
disadvantage of that is that IW.getNextMerge() (and mergeInit()) is called from 
two places, but the advantage is that it allows executing all of MP merges, 
even if they come in several rounds. Example, you have 4 segs of level 0 and 3 
segs of level 1 (mergeFactor=4). MP returns a single merge (4 segs level 0), 
and IW.getNextMerge() returns null 'cause there are no merges left). However, 
after completing this merge, if asked, MP will return a follow on merge of 4 
segs level 1. That will be picked up by the MergeThread that calls 
IW.getNextMerge().

The thing is - the second round of merges may only be detected by MP after most 
if not all of the merges from the previous round ended. Therefore, if we dim 
that functionality important (and I do), we must have the MergeThreads query MP 
as well.

About mergeInit, I took a look too and discovered a 140 lines method, so I 
doubt it does only new segment name registration. But perhaps I'm wrong and 
those are redundant 140 lines ...

I think that having a MS entity (or MergeExecutor) is important - it still 
gives an app the ability to override things if it wants to. Also, putting the 
MS code inside IW will only add code to it, and I think that we should really 
start refactoring it down to smaller, more readable and focused, pieces. So I'm 
against adding more logic to it. For 3x we can choose to improve things 
internally, or leave them as they are. For 4.0 I'd suggest we do the following:
* Create a new MergeExecutor entity receives an Executor(Service?) to run with, 
but also defaults to one (like CMS is today). That replaces CMS.
** You can control the number of threads it runs with (1 for almost-SMS-like 
behavior and more for CMS).
* We create a BlockingMergeExecutor, which regardless of how many threads you 
allow it to run with, blocks until all merges finish.
** That is an improved SMS - I've always thought that blocking until merges 
finish is not related to how many threads you'd like to execute merges with. 
E.g., if you set CMS's # threads to 1, you get a sort of an SMS behavior, only 
the call is non-blocking.
* We define MP as the one responsible for returning the merges in the order it 
wants, and provide the necessary support by making OneMerge comparable (Earwin, 
MP can still sort by using a custom Comparator, we only provide a default 
comparison method for merges). That definition will be mostly in javadocs.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a 

[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12931796#action_12931796
 ] 

Shai Erera commented on LUCENE-2755:


I've looked into integrating an ExecutorService and I think it can really 
simplify things, as long as we can let go of CMS sorting merges by their size. 
And I think - why should it? What if we make it to MP's decision? Namely, if 
you care about which merges run first, have your MP sort them the way you want, 
before you return them. If we make OneMerge comparable, that should be a very 
trivial extension one has to make to MP (extend MP, override the methods, call 
super.method() and then sort accordingly).

If we do that, then SMS and CMS will work the same - execute the merges in the 
order returned by MP, only CMS will do so in parallel and SMS will do so 
synchronously. By reading that you can already see that later (in a separate 
issue), we can let go of SMS entirely - it will be a single-threaded 
ExecutorService, w/ an implementation to wait until the work competes (unlike 
CMS which returns immediately). But that's for another day.

What do you think?

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12931809#action_12931809
 ] 

Michael McCandless commented on LUCENE-2755:


Cutting over to ExecutorService and letting MergePolicy dictate how OneMerge 
compares sounds great!

Will CMS still have the ability to pause running big merges in order to let 
smaller ones complete?  Seems like these changes should still allow that to 
work correctly.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12931468#action_12931468
 ] 

Shai Erera commented on LUCENE-2755:


Ok, so not calling IndexWriter.getNextMerge() before we know we can register 
that merge is problematic. The reason is we want to know if there is a next 
merge before we check if it can be registered. If not, the method returns 
immediately. Otherwise, we'll wait until any merge can be registered, just to 
discover there are no more merge.

So one solution can be to add to IW a hasMerges() and in CMS wait for room to 
become available only if there are merges.

Another solution is to do a larger change to CMS and introduce an 
ExecutorService - this has been raised in the past, so perhaps it's time to 
finally do it? By using a blocking queue, we don't need to implement any 
waiting logic - Java will do it for us.

The downside of that is that I'm not sure we can control which of the merges 
runs and which isn't. Perhaps we can hack this through - I'll need to start the 
process to tell for sure. This feature is important - today CMS guarantees the 
smaller merges run first - so it might be that a larger merge was registered 
before a smaller merge, and we'd still want to execute the smaller one before 
the larger.

A third solution would be to not do anything and keep things as they are - 
namely let some merge be held by CMS until it can be executed.

Just summarizing my thoughts for now.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-11 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12931015#action_12931015
 ] 

Yonik Seeley commented on LUCENE-2755:
--

bq. CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
size segments anymore, I think we should switch the default impl.

That's probably a good idea.
We should also scale this by the number of deleted docs... so if a segment has 
10,000 documents, is 10MB in size, and has 2000 deleted documents, then we 
should consider it as 8MB for the purposes of selecting to merge?

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2010-11-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12931086#action_12931086
 ] 

Shai Erera commented on LUCENE-2755:


Right! Hence why I wrote it is best determined by MP. LogMP already does it by 
default (can be disabled).

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then that blocked merge will run. I think it's unnecessary that that 
 merge will be blocked.
 * CMS sorts merges by segments size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 size segments anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 On the go, I'd like to add some documentation to CMS - it's not very easy to 
 read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org