[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527304
 ] 

Michael McCandless commented on LUCENE-847:
---

> Looks like some anomalous tests. Last night I checked twice, but
> today results are: 58 to 48 in favor of Concurrent. I am going to
> assume my first results where invalid. Sorry for the noise and
> thanks for the great patch.

OK, phew!

> Has passed quite a few stress tests I run on my app without any
> problems so far.

I'm glad to hear that :)  Thanks for being such an early adopter!

> Do both merge policies allow for a closer to constant add time or is
> it just the Concurrent policy?

Not sure I understand the question -- you mean addDocument?  Yes it's
only ConcurrentMergeScheduler that should keep addDocument calls
constant time, because SerialMergeScheduler will hijack the addDocument
thread to do its merges.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527300
 ] 

Mark Miller commented on LUCENE-847:


Looks like some anomalous tests. Last night I checked twice, but today results 
are: 58 to 48 in favor of Concurrent. I am going to assume my first results 
where invalid. Sorry for the noise and thanks for the great patch. Has passed 
quite a few stress tests I run on my app without any problems so far. Do both 
merge policies allow for a closer to constant add time or is it just the 
Concurrent policy?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527297
 ] 

Michael McCandless commented on LUCENE-847:
---

> I have to triple check, but on first glance, my apps performance
> halfed using the ConcurrentMergeScheduler on a recent core duo with
> 2 GB RAM (As compared to the SerialMergeSceduler). Seems unexpected?

Whoa, that's certainly unexpected!  I'll go re-run my perf test.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527295
 ] 

Michael McCandless commented on LUCENE-847:
---


> Today, applications use multiple threads on IndexWriter to get some
> concurrency on document parsing. With this patch, applications that
> want concurrent merges would simply use ConcurrentMergeScheduler,
> no?

True.  OK I will make SerialMergeScheduler.merge serialized.  This way
only one merge can happen at a time even when the application is using
multiple threads.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527289
 ] 

Mark Miller commented on LUCENE-847:


I have to triple check, but on first glance, my apps performance halfed using 
the ConcurrentMergeScheduler on a recent core duo with 2 GB RAM (As compared to 
the SerialMergeSceduler). Seems unexpected?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527286
 ] 

Ning Li commented on LUCENE-847:


> This was actually intentional: I thought it fine if the application is
> sending multiple threads into IndexWriter to allow merges to run
> concurrently.  Because, the application can always back down to a
> single thread to get everything serialized if that's really required?

Today, applications use multiple threads on IndexWriter to get some concurrency 
on document parsing. With this patch, applications that want concurrent merges 
would simply use ConcurrentMergeScheduler, no?

Or a rename since it doesn't really serialize merges?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527258
 ] 

Michael McCandless commented on LUCENE-847:
---

> Hmm, it's actually possible to have concurrent merges with
> SerialMergeScheduler.

This was actually intentional: I thought it fine if the application is
sending multiple threads into IndexWriter to allow merges to run
concurrently.  Because, the application can always back down to a
single thread to get everything serialized if that's really required?


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527239
 ] 

Ning Li commented on LUCENE-847:


Hmm, it's actually possible to have concurrent merges with SerialMergeScheduler.

Making SerialMergeScheduler.merge synchronize on SerialMergeScheduler will 
serialize all merges. A merge can still be concurrent with a ram flush.

Making SerialMergeScheduler.merge synchronize on IndexWriter will serialize all 
merges and ram flushes.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527227
 ] 

Michael McCandless commented on LUCENE-847:
---

Ahh, good catch.  Will fix!

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-13 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527224
 ] 

Ning Li commented on LUCENE-847:


Access of mergeThreads in ConcurrentMergeScheduler.merge() should be 
synchronized.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526712
 ] 

Michael McCandless commented on LUCENE-847:
---

> > Good idea! I took exactly this approach in patch I just attached. I
> > made a simple change: LogMergePolicy.findMergesForOptimize first
> > checks if "normal merging" would want to do merges and returns them if
> > so. Since "normal merging" exposes concurrent merges, this gains us
> > concurrency for optimize in cases where the index has too many
> > segments. I wasn't sure how otherwise to expose concurrency...
>
> Another option is to schedule merges for the newest N segments and
> the next newest N segments and the next next... N is the merge
> factor.

OK, that is simpler.  I'll take that approach (and not call the
"normal" merge policy first).

> A couple of other things:
> 
>   - It seems you intended sync() to be part of the MergeScheduler
> interface?

I had started down this route but then backed away from it: I think
IndexWriter should handle this rather than making every MergeScheduler
have duplicated code for doing so.  Oh I see, I had left empty sync()
in SerialMergeScheduler; I'll remove that.

>  - IndexWriter.close([true]), abort(): The behaviour should be the
>same whether the calling thread is the one that actually gets to do
>the closing. Right now, only the thread that actually does the
>closing waits for the closing. The others do not wait for the
>closing.

Ahh good point.  OK, I'll have other threads wait() until the
close/abort is complete.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-11 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526628
 ] 

Ning Li commented on LUCENE-847:


> OK, another rev of the patch (take6).  I think it's close!

Yes, it's close! :)

> I made one simplification to the approach: IndexWriter now keeps track
> of "pendingMerges" (merges that mergePolicy has declared are necessary
> but have not yet been started), and "runningMerges" (merges currently
> in flight).  Then MergeScheduler just asks IndexWriter for the next
> pending merge when it's ready to run it.  This also cleaned up how
> cascading works.

I like this simplification.

>   * Optimize: optimize is now fully concurrent (it can run multiple
> merges at once, new segments can be flushed during an optimize,
> etc).  Optimize will optimize only those segments present when it
> started (newly flushed segments may remain separate).

This semantics does add a bit complexity - segmentsToOptimize, 
OneMerge.optimize.

> Good idea!  I took exactly this approach in patch I just attached.  I
> made a simple change: LogMergePolicy.findMergesForOptimize first
> checks if "normal merging" would want to do merges and returns them if
> so.  Since "normal merging" exposes concurrent merges, this gains us
> concurrency for optimize in cases where the index has too many
> segments.  I wasn't sure how otherwise to expose concurrency...

Another option is to schedule merges for the newest N segments and the next 
newest N segments and the next next... N is the merge factor.


A couple of other things:

  - It seems you intended sync() to be part of the MergeScheduler interface?

  - IndexWriter.close([true]), abort(): The behaviour should be the same 
whether the calling thread is the one that actually gets to do the closing. 
Right now, only the thread that actually does the closing waits for the 
closing. The others do not wait for the closing.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526305
 ] 

Michael McCandless commented on LUCENE-847:
---


> In the while loop of optimize(), LogMergePolicy.findMergesForOptimize
> returns a merge spec with one merge. If ConcurrentMergeScheduler is
> used, the one merge will be started in MergeScheduler.merge() and
> findMergesForOptimize will be called again. Before the one merge
> finishes, findMergesForOptimize will return the same spec but the
> one merge is already started. So only one concurrent merge is
> possible and the main thread will spin on calling
> findMergesForOptimize and attempting to merge.

Yes.  The new patch has cleaned this up nicely, I think.

> One possible solution is to make LogMergePolicy.findMergesForOptimize
> return multiple merge candidates. It allows higher level of
> concurrency.

Good idea!  I took exactly this approach in patch I just attached.  I
made a simple change: LogMergePolicy.findMergesForOptimize first
checks if "normal merging" would want to do merges and returns them if
so.  Since "normal merging" exposes concurrent merges, this gains us
concurrency for optimize in cases where the index has too many
segments.  I wasn't sure how otherwise to expose concurrency...

> It also alleviates a bit the problem of main thread spinning. To
> solve this problem, maybe we can check if a merge is actually
> started, then sleep briefly if not (which means all merges
> candidates are in conflict)?

This is much cleaner in new patch: there is no more spinning.  In new
patch if multiple threads are merging (either spawned by
ConcurrentMergeaScheduler or provided by the application or both) then
they all pull from a shared queue of "merges needing to run" and then
return when that queue is empty.  So no more spinning.

> One difference between the current approach on concurrent merge and
> the patch I posted a while back is that, in the current approach, a
> MergeThread object is created and started for every concurrent
> merge. In my old patch, maxThreadCount of threads are created and
> started at the beginning and are used throughout. Both have pros and
> cons.

Yeah I thought I would keep it simple (launch thread when needed then
let it finish when it's done) rather than use a pool.  This way
threads are only created (and are only alive) while concurrency is
actually needed (ie > N merges necessary at once).  But yes there are
pros/cons either way.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-09 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526029
 ] 

Ning Li commented on LUCENE-847:


Comments on optimize():

  - In the while loop of optimize(), LogMergePolicy.findMergesForOptimize 
returns a merge spec with one merge. If ConcurrentMergeScheduler is used, the 
one merge will be started in MergeScheduler.merge() and findMergesForOptimize 
will be called again. Before the one merge finishes, findMergesForOptimize will 
return the same spec but the one merge is already started. So only one 
concurrent merge is possible and the main thread will spin on calling 
findMergesForOptimize and attempting to merge.

  - One possible solution is to make LogMergePolicy.findMergesForOptimize 
return multiple merge candidates. It allows higher level of concurrency. It 
also alleviates a bit the problem of main thread spinning. To solve this 
problem, maybe we can check if a merge is actually started, then sleep briefly 
if not (which means all merges candidates are in conflict)?


A comment on concurrent merge threads:

  - One difference between the current approach on concurrent merge and the 
patch I posted a while back is that, in the current approach, a MergeThread 
object is created and started for every concurrent merge. In my old patch, 
maxThreadCount of threads are created and started at the beginning and are used 
throughout. Both have pros and cons.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525903
 ] 

Michael McCandless commented on LUCENE-847:
---

OK, as a better test of ConcurrentMergeScheduler, and towards making it
the default merge scheduler, I tried making it the default in
IndexWriter and then ran all unit tests, and uncovered problems with
the current patch (notably how optimize works!).  So I'm working on an
new patch now



> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525802
 ] 

Michael McCandless commented on LUCENE-847:
---

> Is there any reason not to make ConcurrentMergeScheduler the default too 
> after this is committed?

Good question.  The only downsides I can think of are:

  * It's all fresh code so until we let it "age" some, it's a higher
risk that something broke.  That said there is decent unit test
coverage for it and these unit tests did find some sneaky issues
(which I fixed!).

  * It only actually helps on machines that have some concurrency.
But in this case we are largely talking about IO concurrent w/ CPU
which nearly all machines have I think.

I think the benefits are sizable:

  * Good performance gains (25% speedup of net indexing time for all
of Wikipedia content -- details in LUCENE-870)

  * Trivial way to leverage concurrency (ie you don't need to manage
your own threads).

  * No more unexpected long pauses on certain addDocument calls.

So I think it would make sense to make it the default.  I'll include
that in the new issue for changing defaults in IndexWriter.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-09-07 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525797
 ] 

Doug Cutting commented on LUCENE-847:
-

Is there any reason not to make ConcurrentMergeScheduler the default too after 
this is committed?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.take5.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-31 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524138
 ] 

Michael McCandless commented on LUCENE-847:
---


> > Not quite following you here... not being eligible because the
> > merge is in-progress in a thread is something I think any given
> > MergePolicy should not have to track?  Once I factor out CMPW as
> > its own Merger subclass I think the eligibility check happens only
> > in IndexWriter?
>
> I was referring to the current patch: LogMergePolicy does not check
> for eligibility, but CMPW, a subclass of MergePolicy, checks for
> eligibility. Yes, the eligibility check only happens in IndexWriter
> after we do Merger class.

OK, let's leave eligibility check in IW.

> > Rename to/from what?  (It is currently called
> > MergePolicy.optimize).  IndexWriter steps through the merges and
> > only runs the ones that do not conflict (are eligible)?
> 
> Maybe rename to MergePolicy.findMergesToOptimize?

OK, that's good.

> > > The reason I asked is because none of them are used right
> > > now. So they might be used in the future?
> > 
> > Both of these methods are now called by IndexWriter (in the
> > patch), upon flushing a new segment.
> 
> I was referring to the parameters. The parameters are not used.

Ahh, got it.  Yes the thinking is merge policies in the future may
want to look @ segmentinfos to decide.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-31 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524084
 ] 

Ning Li commented on LUCENE-847:


> Not quite following you here... not being eligible because the merge
> is in-progress in a thread is something I think any given MergePolicy
> should not have to track?  Once I factor out CMPW as its own Merger
> subclass I think the eligibility check happens only in IndexWriter?

I was referring to the current patch: LogMergePolicy does not check for 
eligibility, but CMPW, a subclass of MergePolicy, checks for eligibility. Yes, 
the eligibility check only happens in IndexWriter after we do Merger class.

> Rename to/from what?  (It is currently called MergePolicy.optimize).
> IndexWriter steps through the merges and only runs the ones that do
> not conflict (are eligible)?

Maybe rename to MergePolicy.findMergesToOptimize?

> > The reason I asked is because none of them are used right now. So
> > they might be used in the future?
> 
> Both of these methods are now called by IndexWriter (in the patch),
> upon flushing a new segment.

I was referring to the parameters. The parameters are not used.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-31 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524039
 ] 

Michael McCandless commented on LUCENE-847:
---


> > Good point...  I think I could refactor this so that cascading logic
> > lives entirely in one place IndexWriter.
>
> Another problem of the current cascading in CMPW.MergeThread is, if
> multiple candidate merges are found, all of them are added to
> IndexWriter.mergingSegments. But all but the first should be removed
> because only the first merge is carried out (thus removed from
> mergeSegments after the merge is done).

You're right -- I'm only doing the first non-conflicting merge in
CMPW (but then not releasing the rest of them).  I think this would be
fixed by having cascading logic only in IndexWriter.

> How do you make cascading live entirely in IndexWriter? Just
> removing cascading from CMPW.MergeThread has one drawback.  For
> example, segment sizes of an index are: 40, 20, 10, buffer size is
> 10 and merge factor is 2. A buffer full flush of 10 will trigger
> merge of 10 & 10, then cascade to 20 & 20, then cascade to 40 &
> 40. CMPW without cascading will stop after 10 & 10 since
> IndexWriter.maybeMerge has already returned. Then we have to wait
> for the next flush to merge 20 & 20.

Oh, I would remove from CMPW and add then add it into IndexWriter (so
the scenario above would cascade normally).  Meaning, IndexWriter,
upon completing a merge, would always consult the policy for whether
the completed merge has now enabled any new merges.

This is somewhat messy though (with CMPW as a MergePolicy) because
then findCandidateMerges would need to know if it was being called
(due to cascading) under one of its own threads and if so act
differently.  Another good reason to make it a separate Merger
subclass.

> > How would this be used?  Ie, how would one make an IndexWriter
> > that uses the ConcurrentMerger?  Would we add expert methods
> > IndexWriter.set/getIndexMerger(...)?  (And presumably the
> > mergePolicy is now owned by IndexMerger so it would have the
> > set/getMergePolicy(...)?)
> > 
> > Also, how would you separate what remains in IW vs what would be
> > in IndexMerger?
>
> Maybe Merger does and only does merge (so IndexWriter still owns
> MergePolicy)? Say, base class Merger.merge simply calls
> IndexWriter.merge. ConcurrentMerger.merge creates a merge thread if
> possible. Otherwise it calls super.merge, which does non-concurrent
> merge. IndexWriter simply calls its merger's merge instead of its
> own merge. Everything else remains in IndexWriter.

OK I will test out this approach.

> > Hmm ... you're right.  This is a separate issue from merge policy,
> > right?  Are you proposing buffering deletes in DocumentsWriter
> > instead?
>
> Yes, this is a separate issue. And yes if we consider
>  DocumentsWriter as staging area.

I will open new issue.

> > Good catch!  How to fix?  One thing we could do is always use
> > SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
> > level.  This way using the default 'equals' (same instance) would
> > work.  Or we could establish identity (equals) of a SegmentInfo as
> > checking if the directory plus segment name are equal?  I think
> > I'd lean to the 2nd option
>
> I think the 2nd option is better.

I'll take this approach.

> > Hmmm yes.  In fact I think we can remove synchronized from
> > optimize altogether since within it we are synchronizing(this) at
> > the right places?  If more than one thread calls optimize at once,
> > externally, it is actually OK: they will each pick a merge that's
> > viable (concurrently) and will run the merge, else return once
> > there is no more concurrency left.  I'll add a unit test that
> > confirms this.
>
> That seems to be the case.

I'll add unit test to confirm.

> The fact that "the same merge spec will be returned without changes
> to segmentInfos" reminds me: MergePolicy.findCandidateMerges finds
> merges which may not be eligible; but CMPW checks for eligibility
> when looking for candidate merges. Maybe we should unify the
> behaviour?

Not quite following you here... not being eligible because the merge
is in-progress in a thread is something I think any given MergePolicy
should not have to track?  Once I factor out CMPW as its own Merger
subclass I think the eligibility check happens only in IndexWriter?

> BTW, MergePolicy.optimize (a rename?) doesn't check for eligibility
> either.

Rename to/from what?  (It is currently called MergePolicy.optimize).
IndexWriter steps through the merges and only runs the ones that do
not conflict (are eligible)?

> > Well, useCompoundFile(...) is given a single newly flushed segment
> > and should decide whether it should be CFS.  Whereas
> > useCompoundDocStore(...) is called when doc stores are flushed.
> > When autoCommit=false, segments can share a single set of doc
> > stores, so there's no s

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-30 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523957
 ] 

Ning Li commented on LUCENE-847:


> True, but I was thinking CMPW could be an exception to this rule.  I
> guess I would change the rule to "simple merge policies don't have to
> run their own merges"?

:) Let's see if we have to make that exception.

> Good point...  I think I could refactor this so that cascading logic
> lives entirely in one place IndexWriter.

Another problem of the current cascading in CMPW.MergeThread is, if multiple 
candidate merges are found, all of them are added to 
IndexWriter.mergingSegments. But all but the first should be removed because 
only the first merge is carried out (thus removed from mergeSegments after the 
merge is done).

How do you make cascading live entirely in IndexWriter? Just removing cascading 
from CMPW.MergeThread has one drawback. For example, segment sizes of an index 
are: 40, 20, 10, buffer size is 10 and merge factor is 2. A buffer full flush 
of 10 will trigger merge of 10 & 10, then cascade to 20 & 20, then cascade to 
40 & 40. CMPW without cascading will stop after 10 & 10 since 
IndexWriter.maybeMerge has already returned. Then we have to wait for the next 
flush to merge 20 & 20.

> How would this be used?  Ie, how would one make an IndexWriter that
> uses the ConcurrentMerger?  Would we add expert methods
> IndexWriter.set/getIndexMerger(...)?  (And presumably the mergePolicy
> is now owned by IndexMerger so it would have the
> set/getMergePolicy(...)?)
> 
> Also, how would you separate what remains in IW vs what would be in
> IndexMerger?

Maybe Merger does and only does merge (so IndexWriter still owns MergePolicy)? 
Say, base class Merger.merge simply calls IndexWriter.merge. 
ConcurrentMerger.merge creates a merge thread if possible. Otherwise it calls 
super.merge, which does non-concurrent merge. IndexWriter simply calls its 
merger's merge instead of its own merge. Everything else remains in IndexWriter.


1
> Hmm ... you're right.  This is a separate issue from merge policy,
> right?  Are you proposing buffering deletes in DocumentsWriter
> instead?

Yes, this is a separate issue. And yes if we consider DocumentsWriter as 
staging area.

2
> Good catch!  How to fix?  One thing we could do is always use
> SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
> level.  This way using the default 'equals' (same instance) would
> work.  Or we could establish identity (equals) of a SegmentInfo as
> checking if the directory plus segment name are equal?  I think I'd
> lean to the 2nd option

I think the 2nd option is better.

3
> Hmmm yes.  In fact I think we can remove synchronized from optimize
> altogether since within it we are synchronizing(this) at the right
> places?  If more than one thread calls optimize at once, externally,
> it is actually OK: they will each pick a merge that's viable
> (concurrently) and will run the merge, else return once there is no
> more concurrency left.  I'll add a unit test that confirms this.

That seems to be the case. The fact that "the same merge spec will be returned 
without changes to segmentInfos" reminds me: MergePolicy.findCandidateMerges 
finds merges which may not be eligible; but CMPW checks for eligibility when 
looking for candidate merges. Maybe we should unify the behaviour? BTW, 
MergePolicy.optimize (a rename?) doesn't check for eligibility either.

4
> Well, useCompoundFile(...) is given a single newly flushed segment and
> should decide whether it should be CFS.  Whereas
> useCompoundDocStore(...) is called when doc stores are flushed.  When
> autoCommit=false, segments can share a single set of doc stores, so
> there's no single SegmentInfo to pass down.

The reason I asked is because none of them are used right now. So they might be 
used in the future?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523798
 ] 

Michael McCandless commented on LUCENE-847:
---

Thanks for the detailed review Ning!

> 1 As you pointed out, "the merge policy is no longer responsible for
> running the merges itself". MergePolicy.maybeMerge simply returns a
> merge specification. But ConcurrentMergePolicyWrapper.maybeMerge
> actually starts concurrent merge threads thus doing the merges.

True, but I was thinking CMPW could be an exception to this rule.  I
guess I would change the rule to "simple merge policies don't have to
run their own merges"?

However I do agree, CMPW is clearly a different beast from a typical
merge policy because it entails scheduling, not selection, of merges.

> 2 Related to 1, cascading is done in IndexWriter in non-concurrent
> case. But in concurrent case, cascading is also done in merge
> threads which are started by
> ConcurrentMergePolicyWrapper.maybeMerge.

Good point...  I think I could refactor this so that cascading logic
lives entirely in one place IndexWriter.

> MergePolicy.maybeMerge should continue to simply return a merge
> specification. (BTW, should we rename this maybeMerge to, say,
> findCandidateMerges?)

Good!  I like findCandidateMerges better; I'll change it.

> Can we carve the merge process out of IndexWriter into a Merger?
> IndexWriter still provides the building blocks - merge(OneMerge),
> mergeInit(OneMerge), etc. Merger uses these building blocks. A
> ConcurrentMerger extends Merger but starts concurrent merge threads
> as ConcurrentMergePolicyWrapper does.

How would this be used?  Ie, how would one make an IndexWriter that
uses the ConcurrentMerger?  Would we add expert methods
IndexWriter.set/getIndexMerger(...)?  (And presumably the mergePolicy
is now owned by IndexMerger so it would have the
set/getMergePolicy(...)?)

Also, how would you separate what remains in IW vs what would be in
IndexMerger?

I like this approach in principle; I'm just trying to hash out the
details...

> 1 UpdateDocument's and deleteDocument's bufferDeleteTerm are
> synchronized on different variables in this patch.

Woops, good catch!  I'll fix.

> However, the semantics of updateDocument changed since
> LUCENE-843. Before LUCENE-843, updateDocument, which is a delete and
> an insert, guaranteed the delete and the insert are committed
> together (thus an update). Now it's possible that they are committed
> in different transactions. If we consider DocumentsWriter as the RAM
> staging area for IndexWriter, then deletes are also buffered in RAM
> staging area and we can restore our previous semantics, right?

Hmm ... you're right.  This is a separate issue from merge policy,
right?  Are you proposing buffering deletes in DocumentsWriter
instead?

> 2 OneMerge.segments seems to rely on its segment infos' reference to
> segment infos of IndexWriter.segmentInfos. The use in commitMerge,
> which calls ensureContiguousMerge, is an example. However,
> segmentInfos can be a cloned copy because of exceptions, thus the
> reference broken.

Good catch!  How to fix?  One thing we could do is always use
SegmentInfo.reset(...) and never swap in clones at the SegmentInfo
level.  This way using the default 'equals' (same instance) would
work.  Or we could establish identity (equals) of a SegmentInfo as
checking if the directory plus segment name are equal?  I think I'd
lean to the 2nd option

> 3 Calling optimize of an IndexWriter with the current
> ConcurrentMergePolicyWrapper may cause deadlock: the one merge spec
> returned by MergePolicy.optimize may be in conflict with a
> concurrent merge (the same merge spec will be returned without
> changes to segmentInfos), but a concurrent merge cannot finish
> because optimize is holding the lock.

Hmmm yes.  In fact I think we can remove synchronized from optimize
altogether since within it we are synchronizing(this) at the right
places?  If more than one thread calls optimize at once, externally,
it is actually OK: they will each pick a merge that's viable
(concurrently) and will run the merge, else return once there is no
more concurrency left.  I'll add a unit test that confirms this.

> 4 Finally, a couple of minor things:
>
>   1 LogMergePolicy.useCompoundFile(SegmentInfos infos, SegmentInfo
> info) and useCompoundDocStore(SegmentInfos infos): why the
> parameters?

Well, useCompoundFile(...) is given a single newly flushed segment and
should decide whether it should be CFS.  Whereas
useCompoundDocStore(...) is called when doc stores are flushed.  When
autoCommit=false, segments can share a single set of doc stores, so
there's no single SegmentInfo to pass down.

> 2 Do we need doMergeClose in IndexWriter? Can we simply close a
>   MergePolicy if not null?

Good point.  I think this is reasonable -- I'll fix.


> Factor merge policy out of IndexWriter
> 

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-29 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523621
 ] 

Ning Li commented on LUCENE-847:


I include comments for both LUCENE-847 and LUCENE-870 here since they are 
closely related.

I like the stateless approach used for refactoring merge policy. But modeling 
concurrent merge (ConcurrentMergePolicyWrapper) as a MergePolicy seems to be 
inconsistent with the MergePolicy interface:
  1 As you pointed out, "the merge policy is no longer responsible for running 
the merges itself". MergePolicy.maybeMerge simply returns a merge 
specification. But ConcurrentMergePolicyWrapper.maybeMerge actually starts 
concurrent merge threads thus doing the merges.
  2 Related to 1, cascading is done in IndexWriter in non-concurrent case. But 
in concurrent case, cascading is also done in merge threads which are started 
by ConcurrentMergePolicyWrapper.maybeMerge.

MergePolicy.maybeMerge should continue to simply return a merge specification. 
(BTW, should we rename this maybeMerge to, say, findCandidateMerges?) Can we 
carve the merge process out of IndexWriter into a Merger? IndexWriter still 
provides the building blocks - merge(OneMerge), mergeInit(OneMerge), etc. 
Merger uses these building blocks. A ConcurrentMerger extends Merger but starts 
concurrent merge threads as ConcurrentMergePolicyWrapper does.


Other comments:
1 UpdateDocument's and deleteDocument's bufferDeleteTerm are synchronized on 
different variables in this patch. However, the semantics of updateDocument 
changed since LUCENE-843. Before LUCENE-843, updateDocument, which is a delete 
and an insert, guaranteed the delete and the insert are committed together 
(thus an update). Now it's possible that they are committed in different 
transactions. If we consider DocumentsWriter as the RAM staging area for 
IndexWriter, then deletes are also buffered in RAM staging area and we can 
restore our previous semantics, right?

2 OneMerge.segments seems to rely on its segment infos' reference to segment 
infos of IndexWriter.segmentInfos. The use in commitMerge, which calls 
ensureContiguousMerge, is an example. However, segmentInfos can be a cloned 
copy because of exceptions, thus the reference broken.

3 Calling optimize of an IndexWriter with the current 
ConcurrentMergePolicyWrapper may cause deadlock: the one merge spec returned by 
MergePolicy.optimize may be in conflict with a concurrent merge (the same merge 
spec will be returned without changes to segmentInfos), but a concurrent merge 
cannot finish because optimize is holding the lock.

4 Finally, a couple of minor things:
  1 LogMergePolicy.useCompoundFile(SegmentInfos infos, SegmentInfo info) and 
useCompoundDocStore(SegmentInfos infos): why the parameters?
  2 Do we need doMergeClose in IndexWriter? Can we simply close a MergePolicy 
if not null?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Fix For: 2.3
>
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.take3.patch, LUCENE-847.take4.patch, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-18 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520881
 ] 

Steven Parkes commented on LUCENE-847:
--

my feeling is we should not deprecate
setUseCompoundFile, setMergeFactor, setMaxMergeDocs

I understood that you didn't want to deprecate them in IndexWriter. I wasn't 
sure that you meant that they should be added to the MergePolicy interface? If 
you do, everything makes sense. Otherwise, it sounds like there's still a cast 
in there and I'm not sure about that.

I think IndexWriter should enforce it?  Ie no merge policy should be
allowed to leave segments in other dirs (= at inconsistent index) at
point of commit.

I think it's just about code location: since a merge policy might want to 
factor into it's algorithm the directories used, it needs the info and it will 
presumably sometimes do it. Presumably you could provide code in 
MergePolicyBase so the merges could decide when but wouldn't have to write the 
copy loop. If you put the code in IndexWriter too, it sounds duplicated, again 
presuming sometimes a policy might want to do it itself. 

I like that idea :)  It fits well w/ the stateless API.  Ie, merge
policy returns all possible merges and "someone above" takes care of
scheduling them.

So it returns a vector of specs?

That's essentially what the CMP as an above/below wrapper does. I can see that 
above/below is strange enough to be less clever (I wasn't trying to be so much 
clever as backwards compatible) and more insane.

Sane is good.

Hmm.  This means each merge policy must know whether it's talking to
CMP or IndexWriter underneith?  With the stateless approach this
wouldn't happen.

Well, I wouldn't so much say it has to know. All it cares is what merge 
returns. Doesn't have to know who returned it or why.

The only real difference between this and the "generate a vector of merges" is 
that in the merge policy can take advantage immediately of merge results in the 
serial case where if you're generating a vector of merges, it can't know.

Of course, I guess in that case, if IndexWriter gets a vector of merges, it can 
always take the lowest and ignore the rest, calling the merge policy again 
incase it wants to request a different set. Then you only have the excess 
computation for merges you never really considered.

Oh I see...  that's kind of sneaky (planning on using exceptions to
abort a merge requested by the policy).

There's always going to be the chance of an exception to a merge. I'm pretty 
sure of that. But you're right, if the merge policy isn't in the control path, 
it would never see them. They'll be there, but it's out of the path.

But since you're already doing the work
to allow a merge to run in the BG without blocking adding of docs,
flushing, etc, wouldn't this come nearly for free?

I haven't looked at this.

Well, eg flush() now synchronizes on IndexWriter

Yeah, and making it not is less than straightforward. I've looked at his code a 
fair amount, experimented with different ideas, but hadn't gotten all the way 
to a working model.

You can look at locking segmentInfos but there are many places that 
segmentInfos is iterated over that would require locks if the lock on IW wasn't 
sufficient to guarantee that the iteration was safe.

I did look at that early on, so maybe my understanding was still too lacking 
and it's more feasible than I was thinking ...

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520874
 ] 

Michael McCandless commented on LUCENE-847:
---

> > I don't think so: I think if someone changes the merge policy to
> > something else, it's fine to require that they then do settings
> > directly through that merge policy.
>
> You're going to want to change the default merge policy, right?  So
> you're going to change the hard cast in IW to that policy? So it'll
> fail for anyone that wants to just getMergePolicy back to the old
> policy?

I don't really follow... my feeling is we should not deprecate
setUseCompoundFile, setMergeFactor, setMaxMergeDocs.

> > I think we shouldn't allow any mergePolicy to leave the index
> > inconsistent (failing to copy over segments from other
> > directories).
>
> That makes sense to me. CMP could enforce this, even in the case of
> concurrent merges.

I think IndexWriter should enforce it?  Ie no merge policy should be
allowed to leave segments in other dirs (= at inconsistent index) at
point of commit.

> Perhaps this is sufficient, but not necessary? I see it as simpler
> just to have the merge policy (abstractly) generate a set of
> non-conflicting merges and let someone else worry about scheduling
> them.

I like that idea :)  It fits well w/ the stateless API.  Ie, merge
policy returns all possible merges and "someone above" takes care of
scheduling them.

> > But, providing just a single concurrent merge already gains us
> > concurrency of merging with adding of docs.
>
> I'm worried about when you start the leftmost merge, that, say, is
> going to take a day. With a steady influx of docs, it's not going to
> be long before you need another merge and if you have only one
> thread, you're going to block for the rest of the day. You've bought
> a little concurrency, but it's the almost day-long block I really
> want to avoid.

Ahh ... very good point.  I agree.

> With a log-like policy, I think it's feasible to have logN
> threads. You might not want them all doing disk i/o at the same
> time: you'd want to prioritize threads on the small merges and/or
> suspend large merge threads.  The speed with which the larger merge
> threads can vary when other merges are taking place, you just have
> to not stop them and start over.

Agreed: CMP should do this.

> > Right, the LUCENE-845 merge policy doesn't look @ the return
> > result of "merge".  It just looks at the newly created
> > SegmentInfos.
>
> Yeah. My thinking was this would be tweaked. If merger.merge returns
> a valid number of docs, it could recurse as it does. If merger.merge
> returned -1 (which CMP does), it would not recurse but simply
> continue the loop.

Hmm.  This means each merge policy must know whether it's talking to
CMP or IndexWriter underneith?  With the stateless approach this
wouldn't happen.

> > H, in fact, I think your CMP wrapper would not work with the
> > merge policy in LUCENE-845, right?  Ie, won't it will just recurse
> > forever?  So actually I don't see how your CMP (using the current
> > API) can in general safely "wrap" around a merge policy w/o
> > breaking things?
>
> I think it's safe, just not concurrent. The recursion would generate
> the same set of segments to merge and CMP would make the second call
> block (abstractly, anyway: it actually throws an exception that
> unwinds the stack and causes the call to start again from the top
> when the conflicting merge finishes).

Oh I see...  that's kind of sneaky (planning on using exceptions to
abort a merge requested by the policy).  I think the stateless
approach would be cleaner here.

> > But, if you lock on IndexWriter, what about apps that use multiple
> > threads to add documents and but don't use CMP?  When one thread
> > gets tied up merging, you'll then block on the other synchronized
> > methods?  And you also can't flush from other threads either?  I
> > think flushing a new segment should be allowed to run concurrently
> > with the merge?
>
> I'm not sure I'm following this. That's what happens now, right? Are
> you trying to get more concurrency then there is now w/o using CMP?
> I certainly haven't been trying to do that.

True, this is something new.  But since you're already doing the work
to allow a merge to run in the BG without blocking adding of docs,
flushing, etc, wouldn't this come nearly for free?  Actually I think
all that's necessary, regardless of sync'ing on IndexWriter or
SegmentInfos is to move the "if (triggerMerge)" out of the
synchronized method/block.

> > I guess I don't see the reason to synchronize on IndexWriter
> > instead of segmentInfos.
>
> I looked at trying to make IW work when a synchronization of IW
> didn't imply a synchronization of segmentInfos. It's a very, very
> heavily used little data structure. I found it very hard to convince
> myself I could catch all the places locks would be requ

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-16 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520408
 ] 

Steven Parkes commented on LUCENE-847:
--

I don't think so: I think if someone changes the merge policy to
something else, it's fine to require that they then do settings
directly through that merge policy.

You're going to want to change the default merge policy, right?  So you're 
going to change the hard cast in IW to that policy? So it'll fail for anyone 
that wants to just getMergePolicy back to the old policy?

If that's the case, I'm going to keep those tests the way they are because when 
you do change the policy, I'm assuming you'll keep many of them, just add the 
manual setMergePolicy(), and they'll need to have those casts put back in?

Maybe we just put it in MergePolicy interface and let them throw (e.g., via 
MergePolicyBase) if called on an unsupported merge policy? That's moving from 
compile time checking to run time checking, but ... 

This is inside addIndexes that we're talking about.

Ah. Right.

I think we shouldn't allow any mergePolicy to
leave the index inconsistent (failing to copy over segments from other
directories).

That makes sense to me. CMP could enforce this, even in the case of concurrent 
merges.

No, after another N (= mergeFactor) flushes, it would return a new
suggested merge.

Okay. I think I'm following you here.

Here's what I understand: in your model, (1) each call to merge will only ever 
generate one merge thread (regardless of how many levels might be full) and (2) 
you can get concurrency out of this as long as you consider a level "merge 
worthy" as different from "full", i.e., blocking).

You did say  

> > But, we
> > could relax that such that if ever the lowest level has >
> > 2*mergeFactor pending segments to merge then we select the 2nd
> > set.

And I think you'd want to modify that to select the lowest sufficiently over 
subscribed level, not just the lowest level if it's oversubscribed?

Perhaps this is sufficient, but not necessary? I see it as simpler just to have 
the merge policy (abstractly) generate a set of non-conflicting merges and let 
someone else worry about scheduling them.

But, providing just a single concurrent merge already gains us
concurrency of merging with adding of docs.

I'm worried about when you start the leftmost merge, that, say, is going to 
take a day. With a steady influx of docs, it's not going to be long before you 
need another merge and if you have only one thread, you're going to block for 
the rest of the day. You've bought a little concurrency, but it's the almost 
day-long block I really want to avoid.

With a log-like policy, I think it's feasible to have logN threads. You might 
not want them all doing disk i/o at the same time: you'd want to prioritize 
threads on the small merges and/or suspend large merge threads.  The speed with 
which the larger merge threads can vary when other merges are taking place, you 
just have to not stop them and start over. 

Right, the LUCENE-845 merge policy doesn't look @ the return result of
"merge".  It just looks at the newly created SegmentInfos.

Yeah. My thinking was this would be tweaked. If merger.merge returns a valid 
number of docs, it could recurse as it does. If merger.merge returned -1 (which 
CMP does), it would not recurse but simply continue the loop.

H, in fact, I think your CMP wrapper would not work with the merge
policy in LUCENE-845, right?  Ie, won't it will just recurse forever?
So actually I don't see how your CMP (using the current API) can in
general safely "wrap" around a merge policy w/o breaking things?

I think it's safe, just not concurrent. The recursion would generate the same 
set of segments to merge and CMP would make the second call block (abstractly, 
anyway: it actually throws an exception that unwinds the stack and causes the 
call to start again from the top when the conflicting merge finishes).

But, if you lock on IndexWriter, what about apps that use multiple
threads to add documents and but don't use CMP?  When one thread gets
tied up merging, you'll then block on the other synchronized methods?
And you also can't flush from other threads either?  I think flushing
a new segment should be allowed to run concurrently with the merge?

I'm not sure I'm following this. That's what happens now, right? Are you trying 
to get more concurrency then there is now w/o using CMP? I certainly haven't 
been trying to do that.

I guess I don't
see the reason to synchronize on IndexWriter instead of segmentInfos.

I looked at trying to make IW work when a synchronization of IW didn't imply a 
synchronization of segmentInfos. It's a very, very heavily used little data 
st

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520374
 ] 

Michael McCandless commented on LUCENE-847:
---

> Note that merge() was added not for users (which I have no strong
> opinion about) but so that, potentially, CMP can check again for
> merges when a set of merge threads completes, i.e., cascade.

OK, got it.  In fact, then it seems more important that we NOT flush
at this point?

> > I think instead we should leave the methods, not deprecated, as
> > convenience (sugar) methods. Simple things should be simple;
> > complex things should be possible.
>
> I think this argues for a LegacyMergePolicy interface again, then?
> If we change the default merge policy and someone changes their code
> to use LogDoc for their own purposes, in both cases the
> getters/setters should work? So cast to the interface and as long as
> the merge policy supports this, the getters/setters work (unless the
> merge policy decides to throw within), otherwise the getters/setters
> throw?

I don't think so: I think if someone changes the merge policy to
something else, it's fine to require that they then do settings
directly through that merge policy.  I don't think we should bring
back the LegacyMergePolicy interface.

> > Uh, no: when someone calls optimize that means it really must be
> > done, right? So "optimize" is the right name I think.
> 
> Yeah, but it might do nothing. Just as merge might do nothing.

Well... that's the exception not the rule.  My vote would be for
"maybeMerge(...)"  and "optimize(..)".

> > Though, this raises the tricky question of index consistency ...
> 
> Definitely. I'm still trying to understand all the subtleties here.
>
> > IndexWriter commits the new segments file right after
> > mergePolicy.merge returns ... so for CMP we suddenly have an
> > unusable index (as seen by an IndexReader).
>
> How so? I figured that after mergePolicy.merge returns, in the case
> of CMP with an ongoing merge, segmentInfos won't have changed at
> all. Is that a problem?

This is inside addIndexes that we're talking about.  It will have
changed because the added indexes were stuck into the segmentInfos.
If you commit that segmentInfos, which now references segments in
other directories, the index is inconsistent, until the merge policy
finishes its work (including copying over segments from other dirs).
In fact this used to be an issue but was fixed in LUCENE-702.

> > Maybe it's too ambitious to allow merges of segments from other
> > directories to run concurrently?
> 
> Yeah, that might be the case. At least as a default?

I think it's worse: I think we shouldn't allow any mergePolicy to
leave the index inconsistent (failing to copy over segments from other
directories).  I think it's a bug if the mergePolicy does that and we
should check & raise an exception, and not commit the new segments
file.   IndexWriter should in general protect itself from a mergePolicy
that makes the index inconsistent (and, refuse to commit the resulting
segments file).

With the proposed "stateless API" we would keep calling the
mergePolicy, after each merge, until it returned null, and then do the
check that index is consistent.

> > I would consider it a hard error in IndexWriter if after calling
> > mergePolicy.merge from any of the addIndexes*, there remain
> > segments in other directories. I think we should catch this &
> > throw an exception?
>
> It would be easy enough for CMP to block in this case, rather than
> returning immediately. Wouldn't that be better? And I suppose it's
> possible to imagine an API on CMP for specifying this behavior?

I think CMP should indeed block in this case.  I wouldn't add an API
to change it.  It's too dangerous to allow an index to become
inconsistent.

> > I opened a separate issue LUCENE-982 to track this.
>
> I think this is good. I think it's an interesting issue but not
> directly related to the refactor?

I think it is related: we should not preclude it in this refactoring.
I think we should fix MergePolicy.optimize to take "int
maxNumSegments"?

> > Although ... do you think we need need some way for merge policy
> > to state where the new segment should be inserted into
> > SegmentInfos?
>
> Right now I assumed it would replace the left most-segment.
>
> Since I don't really know the details of what such a merge policy
> would like, I don't really know what it needs.
>
> If you've thought about this more, do you have a suggestion? I
> suppose we could just add an int. But, then again, I'd do that as a
> separate function, leaving the original available, so we can do this
> later, completely compatibly?

I don't have a suggestion :)  And I agree, this is safely postponed while
keeping future backwards compatibility, so, punt!

> > How can the user close IndexWriter and abort the running merges?  I
> > guess CMP would provide a method to abort any runni

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-16 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520293
 ] 

Steven Parkes commented on LUCENE-847:
--

One new small item: you've added a "public void merge()" to
IndexWriter so that people can externally kick off a merge request,
which is good I think.

But, is it really necessary to flush here?  It would be better to not
flush so that users then have two separate methods (flush() and
merge()) to do each function independently.

That makes sense.

Note that merge() was added not for users (which I have no strong opinion 
about) but so that, potentially, CMP can check again for merges when a set of 
merge threads completes, i.e., cascade.

I think instead we should leave the methods, not deprecated, as
convenience (sugar) methods.  Simple things should be simple; complex
things should be possible.

I think this argues for a LegacyMergePolicy interface again, then? If we change 
the default merge policy and someone changes their code to use LogDoc for their 
own purposes, in both cases the getters/setters should work? So cast to the 
interface and as long as the merge policy supports this, the getters/setters 
work (unless the merge policy decides to throw within), otherwise the 
getters/setters throw? 

Uh, no: when someone calls optimize that means it really must be done,
right?  So "optimize" is the right name I think.

Yeah, but it might do nothing. Just as merge might do nothing.

Can you factor this out, eg add a private method
"getLogDocMergePolicy(String reason)"

Sure.

Looks good, thanks.  Can you add javadocs (w/ params) for both of
these new methods?

Sure.

Though, this raises the tricky question of index
consistency ...

Definitely. I'm still trying to understand all the subtleties here.

IndexWriter commits the new segments file right after
mergePolicy.merge returns ... so for CMP we suddenly have an unusable
index (as seen by an IndexReader).

How so? I figured that after mergePolicy.merge returns, in the case of CMP with 
an ongoing merge, segmentInfos won't have changed at all. Is that a problem?

I thought the issue would be on the other end, where the concurrent merge 
finishes and needs to update segmentInfos.

Maybe it's too ambitious to allow merges of segments from other
directories to run concurrently?

Yeah, that might be the case. At least as a default?

I would consider it a hard error in IndexWriter if after calling
mergePolicy.merge from any of the addIndexes*, there remain segments
in other directories.  I think we should catch this & throw an
exception?

It would be easy enough for CMP to block in this case, rather than returning 
immediately. Wouldn't that be better? And I suppose it's possible to imagine an 
API on CMP for specifying this behavior?

I opened a separate issue LUCENE-982 to track this.

I think this is good. I think it's an interesting issue but not directly 
related to the refactor?

Although ... do you think we need need some way for merge policy to
state where the new segment should be inserted into SegmentInfos?

Right now I assumed it would replace the left most-segment.

Since I don't really know the details of what such a merge policy would like, I 
don't really know what it needs.

If you've thought about this more, do you have a suggestion? I suppose we could 
just add an int. But, then again, I'd do that as a separate function, leaving 
the original available, so we can do this later, completely compatibly?

Hmmm, does CMP block on close while it joins to any running merge
threads?

Yeah, at least in my sandbox.

How can the user close IndexWriter and abort the running
merges?  I guess CMP would provide a method to abort any running
merges, and user would first call that before calling
IndexWriter.close?

I hadn't really thought about this but I can see that should be made possible. 
It's always safe to abandon a merge so it should be available, for fast, safe, 
and clean shutdown.

True, LogDoc as it now stands would never exploit concurrency (it will
always return the highest level that needs merging).  But, we could
relax that such that if ever the lowest level has > 2*mergeFactor
pending segments to merge then we select the 2nd set.

Okay. But it will always return that? Still doesn't sound concurrent?

The thing is, the serial merge policy has no concept of concurrent merges, so 
if the API is always to select the best merge, until a pervious merge finishes, 
it will always return that as the best merge.

Concurrent is going to require, by hook or by crook, that a merge policy be 
able to generate a set of non-confli

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520260
 ] 

Michael McCandless commented on LUCENE-847:
---

One new small item: you've added a "public void merge()" to
IndexWriter so that people can externally kick off a merge request,
which is good I think.

But, is it really necessary to flush here?  It would be better to not
flush so that users then have two separate methods (flush() and
merge()) to do each function independently.

> > Are you going to fix all unit tests that call the now-deprecated
> > APIs?
> 
> Yeah. Thanks for the reminder.

On thinking about this more ... and on seeing all the diffs ... I no
longer feel we should be deprecating "get/setUseCompoundFile()" nor
"get/setMergeFactor()" nor "get/setMaxMergeDocs()" in IndexWriter.

The vast majoriy of Lucene users will not make their own merge policy
(just use the default merge policy) and so I don't think we should be
complicating their lives with having to now write lines like this when
they want to change settings:

   
((LogDocMergePolicy)writer.getMergePolicy()).setUseCompoundFile(useCompoundFile);

Also, this ties our hands if ever we want to change the default merge
policy (which, under LUCENE-845, I'd like to do).

I think instead we should leave the methods, not deprecated, as
convenience (sugar) methods.  Simple things should be simple; complex
things should be possible.  Sorry I didn't think of this before you
made the new patch Steve :(

> > I'm not wed to "maybeMerge()" but I really don't like "merge" :)
> > It's overloaded now.
> > 
> > EG IndexMerger interface has a method called "merge" that is named
> > correctly because it will specifically go a do the requested
> > merge.  So does IndexWriter.
> >
> > Then, you have other [overloaded] methods in LogDocMergePolicy
> > called "merge" that are named appropriately (they will do a
> > specific merge).
> > 
> > How about "checkForMerges()"?
>
> I don't find it ambiguous based on class and argument type. It's all
> personal, of course.
> 
> I'd definitely prefer maybe over checkFor because that sounds like a
> predicate.

OK let's settle on "maybeMerge"?

>- Does this mean optimize -> maybeOptimize, too?

Uh, no: when someone calls optimize that means it really must be done,
right?  So "optimize" is the right name I think.

> * Make LDMP casts not throw bad cast 

Can you factor this out, eg add a private method
"getLogDocMergePolicy(String reason)" that would be the one place that
does the class casting & throwing an error message from one single
source line?  Right now the message is copied in multiple places and,
it's wrong for mergeFactor (was copied from useCompoundFile).

> * Get rid of releaseMergePolicy and add doClose parameter on set

Looks good, thanks.  Can you add javadocs (w/ params) for both of
these new methods?

> As to the IndexWriter vs. IndexMerger issue, I still think having
> the interface is useful if not only that it makes my testing much
> easier. I have a MockIndexMerger that implements only the functions
> in the interface and therefore I can test merge policies without
> creating a writer. For me this has been a big win ...

Subclassing IndexWriter to make MockIndexMerger would also work for
testing?  This is what MockRAMDirectory does for example...

> > Exactly: these settings decide when a segment is flushed, so, why
> > put them into IndexMerger interface? They shouldn't have anything
> > to with merging; I think they should be removed.
> > 
> > For LUCENE-845 I'm working on a replacement for LogDocMergePolicy
> > that does not use maxBufferedDocs.

> I can see that one could write a merge policy that didn't have any
> idea of how the initial buffering was done, but I worry about
> precluding it. Maybe the LUCENE-845 patch will show a strong enough
> pattern to believe no merge policies will need it?

We wouldn't be precluding it (people can still get it from their
IndexWriter).  This is one of the big reasons that I don't like making
an interface out of IndexMerger: here we are having to pick & choose
which settings from IndexWriter a merge policy is "allowed" to use.  I
don't think that's necessary (we are just making extra work for
ourselves) and inevitably we won't get it right...

> > I think factoring into a base class is an OK solution, but, it
> > shouldn't be MergePolicy's job to remember to call this final
> > "move any segments in the wrong directory over" code. As long as
> > its in one place and people don't have to copy/paste code
> > between MergePolicy sources.
> 
> In the case of concurrent merges, I think this gets more
> complicated. When do you do those directory copies? I think you
> can't do them at the return from the merge policy because the merge
> policy may want to do them, but later.
>
> I don't think IndexWriter has enough information to know when the
> copies need to done. Doubly so if

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-14 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12519793
 ] 

Steven Parkes commented on LUCENE-847:
--

It just occurred to me that there is a neat way to handle deletes that
are flushed during a concurrent merge. For example, MergePolicy
decides to merge segments B and C, with B's delete file 0001 and
C's 100. When the concurrent merge finishes, B's delete file becomes
0011 and C's 110. We do a simple computation on the delete bit
vectors and check in the merged segment with delete file 00110.

Well, that makes my life much easier. Now I don't have to figure out what to 
do, just have to make it so ...

Thanks!

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-14 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12519790
 ] 

Steven Parkes commented on LUCENE-847:
--

Are you going to fix all unit tests that call the now-deprecated APIs?

Yeah. Thanks for the reminder.

As to the IndexWriter vs. IndexMerger issue, I still think having the interface 
is useful if not only that it makes my testing much easier. I have a 
MockIndexMerger that implements only the functions in the interface and 
therefore I can test merge policies without creating a writer. For me this has 
been a big win ...

Exactly: these settings decide when a segment is flushed, so, why put
them into IndexMerger interface?  They shouldn't have anything to with
merging; I think they should be removed.

For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that
does not use maxBufferedDocs.

I can see that one could write a merge policy that didn't have any idea of how 
the initial buffering was done, but I worry about precluding it. Maybe the 
LUCENE-845 patch will show a strong enough pattern to believe no merge policies 
will need it?

I think factoring into a base class is an OK solution, but, it
shouldn't be MergePolicy's job to remember to call this final "move
any segments in the wrong directory over" code.  As long as its in one
place and people don't have to copy/paste code between MergePolicy
sources.

In the case of concurrent merges, I think this gets more complicated. When do 
you do those directory copies? I think you can't do them at the return from the 
merge policy because the merge policy may want to do them, but later.

I don't think IndexWriter has enough information to know when the copies need 
to done. Doubly so if we have concurrent merges?

I still stand by it should be the merge policy making the choice. You could 
have the code in IndexWriter too, but then there'd be duplicate code. To put 
the code only in IndexWriter removes the choice from the merge policy.

I think there
should in fact be a default optimize() in the base class that does
what current IndexWriter now does so that a MergePolicy need not
implement optimize at all.

It'd be nice, but I don't know how to do it: the merge factor is not generic, 
so I don't know how to implement the loop generically.

Ah ... I see: with your forced merge ... hmmm.

No, forced would mean the merge policy must do a merge; whereas,
normally, it's free not to do a merge until it wants to.

H ...

I think adding a forced merge concept here is new ... If it's simply to support 
optimize, I'm not sure I find it too compelling. LogDoc as it stands uses 
different algorithms for incremental merges and optimize, so there's not too 
much of a concept of forced merges vs. optional merges to be factored out. So I 
guess I'm not seeing a strong compelling case for creating it?

Well, it's sort of awkward if you want to vary that max # segments.
Say during the day you optimize down to 15 segments every time you
update the index, but then at night you want to optimize down to 5.
If we don't add method to IndexWriter you then must have instance var
on your MergePolicy that you set, then you call optimize.  It's not
clean since really it should be a parameter.

Well, I don't know if I buy the argument that it should be a parameter. The 
merge policy has lots of state like docs/seg. I don't really see why 
segs/optimize is different.

My main reason for not wanting put this into IndexWriter is then every merge 
policy must support it.

Wait: there is a barrier, right?  In IndexWriter.replace we don't do
the right thing with non-contiguous merges?

Yeah, I meant that I'm not aware of any barriers except fixing 
IndexWriter#replace, in other words, I'm not aware of any other places where 
non-contiguity would cause a failure.

I'm not wed to "maybeMerge()" but I really don't like "merge" :)  It's
overloaded now.

EG IndexMerger interface has a method called "merge" that is named
correctly because it will specifically go a do the requested merge.
So does IndexWriter.

Then, you have other [overloaded] methods in LogDocMergePolicy called
"merge" that are named appropriately (they will do a specific merge).

How about "checkForMerges()"?

I don't find it ambiguous based on class and argument type. It's all personal, 
of course.

I'd definitely prefer maybe over checkFor because that sounds like a predicate.

Maybe we could add a "setMergePolicy(MergePolicy policy, boolean
doClose)" and default doClose to true?

That sounds good.

Finally: does MergePolicy really need a close()?

I think so. The concurrent merge policy maintains all sort

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518520
 ] 

Ning Li commented on LUCENE-847:


> Furthermore, I think this is all contained within IndexWriter, right?
> Ie when we go to "replace/checkin" the newly merged segment, this
> "merge newly flushed deletes" would execute at that time. And, I
> think, we would block flushes while this is happening, but
> addDocument/deleteDocument/updateDocument would still be allowed?

Yes and yes. :-)

> Couldn't we also just update the docIDs of pending deletes, and not
> flush? Ie we know the mapping of old -> new docID caused by the
> merge, so we can run through all deleted docIDs and remap? 

Hmm, I was worried quite a number of delete docIDs could be buffered,
but I guess it's still better than having to do a flush. So yes, this is better!

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518508
 ] 

Michael McCandless commented on LUCENE-847:
---

> It just occurred to me that there is a neat way to handle deletes
> that are flushed during a concurrent merge. For example, MergePolicy
> decides to merge segments B and C, with B's delete file 0001 and C's
> 100. When the concurrent merge finishes, B's delete file becomes
> 0011 and C's 110. We do a simple computation on the delete bit
> vectors and check in the merged segment with delete file 00110

Excellent!  This lets you efficiently merge in the additional deletes
(if any) that were flushed against each of the merged segments after
the merge had begun.  Furthermore, I think this is all contained
within IndexWriter, right?

Ie when we go to "replace/checkin" the newly merged segment, this
"merge newly flushed deletes" would execute at that time.  And, I
think, we would block flushes while this is happening, but
addDocument/deleteDocument/updateDocument would still be allowed?

It should in fact be quite fast to run since delete BitVectors is all
in RAM.

> I'm thinking about the impact of adding "deleteDocument(int doc)" on
> LUCENE-847, especially on concurrent merge. The semantics of
> "deleteDocument(int doc)" is that the document to delete is
> specified by the document id on the index at the time of the
> call. When a merge is finished and the result is being checked into
> IndexWriter's SegmentInfos, document ids may change. Therefore, it
> may be necessary to flush buffered delete doc ids (thus buffered
> docs and delete terms as well) before a merge result is checked in.
>
> The flush is not necessary if there is no buffered delete doc ids. I
> don't think it should be the reason not to support
> "deleteDocument(int doc)" in IndexWriter. But its impact on
> concurrent merge is a concern.

Couldn't we also just update the docIDs of pending deletes, and not
flush?  Ie we know the mapping of old -> new docID caused by the
merge, so we can run through all deleted docIDs and remap?


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518486
 ] 

Ning Li commented on LUCENE-847:


The following comments are about the impact on merge if we add
"deleteDocument(int doc)" (and deprecate IndexModifier). Since it
concerns the topic in this issue, I also post it here to get your opinions.

I'm thinking about the impact of adding "deleteDocument(int doc)" on
LUCENE-847, especially on concurrent merge. The semantics of
"deleteDocument(int doc)" is that the document to delete is specified
by the document id on the index at the time of the call. When a merge
is finished and the result is being checked into IndexWriter's
SegmentInfos, document ids may change. Therefore, it may be necessary
to flush buffered delete doc ids (thus buffered docs and delete terms
as well) before a merge result is checked in.

The flush is not necessary if there is no buffered delete doc ids. I
don't think it should be the reason not to support "deleteDocument(int
doc)" in IndexWriter. But its impact on concurrent merge is a concern.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518453
 ] 

Ning Li commented on LUCENE-847:


On 8/8/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
> Actually I was talking about my idea (to "simplify MergePolicy.merge
> API").  With the simplification (whereby MergePolicy.merge just
> returns the MergeSpecification instead of driving the merge itself) I
> believe it's simple to make a concurrency wrapper around any merge
> policy, and, have all necessary locking for SegmentInfos inside
> IndexWriter.

I agree with Mike. In fact, MergeSelector.select, which is the counterpart
of MergePolicy.merge in the patch I submitted for concurrent merge,
simply returns a MergeSpecification. It's simple and sufficient to have
all necessary lockings for SegmentInfos in one class, say IndexWriter.
For example, IndexWriter locks SegmentInfos when MergePolicy(MergeSelector)
picks a merge spec. Another example, when a merge is finished, say
IndexWriter.checkin is called which locks SegmentInfos and replaces
the source segment infos with the target segment info.


On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:
> The synchronization is still tricky, since parts of segmentInfos are
> getting changed at various times and there are references and/or
> copies of it other places. And as Ning pointed out to me, we also
> have to deal with buffered delete terms. I'd say I got about 80% of
>the way there on the last go around. I'm hoping to get all the way
> this time.

It just occurred to me that there is a neat way to handle deletes that
are flushed during a concurrent merge. For example, MergePolicy
decides to merge segments B and C, with B's delete file 0001 and
C's 100. When the concurrent merge finishes, B's delete file becomes
0011 and C's 110. We do a simple computation on the delete bit
vectors and check in the merged segment with delete file 00110.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li
On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:
>
> [ 
> https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210
>  ]
>
> Steven Parkes commented on LUCENE-847:
> --
>
> Is the separate IndexMerger interface really necessary?
>
> I wrestled with this, so in the past, it wouldn't have taken much to convince 
> me otherwise. The reason for the extra interface is I was hoping to 
> discourage or create a little extra friction for merge policies in terms of 
> looking too much into the internals of IndexWriter.
>
> As an example, it's not a good idea for merge policies to be able to access 
> IndexWriter#segmentInfos directly. (That's a case I would like to solve by 
> making segmentInfos private, but I haven't looked at that completely yet and 
> even beyond that case, I'd still like merge policies to have a very clean 
> interface with IWs.)
>
> It feels kinda weird for me to be arguing for constraints since I work mostly 
> in dynamic languages that have none of this stuff. But it reflects my goal 
> that merge policies simply be algorithms, not real workers.
>
> Moreover, I think it may be useful for implementing concurrency (see below).
>
> While LogDocMergePolicy does need "maxBufferedDocs", I think,
> instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
> through" to the LogDocMergePolicy if that is the merge policy in
> use (and LogDocMergePolicy should store its own private
> "minMergeDocs").
>
> The thing here is that the merge policy has nothing to do with max buffered 
> docs, right? The code for buffering docs for the first segment is wholly in 
> the IndexWriter. LogDocMergePolicy happens to need it (in its current 
> incarnation) in order to handle the way the log levels are computed. This 
> could, of course, be changed. There's nothing that says a merge policy has to 
> look at these values, just that they're available should the merge policy 
> want to look.
>
> I guess my idea was that the index writer was responsible for handling the 
> initial segment (with DocsWriter, if it wants) and also to provide an 
> indication to the merge policies how it was handling them.
>
> It's possible to have the merge policy influence the first segment size but 
> that opens up a lot of issues. Does every merge policy then have to know how 
> to trade between buffering by doc count and buffering by ram? I was hoping to 
> avoid that.
>
> I'm not all that happy with this the way it is, but supporting both by-doc 
> and by-ram is messy but seems necessary. This was my take on making it least 
> messy?
>
> In LogDocMergePolicy, it seems like the checking that's done for
> whether a SegmentInfo is in a different directory, as well as the
> subsequent copy to move it over to the IndexWriter's directory,
> should not live in the MergePolicy?
>
> I see two parts to this.
>
> First, I hesitate to make it not possible for merge policy to see the 
> directory information, i.e., remove IndexMerger#getDirectory(). Copying a 
> segment is one way to get it into the current directory, but merging with 
> other segments does to. A merge policy could decide to go ahead and merge a 
> bunch of segments that are in other directories rather than just copy them 
> individually. Taking away getDirectory() removes that option.
>
> As to how to handle the case where single segments are copied, I guess my 
> main reason for leaving that in the merge policy would be for simplicity. 
> Seems nicer to have all segment amalgamation management in one place, rather 
> than some in one class and some in another. Could be factored into an 
> optional base merge policy for derived classes to use as they might like.
>
> The "checkOptimize" method in LogDocMergePolicy seems like it
> belongs back in IndexWriter: I think we don't want every
> MergePolicy having to replicate that tricky while condition.
>
> The reason for not doing this was I can imagine different merge policies 
> having a different model of what optimize means. I can imagine some policies 
> that would not optimize everything down to a single segment but instead 
> obeyed a max segment size. But we could factor the default conditional into 
> an optional base class, as above.
>
> It is an ugly conditional that there might be better way to formulate, so 
> that policies don't have to look at the grody details like hasSeparateNorms. 
> But I'd still like the merge policies to be able to decide what optimize 
> means at a high level.
>
> Maybe we could change MergePolicy.merge to take a boolean "forced"
> which, if true, means that the MergePolicy must now pick a merge
> even if it didn't think it was time to.  This would let us move
> that tricky while condition logic back into IndexWriter.
>
> I didn't follow this. forced == optimize? If not, what does pick a merge 
> 

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518435
 ] 

Michael McCandless commented on LUCENE-847:
---

New feedback:

  * Are you going to fix all unit tests that call the now-deprecated
APIs?  (You should still leave a few tests using the deprecated
APIs to make sure they in fact continue to work, but most tests
should not use the deprecated APIs).

Responses to previous feedback:

> As an example, it's not a good idea for merge policies to be able to
> access IndexWriter#segmentInfos directly. (That's a case I would
> like to solve by making segmentInfos private, but I haven't looked
> at that completely yet and even beyond that case, I'd still like
> merge policies to have a very clean interface with IWs.)

Agreed, but the solution to that is to make segmentInfos private, not
to make a whole new interface.  Ie, this is an IndexWriter problem, so
we should fix it in IndexWriter.

> > While LogDocMergePolicy does need "maxBufferedDocs", I think,
> > instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
> > through" to the LogDocMergePolicy if that is the merge policy in
> > use (and LogDocMergePolicy should store its own private
> > "minMergeDocs").
>
> The thing here is that the merge policy has nothing to do with max
> buffered docs, right? The code for buffering docs for the first
> segment is wholly in the IndexWriter. LogDocMergePolicy happens to
> need it (in its current incarnation) in order to handle the way the
> log levels are computed. This could, of course, be changed. There's
> nothing that says a merge policy has to look at these values, just
> that they're available should the merge policy want to look.

Exactly: these settings decide when a segment is flushed, so, why put
them into IndexMerger interface?  They shouldn't have anything to with
merging; I think they should be removed.

For LUCENE-845 I'm working on a replacement for LogDocMergePolicy that
does not use maxBufferedDocs.

> I guess my idea was that the index writer was responsible for
> handling the initial segment (with DocsWriter, if it wants) and also
> to provide an indication to the merge policies how it was handling
> them.
>
> It's possible to have the merge policy influence the first segment
> size but that opens up a lot of issues. Does every merge policy then
> have to know how to trade between buffering by doc count and
> buffering by ram? I was hoping to avoid that.

Yeah, I don't think merge policy should dictate flushing either;
especially because app logic above IndexWriter can already easily call
flush() if necessary.

> > In LogDocMergePolicy, it seems like the checking that's done for
> > whether a SegmentInfo is in a different directory, as well as the
> > subsequent copy to move it over to the IndexWriter's directory,
> > should not live in the MergePolicy?
>
> I see two parts to this.
> 
> First, I hesitate to make it not possible for merge policy to see
> the directory information, i.e., remove
> IndexMerger#getDirectory(). Copying a segment is one way to get it
> into the current directory, but merging with other segments does
> to. A merge policy could decide to go ahead and merge a bunch of
> segments that are in other directories rather than just copy them
> individually. Taking away getDirectory() removes that option.

Agreed, a "sophisticated" merge policy would go and merge segments in
other directories.  But, it should not have to.

I'm not proposing making it "not possible"; I'm proposing the merge
policy is given IndexWriter at which point it can getDirectory() from
it.  (Ie the extra interface solely for this purpose is overkill).

> As to how to handle the case where single segments are copied, I
> guess my main reason for leaving that in the merge policy would be
> for simplicity. Seems nicer to have all segment amalgamation
> management in one place, rather than some in one class and some in
> another. Could be factored into an optional base merge policy for
> derived classes to use as they might like.

I don't see this as simpler: I see it as making the MergePolicy
writer's job more complex.  I also see it as substantial duplicated
code (I just had to copy/paste a bunch of code in working on my
MergePolicy in LUCENE-845).

I think factoring into a base class is an OK solution, but, it
shouldn't be MergePolicy's job to remember to call this final "move
any segments in the wrong directory over" code.  As long as its in one
place and people don't have to copy/paste code between MergePolicy
sources.

> > The "checkOptimize" method in LogDocMergePolicy seems like it
> > belongs back in IndexWriter: I think we don't want every
> > MergePolicy having to replicate that tricky while condition.
>
> The reason for not doing this was I can imagine different merge
> policies having a different model of what optimize means. I can
> imagine some polici

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518263
 ] 

Michael McCandless commented on LUCENE-847:
---

> I'm thinking maybe a MergePolicy#useCompoundDocStore( SegmentInfos )
> makes sense? The naive case would still just use the static setting
> we have now, but we could think about a better implementation:

I think adding that new method to MergePolicy is great!  And, just
using the same "useCompoundFile" as normal segmetns is good for
starters (and, this matches what's done today).


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518251
 ] 

Steven Parkes commented on LUCENE-847:
--

Ah. I understand better now. I have to admit, I haven't kept up to date on some 
of the deeper file stuff in LUCENE-843.

It seems to me there's quite a bit of difference between segment files and doc 
store files. Since doc store files can be referred to by multiple segments, 
they can get quite large. I don't have any trouble imaging that a merge policy 
might want to CFS 10MB segments but not 10GB doc stores?

I'm thinking maybe a MergePolicy#useCompoundDocStore( SegmentInfos ) makes 
sense? The naive case would still just use the static setting we have now, but 
we could think about a better implementation:

- Maybe never cfs doc store files? Is that an unreasonable default? It just 
strikes me that there should be far fewer of these so that we don't need to and 
on the other end, if they are big, CFS'ing them is going to be expensive.
- Do something smart but easy depending on number and size

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518237
 ] 

Michael McCandless commented on LUCENE-847:
---

> Looking at IW, with the new DocsWriter stuff, it looks like there
> isn't a segmentInfo object for the new segment at the time the
> predicate is being called. Would it be possible to make one?
> Something like DocumentsWriter#getDocStoreSegmentInfo() analogous to
> DocumentsWriter#getDocStoreSegment()? It could be an object just
> thrown away.

Hmmm: it looks like you're calling
"mergePolicy.useCompoundFile(segmentInfos,null)" twice: once inside
flushDocStores and immediately thereafter when flushDocStores returns
into the flush() code.  Maybe you should change flushDocStores to
return whether or not the flushed files are in compound format
instead?

Since flushDocStores() is flushing only the "doc store" index files
(stored fields & term vectors) it's not a real "segment" so it's a
somewhat forced fit to make a SegmentInfo in this case.  I guess we
could make a "largely empty" SegmentInfo and fill in what we can, but
that's somewhat dangerous (eg methods like files() would have to be
fixed to deal with this).

Maybe, instead, we could use one of the SegmentInfo instances from a
segment that refers to this doc store segment?  This would just mean
stepping through all SegmentInfo's and finding the first one (say)
whose docStoreSegment equals the one we are now flushing?  Still it
would be good to differentiate this case (creating compound file for
doc store files vs for a real segment) to MergePolicy somehow (maybe
add a boolean arg "isDocStore" or some such?).

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518222
 ] 

Steven Parkes commented on LUCENE-847:
--

On a related note, Mike, there a few FIXME's in IW related to useCompoundFile: 
it doesn't exist in IW anymore (other than as a deprecated feature). The goal 
was to allow merge policies to decide when to use compound files, perhaps on a 
segment-by-segment basis. That all works fine for merge operations.

But there's also the case of new segments, what format they should be in. These 
are segments that are going to be created by IW (via DocsWriter?) My stab at 
this was to have IW ask the merge policy. Since this isn't a merge, per say, 
the IW passes to the merge policy the current set of segment infos and the new 
segment info, asking what format the new segment info should use. So 
MergePolicy has

boolean useCompoundFile(SegmentInfos segments, SegmentInfo newSegment);

Looking at IW, with the new DocsWriter stuff, it looks like there isn't a 
segmentInfo object for the new segment at the time the predicate is being 
called. Would it be possible to make one? Something like 
DocumentsWriter#getDocStoreSegmentInfo() analogous to 
DocumentsWriter#getDocStoreSegment()? It could be an object just thrown away.

Is this a bad idea?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210
 ] 

Steven Parkes commented on LUCENE-847:
--

Is the separate IndexMerger interface really necessary?

I wrestled with this, so in the past, it wouldn't have taken much to convince 
me otherwise. The reason for the extra interface is I was hoping to discourage 
or create a little extra friction for merge policies in terms of looking too 
much into the internals of IndexWriter.

As an example, it's not a good idea for merge policies to be able to access 
IndexWriter#segmentInfos directly. (That's a case I would like to solve by 
making segmentInfos private, but I haven't looked at that completely yet and 
even beyond that case, I'd still like merge policies to have a very clean 
interface with IWs.)

It feels kinda weird for me to be arguing for constraints since I work mostly 
in dynamic languages that have none of this stuff. But it reflects my goal that 
merge policies simply be algorithms, not real workers.

Moreover, I think it may be useful for implementing concurrency (see below).

While LogDocMergePolicy does need "maxBufferedDocs", I think,
instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
through" to the LogDocMergePolicy if that is the merge policy in
use (and LogDocMergePolicy should store its own private
"minMergeDocs").

The thing here is that the merge policy has nothing to do with max buffered 
docs, right? The code for buffering docs for the first segment is wholly in the 
IndexWriter. LogDocMergePolicy happens to need it (in its current incarnation) 
in order to handle the way the log levels are computed. This could, of course, 
be changed. There's nothing that says a merge policy has to look at these 
values, just that they're available should the merge policy want to look.

I guess my idea was that the index writer was responsible for handling the 
initial segment (with DocsWriter, if it wants) and also to provide an 
indication to the merge policies how it was handling them.

It's possible to have the merge policy influence the first segment size but 
that opens up a lot of issues. Does every merge policy then have to know how to 
trade between buffering by doc count and buffering by ram? I was hoping to 
avoid that.

I'm not all that happy with this the way it is, but supporting both by-doc and 
by-ram is messy but seems necessary. This was my take on making it least messy?

In LogDocMergePolicy, it seems like the checking that's done for
whether a SegmentInfo is in a different directory, as well as the
subsequent copy to move it over to the IndexWriter's directory,
should not live in the MergePolicy?

I see two parts to this.

First, I hesitate to make it not possible for merge policy to see the directory 
information, i.e., remove IndexMerger#getDirectory(). Copying a segment is one 
way to get it into the current directory, but merging with other segments does 
to. A merge policy could decide to go ahead and merge a bunch of segments that 
are in other directories rather than just copy them individually. Taking away 
getDirectory() removes that option.

As to how to handle the case where single segments are copied, I guess my main 
reason for leaving that in the merge policy would be for simplicity. Seems 
nicer to have all segment amalgamation management in one place, rather than 
some in one class and some in another. Could be factored into an optional base 
merge policy for derived classes to use as they might like.

The "checkOptimize" method in LogDocMergePolicy seems like it
belongs back in IndexWriter: I think we don't want every
MergePolicy having to replicate that tricky while condition.

The reason for not doing this was I can imagine different merge policies having 
a different model of what optimize means. I can imagine some policies that 
would not optimize everything down to a single segment but instead obeyed a max 
segment size. But we could factor the default conditional into an optional base 
class, as above.

It is an ugly conditional that there might be better way to formulate, so that 
policies don't have to look at the grody details like hasSeparateNorms. But I'd 
still like the merge policies to be able to decide what optimize means at a 
high level.

Maybe we could change MergePolicy.merge to take a boolean "forced"
which, if true, means that the MergePolicy must now pick a merge
even if it didn't think it was time to.  This would let us move
that tricky while condition logic back into IndexWriter.

I didn't follow this. forced == optimize? If not, what does pick a merge mean? 
Not sure what LogDoc should do for merge(force=true) if it thinks everything is 
fine?

Also, I think at some point we may want to have an optimize()
method that takes an int parameter stating the max # segments in
  

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518186
 ] 

Michael McCandless commented on LUCENE-847:
---

> > I think we ideally would like concurrency to be fully independent of
> > the merge policy.
>
> I thought of that, too, while taking a fresh look at things
> again. It's my current approach, though I'm not yet sure there won't
> be stumbling blocks. More soon, hopefully.

Well I think the current MergePolicy API (where the "merge" method
calls IndexWriter.merge itself, must cascade itself, etc.) makes it
hard to build a generic ConcurrentMergePolicy "wrapper" that you could
use to make any MergePolicy concurrent (?).  How would you do it?

EG I'm working on a new MergePolicy for LUCENE-845, which would be
nice to run concurrently, but I'd really rather not have to figure out
how to build my own concurrency/locking/etc in it.  Ideally
"concurrency" is captured as a single wrapper class that we all can
re-use on top of any MergePolicy.  I think we can do that with the
proposed simplification.

> > I think with one change to your MergePolicy API & control flow, we
> > could make this work very well: instead of requiring the MergePolicy
> > to call IndexWriter.merge, and do the cascading, it should just
> > return the one MergeSpecification that should be done right now.

> Hmm ... interesting idea. I thought about it briefly, though I
> didn't pursue it (see below). It would end up changing the possible
> space of merge policies subtly. You wouldn't be able to have any
> state in the algorithm. Arguably this is a good thing. There is also
> a bit more overhead, since starting the computation of potential
> merges from scratch each time could imply a little more computation,
> but I suspect this is not significant.

I think you can still have state (as instance variables in your
class)?  How would this simplification restrict the space of merge
policies?

> > When the inner MergePolicy wants to do a merge, the
> > ConcurrentMergePolicy would in turn kick off that merge in the BG but
> > then return null to the IndexWriter allowing IndexWriter to return to
> > its caller, etc.
>
> I'm a little unsure here. Are you saying the ConcurrentMergePolicy
> does the merges itself, rather than using the writer? That's going
> to mean a synchronization dance between the CMP and the
> writer. There's no question but that there has to be some synch
> dance, but my current thinking was to try to keep as cleanly within
> one class, IW, as I could.

Oh, no: ConcurrentMergePolicy would still call IndexWriter.merge(spec),
just with a separate thread.  And so all synchronization required is
still inside IndexWriter (I think?).

In fact, if we stick with the current MergePolicy API, aren't you
going to have to put some locking into eg the LogDocMergePolicy when
concurrent merges might be happening?  With the new approach,
IndexWriter could invoke MergePolicy.merge under a
"synchronized(segmentInfos)", and then each MergePolicy doesn't have
to deal with locking at all.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518184
 ] 

Michael McCandless commented on LUCENE-847:
---


Some more feedback:

  - Is the separate IndexMerger interface really necessary?  Can't we
just use IndexWriter directly?  It's somewhat awkward/forced now,
because the interface has getters for ramBufferSizeMB and
maxBufferedDocs that are really a "writer" (flushing) thing not a
"merging" thing.

While LogDocMergePolicy does need "maxBufferedDocs", I think,
instead, in IndexWriter's "setMaxBufferedDocs()" it should "write
through" to the LogDocMergePolicy if that is the merge policy in
use (and LogDocMergePolicy should store its own private
"minMergeDocs").

I think the three getters may not even be needed (based on
comments below), in which case it seems like we shouldn't be
creating a new interface.

  - In LogDocMergePolicy, it seems like the checking that's done for
whether a SegmentInfo is in a different directory, as well as the
subsequent copy to move it over to the IndexWriter's directory,
should not live in the MergePolicy?

Otherwise, every MergePolicy is going to have to duplicate this
code.  Not to mention, we may someday create a more efficient way
to copy than running a single-segment merge (which is a very
inefficient, but, we only do it in addIndexes* I think).  I'd like
to capture this in one place (IndexWriter).

EG, the writer could have its own "copyExternalSegments" method
which is called in addIndexes* and also optimize(), after the
merge policy has done its merge, which does the check for wrong
directory and subsequent copy.

I think this would mean IndexMerger.getDirectory() is not really
needed.

  - The "checkOptimize" method in LogDocMergePolicy seems like it
belongs back in IndexWriter: I think we don't want every
MergePolicy having to replicate that tricky while condition.

Maybe we could change MergePolicy.merge to take a boolean "forced"
which, if true, means that the MergePolicy must now pick a merge
even if it didn't think it was time to.  This would let us move
that tricky while condition logic back into IndexWriter.

Also, I think at some point we may want to have an optimize()
method that takes an int parameter stating the max # segments in
the resulting index.  This would allow you to optimize down to <=
N segments w/o paying full cost of a complete "only one segment"
optimize.  If we had a "forced" boolean then we could build such
an optimize method in the future, whereas as it stands now it
would not be so easy to retrofit previously created MergePolicy
classes to do this.

And some more minor feedback:

  - I love the simplification of addIndexesNoOptimize :)  Though (same
comment as above) I think we should move that final "copy if
different directory" step back in IndexWriter.

  - There are some minor things that should not be committed eg the
added "infoStream = null" in IndexFileDeleter.  I typically try to
put a comment "// nocommit" above such changes as I make them to
remind myself and keep them from slipping in.

  - In the deprecated IndexWriter methods you're doing a hard cast to
LogDocMergePolicy which gives a bad result if you're using a
different merge policy; maybe catch the class cast exception (or,
better, check upfront if it's an instanceof) and raise a more
reasonable exception if not?

  - IndexWriter around line 1908 has for loop that has commented out
"System.err.println"; we should just comment out/remove for loop

  - These commented out synchronized spook me a bit:

  /* synchronized(segmentInfos) */ {

Are they needed in these contexts?  Is this only once we have
concurrent merging?  (This ties back to the first feedback to
simplify MergePolicy API so that this kind of locking is only
needed inside IndexWriter).

  - Can we support non-contiguous merges?  If I understand it
correctly, the MergeSpecification can express such a merge, it's
just that the current IndexMerger (IndexWriter) cannot execute it,
right?  So we are at least not precluding fixing this in the
future.

  - It confuses me that MergePolicy has a method "merge(...)" -- can
we rename it to "maybeMerge(..)" or "checkForMerge(...)"?  Ie,
this method determines whether a merge is necessary and, if so, it
then asks IndexMerger to in fact execute the merge (or, returns
the spec)?

  - Instead of IndexWriter.releaseMergePolicy() can we have
IndexWriter only close the merge policy if it was the one that had
created it?  (Similar to how IndexWriter closes the dir if it has
opened it from a String or File, but does not close it if it was
passed in).



> Factor merge policy out of IndexW

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518165
 ] 

Steven Parkes commented on LUCENE-847:
--

I
think we ideally would like concurrency to be fully independent of the
merge policy.

I thought of that, too, while taking a fresh look at things again. It's my 
current approach, though I'm not yet sure there won't be stumbling blocks. More 
soon, hopefully.

I think with one change to your MergePolicy API & control flow, we
could make this work very well: instead of requiring the MergePolicy
to call IndexWriter.merge, and do the cascading, it should just return
the one MergeSpecification that should be done right now.

Hmm ... interesting idea. I thought about it briefly, though I didn't pursue it 
(see below). It would end up changing the possible space of merge policies 
subtly. You  wouldn't be able to have any state in the algorithm. Arguably this 
is a good thing. There is also a bit more overhead, since starting the 
computation of potential merges from scratch each time could imply a little 
more computation, but I suspect this is not significant.

When the inner MergePolicy wants
to do a merge, the ConcurrentMergePolicy would in turn kick off that
merge in the BG but then return null to the IndexWriter allowing
IndexWriter to return to its caller, etc.

I'm a little unsure here. Are you saying the ConcurrentMergePolicy does the 
merges itself, rather than using the writer? That's going to mean a 
synchronization dance between the CMP and the writer. There's no question but 
that there has to be some synch dance, but my current thinking was to try to 
keep as cleanly within one class, IW, as I could.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518162
 ] 

Michael McCandless commented on LUCENE-847:
---

This looks great Steve!

More specific feeedback soon, but ... in thinking about concurrency
(and from reading your comments about it in LogDocMergePolicy), I
think we ideally would like concurrency to be fully independent of the
merge policy.

Ie, just like you can take any shell command and choose to run it in
the background by sticking an "&" on the end, I should be able to take
my favorite MergePolicy instance X and "wrap" it inside a "concurrent
merge policy wrapper".  Eg, instantiate ConcurrentMergePolicy(X), and
then ConcurrentMergePolicy would take the merges requested by X and do
them in the background.

I think with one change to your MergePolicy API & control flow, we
could make this work very well: instead of requiring the MergePolicy
to call IndexWriter.merge, and do the cascading, it should just return
the one MergeSpecification that should be done right now.  This would
mean the "MergePolicy.merge" method would return null if no merge is
necessary right now, and would return a MergeSpecification if a merge
is required.

This way, it is IndexWriter that would execute the merge.  When the
merge is done, IndexWriter would then call the MergePolicy again to
give it a chance to "cascade".  This simplifies the locking because
IndexWriter can synchronize on SegmentInfos when it calls
"MergePolicy.merge" and so MergePolicy no longer has to deal with this
complexity (that SegmentInfos change during merge).

Then, with this change, we could make a ConcurrentMergePolicy that
could (I think) easily "wrap" itself around another MergePolicy X,
intercepting the calls to "merge".  When the inner MergePolicy wants
to do a merge, the ConcurrentMergePolicy would in turn kick off that
merge in the BG but then return null to the IndexWriter allowing
IndexWriter to return to its caller, etc.

Then, this also simplifies MergePolicy implementations because you no
longer have to deal w/ thread safety issues around driving your own
merges, cascading merges, dealing with sneaky SegmentInfos changing
while doing the merge, etc


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-06 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518013
 ] 

Steven Parkes commented on LUCENE-847:
--

For the time being, the patch also contains some of the code from LUCENE-971 
since that's how I test it.

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-05-03 Thread Ning Li

Having the merge policy own segmentInfos sounds kind of hard to me.
Among other things, there's a lot of code in IndexWriter for managing
segmentInfos with regards to transactions. I'm pretty wary of touching
that code. Is there a way around that?


But conceptually, do you agree it's a good idea for MergePolicy to own
segmentInfos?

I know the code for managing segmentInfos w.r.t transactions is hard.
I had to modify it when coding deleteDocuments for IndexWriter. I
wonder, since exceptions are the rare case, can we simply restore
segmentInfos from the last valid segments file? Hopefully it's simple
thus easy to maintain as well. Opinions?

Ning

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-05-02 Thread Steven Parkes
By atomic, I meant that the writer knows the policy and the policy knows
the writer, that it wouldn't be possible for to set a merge policy on a
writer without the merge policy "knowing" that. It's easy enough to
implement with Chris's code (which, I think, makes it a compile error)
or with a simple function for writers to register themselves with the
merge policy when the merge policy is set.

Having the merge policy own segmentInfos sounds kind of hard to me.
Among other things, there's a lot of code in IndexWriter for managing
segmentInfos with regards to transactions. I'm pretty wary of touching
that code. Is there a way around that?

-Original Message-
From: Ning Li [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 02, 2007 7:54 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-847) Factor merge policy out of
IndexWriter

On 3/23/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:
> In fact, there a few things here that are fairly subtle/important. The
relationship/protocol between the writer and policy is pretty strong.
This can be seen in the strawman concurrent merge code where the merge
policy holds state and relies on being called from a synchronized writer
method.   If that goes forward anything like it is, it would argue for
tightening that api up some. Chris suggested a way to make the
writer<>polcy relationship "atomic". I didn't add the code (yet) but I'm
not against it.


What does "atomic" mean here?

I'm thinking, can we move segmentInfos from IndexWriter to merge
policy? This way, IndexWriter is in charge of in-mem part and inserts
and deletes, and merge policy is in charge of disk part and merges.
Then only IndexWriter calls methods in merge policy but not the other
way around. Also, it'll be much easier to support concurrent merges in
a merge policy when it owns segmentInfos.

Just my two cents.

Regards,
Ning

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-05-02 Thread Ning Li

On 3/23/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:

In fact, there a few things here that are fairly subtle/important. The relationship/protocol 
between the writer and policy is pretty strong. This can be seen in the strawman concurrent 
merge code where the merge policy holds state and relies on being called from a synchronized 
writer method.   If that goes forward anything like it is, it would argue for tightening that 
api up some. Chris suggested a way to make the writer<>polcy relationship 
"atomic". I didn't add the code (yet) but I'm not against it.



What does "atomic" mean here?

I'm thinking, can we move segmentInfos from IndexWriter to merge
policy? This way, IndexWriter is in charge of in-mem part and inserts
and deletes, and merge policy is in charge of disk part and merges.
Then only IndexWriter calls methods in merge policy but not the other
way around. Also, it'll be much easier to support concurrent merges in
a merge policy when it owns segmentInfos.

Just my two cents.

Regards,
Ning

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-04-19 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490192
 ] 

Steven Parkes commented on LUCENE-847:
--

Here are some numbers comparing the load performance for the factored vs. 
non-factored merge policies.

The setup uses enwiki, loads 200K documents, and uses 4 different combinations 
of maxBufferedDocs and mergeFactor (just the default from the standard 
benchmark, not because I necessarily thought it was a good idea.)

The factored merge policy seems to be on the order of 1% slower loading than 
the non-factored version ... and I'm not sure why, so I'm going to check into 
this. The factored version does more examination of segment list than the 
non-factored version, so there's compute overhead, but I would expect that to 
be swamped by I/O Maybe that's not a good assumption? Or it might be doing 
different merges for reasons I haven't considered, which I'm going to check.

Relating this to some of the merge discussions, I'm going to look at monitoring 
both the number of merges taking place and the size of those merges. I think 
that's helpful in understand different candidate merge policies, in addition to 
absolute difference in runtime.

I also think histogramming  the per-doc cost would also be interesting, since 
mitigating the long delay at cascading merges is at least one goal of a 
concurrent merge policy.

And all this doesn't even consider testing the recent stuff on merging multiple 
indexes. That's an area where the factored merge policy differs (because of the 
simpler interface.)

I'm curious if anyone is surprised by these numbers, the 60 docs/sec, in 
particular. This machine is a dual dual-core xeon, writing to a single local 
disk.  My dual opty achieved ~85-100 docs/sec on a three disk SATA3 RAID5 array.

Non-factored (current) merge policy

 [java] > Report sum by Prefix (MAddDocs) and Round (8 about 8 
out of 33)
 [java] Operation   round mrg buf   runCnt   recsPerRunrec/s  
elapsedSecavgUsedMemavgTotalMem
 [java] MAddDocs_20 0  10  101   20 41.6
4,804.1111,758,928 12,591,104
 [java] MAddDocs_20 -   1 100  10 -  -   1 -  -  20 -  -  - 50.0 -  
4,000.25 -  34,831,992 -   52,563,968
 [java] MAddDocs_20 2  10 1001   20 49.9
4,004.9542,158,232 60,444,672
 [java] MAddDocs_20 -   3 100 100 -  -   1 -  -  20 -  -  - 57.9 -  
3,455.97 -  45,646,680 -   61,083,648
 [java] MAddDocs_20 4  10  101   20 44.9
4,458.6636,928,616 61,083,648
 [java] MAddDocs_20 -   5 100  10 -  -   1 -  -  20 -  -  - 50.4 -  
3,965.98 -  47,855,064 -   61,083,648
 [java] MAddDocs_20 6  10 1001   20 49.7
4,023.5152,506,448 64,217,088
 [java] MAddDocs_20 -   7 100 100 -  -   1 -  -  20 -  -  - 57.9 -  
3,451.39 -  64,466,128 -   73,220,096

Factored (new) merge policy

 [java] > Report sum by Prefix (MAddDocs) and Round (8 about 8 
out of 33)
 [java] Operation   round mrg buf   runCnt   recsPerRunrec/s  
elapsedSecavgUsedMemavgTotalMem
 [java] MAddDocs_20 0  10  101   20 41.4
4,828.2510,477,976 12,386,304
 [java] MAddDocs_20 -   1 100  10 -  -   1 -  -  20 -  -  - 50.4 -  
3,968.27 -  38,333,544 -   46,170,112
 [java] MAddDocs_20 2  10 1001   20 50.3
3,973.5233,539,824 63,860,736
 [java] MAddDocs_20 -   3 100 100 -  -   1 -  -  20 -  -  - 58.6 -  
3,413.87 -  44,580,528 -   87,781,376
 [java] MAddDocs_20 4  10  101   20 45.3
4,411.5057,850,104 87,781,376
 [java] MAddDocs_20 -   5 100  10 -  -   1 -  -  20 -  -  - 51.0 -  
3,921.48 -  62,793,432 -   87,781,376
 [java] MAddDocs_20 6  10 1001   20 50.4
3,969.8749,625,496 93,966,336
 [java] MAddDocs_20 -   7 100 100 -  -   1 -  -  20 -  -  - 58.7 -  
3,409.51 -  68,100,288 -  129,572,864


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: concurrentMerge.patch, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online

RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Michael McCandless

"Steven Parkes" <[EMAIL PROTECTED]> wrote:
> I've been wondering about taking minMergeDocs out of LMP
> (LogarithmicMergePolicy): if IW is doing maxBufferedDocs, can we get by
> with
>   ceil(log(docs))
> rather than
>   ceil(log(ceil(docs/minMergeDocs))
> (That's not exactly right, but it's close). The simplicity appeals to
> me, but ...

I think we could do that?

Though if we change the default to be "by #bytes used by each segment"
(for the new default "by size" merge policy) then we can disregard
#docs in a segment during merging entirely?  (And then, leave the "by
#docs" legacy merge policy as is?).

> If we remove these from the MergePolicy interface then maybe we
> don't need MergePolicyBase?  (Just to makes things simpler).
> 
> Just a DRY class. I have no strong feeling about this. In fact, I went
> back and forth on it. It's served as a placeholder while I experimented.

Got it.  I was thinking once we removed these params from the base
then there was even less "repeating" to worry about.

>   * I was a little spooked by this change to TestAddIndexesNoOptimize:
> 
>   -assertEquals(2, writer.getSegmentCount());
>   +assertEquals(3, writer.getSegmentCount());
> 
> I think with just the refactoring, there should not need to be any
> changes to unit tests right?
> 
> I don't know if I this got into what I wrote either in e-mail or in the
> start of the comments. I guess I've done two steps in one here: the
> factoring isn't just renaming methods and classes. I did create an
> MergePolicy interface that is has a slight simplificatin on how the
> merge policy is currently implemented.

Ahhh, sorry, I missed that this was not a pure refactoring.  I think
you did mention this.  OK now that I understand the issue better, I
agree, let's keep the merge policy interface simple.  I think the
merge policy should not need to know the "history" of how the segments
came to be in this index (addIndexes, flush, etc); instead, it should
look at them now and decide 1) whether to merge, and 2) which specific
segments to merge.

>   * It's interesting that you've pulled "useCompoundFile" into the
> LegacyMergePolicy.  I'm torn on whether it belongs in MergePolicy
> at all, since this is really a file format issue?
> 
> Well, the idea was here that you might want to use non-compound files
> for big segments (since you have few of them) and compound for smaller
> segments. It basically reflects the idea that to some extent, the merge
> policy is factoring the number of file descriptors required into its
> decision.

Ahh that's a good idea!  I guess we could look at compound file as a
form of merging: you've merged many files into a single file in order
to save on file-descriptors.  OK I think that (moving decision of CFS
or not for a given segment, and, for a newly flushed segment, into the
merge policy) makes sense.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Steven Parkes
Yes, I'll separate out issues related to the basic refactor before
submitting a candidate patch. I actually thought it might be helpful to
keep it in the rough version to see context. But I can do that at any
time ...

With the factored merge policy, it's easy enough to create a merge
policy on size parallel to the one on docs. Hmmm ... suppose one could
use derivation of one from the other or from a common base given the
appropriate factoring of "size" in the algorithm.

I really want to do some larger tests of this. I've downloaded Wikipedia
and plan to add support for it in the benchmarker stuff (if anyone else
is doing this, can you stop me now?) I figure I'll try it on my main
machine and my laptop. My main machine has a lot of RAM and I wonder how
big the corpus has to get before you see signficant changes.

-Original Message-
From: Michael McCandless (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Sunday, March 25, 2007 5:47 AM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-847) Factor merge policy out of
IndexWriter


[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira
.plugin.system.issuetabpanels:comment-tabpanel#action_12483929 ] 

Michael McCandless commented on LUCENE-847:
---

Steven, I looked through the patch quickly.  It looks great!  First
some general comments and then I'll add more specifics as
separate comments.

Can you open separate issues for the other new and interesting merge
policies here?  I think the refactoring of merge policy plus creation
of the default policy that is identical to today's merge policy, which
should be a fairly quick and low-risk operation, would then remain
under this issue?

Then, iterating / vetting / debugging the new interesting merge
policies can take longer under their own separate issues and time
frame.

On staging I think we could first do this issue (decouple MergePolicy
from writer), then LUCENE-845 because it blocks LUCENE-843 (which
would then be fixing LogarithmicMergePolicy to use segment sizes
instead of docs counts as basis for determing levels) then LUCENE-843
(performance improvements for how writer uses RAM)?




> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it
pluggable, making it possible for apps to choose a custom merge policy
and for easier experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Michael McCandless

"Steven Parkes" <[EMAIL PROTECTED]> wrote:
> Well, with all due respect, I don't find whitespace malignant ...

Oh, sorry.  I call it "cancerous" because it has a tendency to
spread uncontrollably throughout the source code :)

> That said, I don't get into this anymore. I make all the necessary
> whitespace changes at the end. When making a candidate patch, I go
> through it line by line looking for whitespace/style changes that I've
> inadvertently added and take them out. In a case like this, where I've
> been writing something from scratch, I just take a valium first.
> 
> It's no big deal to me.

OK, thanks ;)

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Michael McCandless

"Steven Parkes" <[EMAIL PROTECTED]> wrote:
> Yes, I'll separate out issues related to the basic refactor before
> submitting a candidate patch. I actually thought it might be helpful to
> keep it in the rough version to see context. But I can do that at any
> time ...

OK, that makes sense to leave it as one patch until things get closer
to ready.

> With the factored merge policy, it's easy enough to create a merge
> policy on size parallel to the one on docs. Hmmm ... suppose one could
> use derivation of one from the other or from a common base given the
> appropriate factoring of "size" in the algorithm.

Factoring out just the determination of "size" would be nice.

Given how serious LUCENE-845 is (over-merging when flushing by RAM) I
think we should in fact switch the default merge policy to by "by
size" rather than "by doc count"?  (But keep the "by doc count"
version available in case people want to switch back?).

Especially with LUCENE-843 (where I plan to change writer to flush by
RAM usage by default) we need the default merge policy to not have
this bug.

> I really want to do some larger tests of this. I've downloaded Wikipedia
> and plan to add support for it in the benchmarker stuff (if anyone else
> is doing this, can you stop me now?) I figure I'll try it on my main
> machine and my laptop. My main machine has a lot of RAM and I wonder how
> big the corpus has to get before you see signficant changes.

That sounds awesome!  I'd love to use Wikipedia for testing LUCENE-843
as well.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Steven Parkes
Well, with all due respect, I don't find whitespace malignant ...

That said, I don't get into this anymore. I make all the necessary
whitespace changes at the end. When making a candidate patch, I go
through it line by line looking for whitespace/style changes that I've
inadvertently added and take them out. In a case like this, where I've
been writing something from scratch, I just take a valium first.

It's no big deal to me.

-Original Message-
From: Michael McCandless (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Sunday, March 25, 2007 5:49 AM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-847) Factor merge policy out of
IndexWriter


[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira
.plugin.system.issuetabpanels:comment-tabpanel#action_12483930 ] 

Michael McCandless commented on LUCENE-847:
---

My first comment, which I fear will be the most controversial feedback
here :), is a whitespace style question: I'm not really a fan of
"cancerous whitespace" where every ( [ etc has its own whitespace
around it.

I generally prefer minimal whitespace within reason (ie as long as it
does not heavily hurt readability).  The thing is I don't know that
Lucene has settled on this / if anyone else shares my opinion?  It
does seem that "two space indentation" is the standard, which I agree
with, but I don't know if whitespace style has otherwise been agreed
on?  Many people will say it's unimportant to agree on whitespace but
I feel it's actually quite important.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it
pluggable, making it possible for apps to choose a custom merge policy
and for easier experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Steven Parkes
   * I think maxBufferedDocs should not be exposed in any *MergePolicy
classes or interfaces?  I'm planning on deprecating this param
with LUCENE-843 when we switch by default to "buffering by RAM
usage" and it really relates to "how/when should writer flush its
RAM buffer".

  * I also think "minMergeDocs" (which today is one and the same as
maxBufferedDocs in IndexWriter but conceptually could be a
different configuration) also should not appear in the MergePolicy
interface.  I think it should only appear in
LogarithmicMergePolicy?

Yeah, I've thought about these two. As far as I've been able to see,
making maxBufferedDocs an IW (IndexWriter) thing is good.

I've been wondering about taking minMergeDocs out of LMP
(LogarithmicMergePolicy): if IW is doing maxBufferedDocs, can we get by
with
ceil(log(docs))
rather than
ceil(log(ceil(docs/minMergeDocs))
(That's not exactly right, but it's close). The simplicity appeals to
me, but ...

If we remove these from the MergePolicy interface then maybe we
don't need MergePolicyBase?  (Just to makes things simpler).

Just a DRY class. I have no strong feeling about this. In fact, I went
back and forth on it. It's served as a placeholder while I experimented.

  * I think we should not create a LegacyMergePolicy interface.

Yeah, I agree with this. Hard coding LMP in IW made me nervous, but
you're right, it's only in there long enough to support the deprecated
methods. Anybody adding a new merge policy doesn't need to rely on this.

  * I was a little spooked by this change to TestAddIndexesNoOptimize:

  -assertEquals(2, writer.getSegmentCount());
  +assertEquals(3, writer.getSegmentCount());

I think with just the refactoring, there should not need to be any
changes to unit tests right?

I don't know if I this got into what I wrote either in e-mail or in the
start of the comments. I guess I've done two steps in one here: the
factoring isn't just renaming methods and classes. I did create an
MergePolicy interface that is has a slight simplificatin on how the
merge policy is currently implemented.

In particular, the current merge policy as implemented depends on more
than just a sequence of segments: it knows (or assumes) to some extent
how that sequence is created. In particular, in
TestAddIndexesNoOptimize, it knows that a portion of the sequence comes
from the current index and the remainder comes from other indexes.

The MergePolicy as I drafted it has just sequences of segments, without
this extra level of detail.

When I started, I didn't know how big a change this would be so it was
kind of an experiment.

At this point, the straw man version is pretty similar to the
non-factored version: this test is the only one where there's a
difference and it's pretty inconsequential. But the two version do
generate (slightly?) different merge operations and I still want to test
them on bigger corpora to see if there is any significant difference.

If there weren't, my vote would be to keep the simpler version. Does
anyone have any compelling argument for the more complicated case? In
addition to
merge( SegmentInfos )
in MergePolicy, it would have 
merge( SegmentInfos, SegmentInfos )
or maybe even
merge( SegmentInfos[] )
I just don't have got any compelling examples for the extra complexity.

  * It's interesting that you've pulled "useCompoundFile" into the
LegacyMergePolicy.  I'm torn on whether it belongs in MergePolicy
at all, since this is really a file format issue?

Well, the idea was here that you might want to use non-compound files
for big segments (since you have few of them) and compound for smaller
segments. It basically reflects the idea that to some extent, the merge
policy is factoring the number of file descriptors required into its
decision.

-Original Message-
From: Michael McCandless (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Sunday, March 25, 2007 5:49 AM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-847) Factor merge policy out of
IndexWriter


[
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira
.plugin.system.issuetabpanels:comment-tabpanel#action_12483931 ] 

Michael McCandless commented on LUCENE-847:
---

OK some specific comments, only on the refactoring (ie I haven't
really looked at the new merge policies yet):

  * I also think "minMergeDocs" (which today is one and the same as
maxBufferedDocs in IndexWriter but conceptually could be a
different configuration) also should not appear in the MergePolicy
interface.  I think it should only appear in
LogarithmicMergePolicy?


  * I think we should not create a LegacyMergePolicy interface.  But I
realize you need t

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483931
 ] 

Michael McCandless commented on LUCENE-847:
---

OK some specific comments, only on the refactoring (ie I haven't
really looked at the new merge policies yet):

  * I think maxBufferedDocs should not be exposed in any *MergePolicy
classes or interfaces?  I'm planning on deprecating this param
with LUCENE-843 when we switch by default to "buffering by RAM
usage" and it really relates to "how/when should writer flush its
RAM buffer".

  * I also think "minMergeDocs" (which today is one and the same as
maxBufferedDocs in IndexWriter but conceptually could be a
different configuration) also should not appear in the MergePolicy
interface.  I think it should only appear in
LogarithmicMergePolicy?

If we remove these from the MergePolicy interface then maybe we
don't need MergePolicyBase?  (Just to makes things simpler).

  * I think we should not create a LegacyMergePolicy interface.  But I
realize you need this so the deprecated methods in IndexWriter
(setMergeFactor, setMaxBufferedDocs, setMaxMergeDocs, etc.) can be
implemented.  How about instead these methods will only work if
the current merge policy is the LogarithmicMergePolicy?  You can
check if the current mergePolicy is an instanceof
LogarithmicMergePolicy and then throw eg an IllegalStateException
if it's not?

Ie, going forward, with new and interesting merge policies,
developers should interact with their merge policy to configure
it.

  * I was a little spooked by this change to TestAddIndexesNoOptimize:

  -assertEquals(2, writer.getSegmentCount());
  +assertEquals(3, writer.getSegmentCount());

I think with just the refactoring, there should not need to be any
changes to unit tests right?

  * It's interesting that you've pulled "useCompoundFile" into the
LegacyMergePolicy.  I'm torn on whether it belongs in MergePolicy
at all, since this is really a file format issue?

For example, newly written segments (no longer a "merge" with
LUCENE-843) must also know whether to write in compound file
format.  If we make interesting file format changes in the future
that are configurable by the developer we wouldn't want to change
all MergePolicy classes to propogate that.  It feels like using
compound file or not should remain only in IndexWriter?


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483930
 ] 

Michael McCandless commented on LUCENE-847:
---

My first comment, which I fear will be the most controversial feedback
here :), is a whitespace style question: I'm not really a fan of
"cancerous whitespace" where every ( [ etc has its own whitespace
around it.

I generally prefer minimal whitespace within reason (ie as long as it
does not heavily hurt readability).  The thing is I don't know that
Lucene has settled on this / if anyone else shares my opinion?  It
does seem that "two space indentation" is the standard, which I agree
with, but I don't know if whitespace style has otherwise been agreed
on?  Many people will say it's unimportant to agree on whitespace but
I feel it's actually quite important.


> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483929
 ] 

Michael McCandless commented on LUCENE-847:
---

Steven, I looked through the patch quickly.  It looks great!  First
some general comments and then I'll add more specifics as
separate comments.

Can you open separate issues for the other new and interesting merge
policies here?  I think the refactoring of merge policy plus creation
of the default policy that is identical to today's merge policy, which
should be a fairly quick and low-risk operation, would then remain
under this issue?

Then, iterating / vetting / debugging the new interesting merge
policies can take longer under their own separate issues and time
frame.

On staging I think we could first do this issue (decouple MergePolicy
from writer), then LUCENE-845 because it blocks LUCENE-843 (which
would then be fixing LogarithmicMergePolicy to use segment sizes
instead of docs counts as basis for determing levels) then LUCENE-843
(performance improvements for how writer uses RAM)?




> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Steven Parkes
> I haven't read the details, but should maxBufferedDocs be exposed in
> some subinterfaces instead of the MergePolicy interface?

I've been wondering about this, too, but haven't come to any strong
opinions (yet). I figured maybe playing with a few merge policies might
make things clearer.

maxBufferedDocs: is this truly an invariant of all merge policies? I
don't know. But actually, I think a possible question is whether merge
policies should have any role in this at all, or if IndexWriter should
just do it itself. If we go forward with Mike's stuff about writing a
segment w/multiple docs w/o a merge, it's sounding more like the
buffering of docs is not actually a merge policy a question.

maxMergeDocs: should all merge policies accept this?

> 1) A merge thread is started when an IndexWriter is created and
> stopped when the IndexWriter is closed. (A single merge thread is used
> for simplicity. Multiple merge threads could be used.)

I haven't looked at pooling of threads, whether it be one or more than
one, but I agree it needs to be looked at. I've heard that threads can't
be created willy-nilly in J2EE apps but instead have to be drawn from
the J2EE pool, so I figured when we look at pooling, we might need to
accommodate that kind of constrained environment.

> 2) The merge thread periodically checks if there is merge work to do.
> This is done by synchronously checking segmentInfos and producing a
> MergeSpecification if there is merge work to do.

It does this check via a synchronized call on IndexWriter, right?

> 3) If a MergeSpecification is produced, the merge thread goes ahead
> and does the merge. Importantly, documents in the segments being
> merged may be deleted while the concurrent merge is happening. These
> deletes have to be remembered.

Yup, and I haven't looked at that yet.

> I see you start a thread whenever there is merge work. Would it be
> hard to control system load?

I think it needs to be looked at. Since concurrent conflicting merges
aren't allowed, there is a bound on concurrency, but it might be too
loose a bound. I'm setting up tests to start getting a feel for the
dynamics.

My strawman model was to start with as much concurrency as the data
allowed, then scale it back as necessary.

My main interest is in reducing the latency of add docs. In the example
in my head, I have segments on a number of levels. Lets say merges at
the higher end are going to take 3 seconds, 3 hours, and 3 days. I'd
like to launch the 3 day merge and let it run in the background. It
should be a while before a 3 hour merge is required, but if one is
required before the 3 day merge is complete, I'd like not to block in
that case, too. If load is an issue, the idea would be to lower the
priority or suspend the 3 day merge while the 3 hour merge is going.

My focus isn't on slowing things down, i.e., handling a system where you
truly can't keep up, but in spreading out the big lumps of work, rather
than putting them in the add doc control path.

It's possible that at some point you'll want to do a merge that includes
segments that are being merged concurrently. In that case, the code
currently blocks. There are alternatives, like allowing more than
mergeFactor segments on a level, at least temporarily, but I haven't
gone that way yet. So my way of keeping things simple (if any version of
concurrent can be called simple) is not to make blocking impossible, but
to make it less likely. In the serial case, it's a certainty.

The main thing I've been trying to understand up until now was the
concurrency of IndexWriter#segmentInfos, given that multiple merges
could be running. If you allow that merges could be running AND a merge
might be blocked, you can't make a synchronized call on IndexWriter,
because the blocked merge request holds that.

But my most recent thinking has been that I've been going down the wrong
path trying to separately synchronize segmentInfos. I think instead the
merge threads can make a separate queue of merge results that
IndexWriter can look at when it wants to. I'm gonna look at that soon.
Currently my concurrent stuff won't work because of this part is
incomplete.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Ning Li

Hi Steven,

I haven't read the details, but should maxBufferedDocs be exposed in
some subinterfaces instead of the MergePolicy interface? Since some
policies may use it and others may use byte size or something else.

It's great that you've started on concurrent merge as well! I haven't
got a chance to port my code to the trunk. But I want to share the
design.

1) A merge thread is started when an IndexWriter is created and
stopped when the IndexWriter is closed. (A single merge thread is used
for simplicity. Multiple merge threads could be used.)

2) The merge thread periodically checks if there is merge work to do.
This is done by synchronously checking segmentInfos and producing a
MergeSpecification if there is merge work to do.

3) If a MergeSpecification is produced, the merge thread goes ahead
and does the merge. Importantly, documents in the segments being
merged may be deleted while the concurrent merge is happening. These
deletes have to be remembered.

4) The merge is finished. The new segment should replace the segments
it merged in segmentInfos. But before that, the appropriate deletes
accumulated during the merge should be applied. Same as step 2), this
step is also synchronous. When all done, go to step 2).

I see you start a thread whenever there is merge work. Would it be
hard to control system load?

Comments?

Cheers,
Ning

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483742
 ] 

Steven Parkes commented on LUCENE-847:
--

Visibility is one of those things I haven't cleaned up yet.

Client code is gonna want to create and set merge policies. And it'll want to 
set "external" merge policy parameters. That's all probably not controversial.

As for other stuff, I tend to leave things open, but I know that's debatable 
and don't have a strong opinion in this case.

In fact, there a few things here that are fairly subtle/important. The 
relationship/protocol between the writer and policy is pretty strong. This can 
be seen in the strawman concurrent merge code where the merge policy holds 
state and relies on being called from a synchronized writer method.   If that 
goes forward anything like it is, it would argue for tightening that api up 
some. Chris suggested a way to make the writer<>polcy relationship "atomic". I 
didn't add the code (yet) but I'm not against it.







> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483737
 ] 

Doug Cutting commented on LUCENE-847:
-

How public should such an API be?  Should the interface be public?  Should the 
implementations?  The most conservative approach would be to make it all 
package private, to give more leeway for evolving the update API.  But that 
also decreases the utility.  Thoughts?

> Factor merge policy out of IndexWriter
> --
>
> Key: LUCENE-847
> URL: https://issues.apache.org/jira/browse/LUCENE-847
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Assigned To: Steven Parkes
> Attachments: LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]