[jira] [Commented] (LUCENE-4752) Merge segments to sort them

Shai Erera (JIRA) Tue, 19 Mar 2013 21:49:50 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607282#comment-13607282 ]


Shai Erera commented on LUCENE-4752:
------------------------------------

What in the patch guarantees that any segment with more than maxBufferedDocs 
docs is sorted? Perhaps I've missed it, but I looked for code that ensures 
every such segment gets picked up by SortingMP and didn't find it.

I don't think we should, in general, make assumptions based on the 
maxBufferedDocs setting: the default in IWC flushes by RAM consumption, and 
the setting seems only loosely related to whether a segment is sorted. E.g. 
if a segment is sorted but has deletions such that numDocs < maxBufferedDocs, 
we would do a full collection even though we could early terminate as usual?

EarlyTerminatingCollector, I think, need not have getFullCollector. Rather, it 
should wrap any other Collector (not only top-docs collectors). If it detects 
a sorted segment in setNextReader (we still need to figure out how to detect 
that), it early terminates after enough docs were seen; otherwise it keeps 
calling in.collect(). It's the app's responsibility to wrap its collector 
(which could be a ChainingCollector too) with this collector, and to make sure 
that the early termination logic fits its collectors. So I don't think we need 
EarlyTerminationTopDocsCollector, but rather a concrete 
EarlyTerminatingCollector. BTW, EarlyTerminationTopDocsCollector has an 
uninitialized and unused maxUnsortedSize?
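
To make the wrapping idea concrete, here is a rough sketch, not code from the 
patch. SimpleCollector is a simplified stand-in for Lucene's Collector, and the 
boolean passed to setNextReader is purely hypothetical, since how to detect a 
sorted segment is still open. For simplicity the wrapper just stops forwarding 
docs instead of aborting the segment.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Lucene's Collector; NOT the real API.
interface SimpleCollector {
    // Hypothetical signature: the real method takes a reader context, and
    // how to tell that a segment is sorted is still an open question.
    void setNextReader(boolean segmentIsSorted);

    void collect(int doc);
}

// Wraps any collector. In a sorted segment, it stops forwarding docs to the
// wrapped collector once numDocsToCollect docs were seen in that segment.
final class EarlyTerminatingCollectorSketch implements SimpleCollector {
    private final SimpleCollector in;
    private final int numDocsToCollect;
    private boolean sorted;
    private int seenInSegment;

    EarlyTerminatingCollectorSketch(SimpleCollector in, int numDocsToCollect) {
        if (numDocsToCollect <= 0) {
            throw new IllegalArgumentException("numDocsToCollect must be > 0");
        }
        this.in = in;
        this.numDocsToCollect = numDocsToCollect;
    }

    @Override
    public void setNextReader(boolean segmentIsSorted) {
        sorted = segmentIsSorted;
        seenInSegment = 0; // the budget is per segment
        in.setNextReader(segmentIsSorted);
    }

    @Override
    public void collect(int doc) {
        if (sorted && seenInSegment >= numDocsToCollect) {
            return; // sorted segment: enough docs seen, skip the rest
        }
        seenInSegment++;
        in.collect(doc); // unsorted segment, or budget not reached yet
    }
}

// Trivial wrapped collector that records every doc it receives.
final class RecordingCollector implements SimpleCollector {
    final List<Integer> docs = new ArrayList<>();

    @Override
    public void setNextReader(boolean segmentIsSorted) {
        // nothing to reset in this sketch
    }

    @Override
    public void collect(int doc) {
        docs.add(doc);
    }
}
```

With a budget of 2, a sorted segment contributes only its first two docs to 
the wrapped collector, while an unsorted segment is collected in full.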

And hopefully we can stuff the early termination logic down to IndexSearcher 
eventually. There are other scenarios for early termination, such as time 
limit, and therefore I think it's ok if we have an EarlyTerminationException 
which IndexSearcher responds to.
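
As a rough illustration of what "IndexSearcher responds to" could mean, here 
is a sketch of a per-segment search loop that catches a termination exception. 
All names below (EarlyTerminationSignal, SegmentConsumer, SearchLoopSketch, 
LimitPerSegment) are hypothetical and not part of Lucene or the patch. Whether 
the searcher should skip to the next segment or abort the whole search (as a 
time limit probably would) is left open; the sketch assumes the former.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical unchecked exception a collector throws to stop the current segment.
final class EarlyTerminationSignal extends RuntimeException {
    EarlyTerminationSignal() {
        super(null, null, false, false); // control flow only: no stack trace
    }
}

// Simplified stand-in for a collector; NOT Lucene's Collector API.
interface SegmentConsumer {
    void startSegment();

    void collect(int doc);
}

final class SearchLoopSketch {
    // Feeds every segment's docs to the consumer and returns how many
    // segments were terminated early.
    static int search(List<int[]> segments, SegmentConsumer consumer) {
        int terminated = 0;
        for (int[] segment : segments) {
            consumer.startSegment();
            try {
                for (int doc : segment) {
                    consumer.collect(doc);
                }
            } catch (EarlyTerminationSignal e) {
                terminated++; // assumption: skip to the next segment
            }
        }
        return terminated;
    }
}

// Example consumer: accepts at most `limit` docs per segment, then signals.
final class LimitPerSegment implements SegmentConsumer {
    final List<Integer> collected = new ArrayList<>();
    private final int limit;
    private int seen;

    LimitPerSegment(int limit) {
        this.limit = limit;
    }

    @Override
    public void startSegment() {
        seen = 0;
    }

    @Override
    public void collect(int doc) {
        if (seen >= limit) {
            throw new EarlyTerminationSignal();
        }
        seen++;
        collected.add(doc);
    }
}
```

The point of the design is that the collector only signals; the decision of 
what to do next lives in one place, the search loop.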

Adrien, to keep the patch small, could you commit the changes to LTC and 
TestDuelingCodec (as well as the SortingAtomicReader.wrap change) separately? 
These are good changes to commit anyway, and here they only bloat the patch 
and obscure the actual issue's development. Is that possible?
                
> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>         Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
> LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
> natural_10M_ingestion.log, sorting_10M_ingestion.log
>
>
> It would be awesome if Lucene could write the documents out in a segment 
> based on a configurable order.  This of course applies to merging segments 
> too. The benefit is increased locality on disk of documents that are likely to 
> be accessed together.  This often applies to documents near each other in 
> time, but also spatially.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
