[jira] [Commented] (LUCENE-4752) Merge segments to sort them

Adrien Grand (JIRA) Mon, 04 Mar 2013 02:45:46 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592107#comment-13592107
 ]


Adrien Grand commented on LUCENE-4752:
--------------------------------------

I think a very simple first step could be to have an experimental 
IndexWriterConfig option to tell IndexWriter to provide an atomic sorted view 
(easy once LUCENE-3918 is committed) of the segments to merge to SegmentMerger 
instead of the segments themselves (see calls to 
SegmentMerger.add(SegmentReader) in IndexWriter.mergeMiddle). Initial segments 
would remain unsorted, but the big ones, those that are interesting for both 
index compression and early query termination, would be sorted.

It can seem inefficient to sort segments over and over but I don't think we 
should worry too much:
 - if we are merging "initial" segments (those created from IndexWriter.flush), 
they would be small so sorting/merging them would be fast?
 - if we are merging big segments, I think that the following reasons could 
make merging slower than a regular merge:
   1. computing the new doc ID mapping,
   2. random I/O access,
   3. not being able to use the specialized codec merging methods.

But if the segments to merge are sorted, computing the new doc ID mapping could 
actually be fast (some sorting algorithms such as 
[TimSort|http://en.wikipedia.org/wiki/Timsort] perform in O(n) when the input 
is a succession of sorted sequences), and the access patterns to the individual 
segments would be I/O cache-friendly (because each segment would be read 
sequentially). So I think this approach could be fast enough?
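
To make the doc ID mapping argument concrete, here is a minimal sketch, not the actual IndexWriter/SegmentMerger code. It assumes each segment to merge is already sorted and can expose one long sort key per document; the class and method names (SortedMergeDocMap, computeDocMap) are hypothetical. It concatenates the per-segment keys and sorts the doc IDs with Arrays.sort on an Object[] array, which in OpenJDK 7+ is a TimSort, so the pre-sorted runs (one per segment) are detected and merged rather than sorted from scratch.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, for illustration only (not part of Lucene).
final class SortedMergeDocMap {

    /**
     * segmentKeys[s][d] is the sort key of doc d in segment s; each inner
     * array is assumed to be sorted already. Returns oldToNew, where the
     * index is the doc ID in the concatenated (unsorted) view and the value
     * is the doc ID in the sorted merged segment.
     */
    static int[] computeDocMap(long[][] segmentKeys) {
        int total = 0;
        for (long[] keys : segmentKeys) {
            total += keys.length;
        }

        // Concatenate the keys: the result is a succession of sorted runs.
        long[] allKeys = new long[total];
        Integer[] order = new Integer[total];
        int pos = 0;
        for (long[] keys : segmentKeys) {
            for (long key : keys) {
                allKeys[pos] = key;
                order[pos] = pos;
                pos++;
            }
        }

        // Stable sort of doc IDs by key; ties keep their original order.
        Arrays.sort(order, Comparator.comparingLong(i -> allKeys[i]));

        // Invert the permutation: order[newId] = oldId  =>  oldToNew[oldId] = newId.
        int[] oldToNew = new int[total];
        for (int newId = 0; newId < total; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        long[][] segments = { {1, 4, 9}, {2, 3, 10} };
        // prints [0, 3, 4, 1, 2, 5]
        System.out.println(Arrays.toString(computeDocMap(segments)));
    }
}
```

With a small number of segments per merge, the sort sees only a handful of long runs, which is the case where TimSort does close to linear work. Boxing doc IDs into Integer is only there to keep the sketch short; a real implementation would likely sort primitive ints.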
                
> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>
> It would be awesome if Lucene could write the documents out in a segment 
> based on a configurable order.  This of course applies to merging segments 
> too. The benefit is increased locality on disk of documents that are likely to 
> be accessed together.  This often applies to documents near each other in 
> time, but also spatially.

