[
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592107#comment-13592107
]
Adrien Grand commented on LUCENE-4752:
--------------------------------------
I think a very simple first step could be to have an experimental
IndexWriterConfig option to tell IndexWriter to provide an atomic sorted view
(easy once LUCENE-3918 is committed) of the segments to merge to SegmentMerger
instead of the segments themselves (see calls to
SegmentMerger.add(SegmentReader) in IndexWriter.mergeMiddle). Initial segments
would remain unsorted, but the big ones, those that are interesting for both
index compression and early query termination, would be sorted.
It can seem inefficient to sort segments over and over but I don't think we
should worry too much:
- if we are merging "initial" segments (those created from IndexWriter.flush),
they would be small so sorting/merging them would be fast?
- if we are merging big segments, I think that the following reasons could
make merging slower than a regular merge:
1. computing the new doc ID mapping,
2. random I/O access,
3. not being able to use the specialized codec merging methods.
But if the segments to merge are sorted, computing the new doc ID mapping could
actually be fast (some sorting algorithms such as
[TimSort|http://en.wikipedia.org/wiki/Timsort] perform in O(n) when the input
is a succession of sorted sequences), and the access patterns to the individual
segments would be I/O cache-friendly (because each segment would be read
sequentially). So I think this approach could be fast enough?
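
To make the doc ID mapping argument concrete, here is a minimal sketch in plain Java, not Lucene code. It assumes each segment exposes its per-document sort keys as a long[] that is already in ascending order; the class name DocIdMappingSketch and the method computeDocMap are hypothetical names used only for illustration. Arrays.sort on an object array is a TimSort in the JDK, so when the concatenated input consists of a few long sorted runs (one per segment), the sort mostly reduces to merging those runs.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of Lucene: maps old doc IDs (position in the
// concatenation of the input segments) to new doc IDs in the sorted merged segment.
class DocIdMappingSketch {

    // segmentKeys[s][d] is the sort key of doc d in segment s; each segment is
    // assumed to be sorted already, which is the case discussed above.
    static int[] computeDocMap(long[][] segmentKeys) {
        int total = 0;
        for (long[] seg : segmentKeys) {
            total += seg.length;
        }

        // Concatenate the keys; old doc ID = index in this array.
        final long[] keys = new long[total];
        Integer[] order = new Integer[total];
        int pos = 0;
        for (long[] seg : segmentKeys) {
            for (long k : seg) {
                keys[pos] = k;
                order[pos] = pos;
                pos++;
            }
        }

        // Stable TimSort over old doc IDs, ordered by key. The input is a
        // succession of sorted runs, so the existing order is exploited.
        Arrays.sort(order, Comparator.comparingLong(i -> keys[i]));

        // Invert the permutation: order[newId] = oldId  =>  oldToNew[oldId] = newId.
        int[] oldToNew = new int[total];
        for (int newId = 0; newId < total; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }
}
```

For example, with two sorted segments holding keys {1, 5} and {2, 3, 4}, computeDocMap returns {0, 4, 1, 2, 3}: the second doc of the first segment moves to the end, and the docs of the second segment keep their relative order. Because the sort is stable, documents with equal keys stay in segment order.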
> Merge segments to sort them
> ---------------------------
>
> Key: LUCENE-4752
> URL: https://issues.apache.org/jira/browse/LUCENE-4752
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: David Smiley
> Assignee: Adrien Grand
>
> It would be awesome if Lucene could write the documents out in a segment
> based on a configurable order. This of course applies to merging segments
> too. The benefit is increased locality on disk of documents that are likely to
> be accessed together. This often applies to documents near each other in
> time, but also spatially.