[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598295#comment-13598295
 ] 

Shai Erera commented on LUCENE-4752:
------------------------------------

Let's first define what we want to achieve. SortingAtomicReader lets you
sort a single segment as-is or, by wrapping a MultiReader, sort a bunch of
segments while merging them together. However, SortingAR is not usable today in
an online sorting scenario, as it cannot be used by IndexWriter during merge.
For that, we need to open up SegmentMerger to provide a SortingSegmentMerger.

That gets you an index that is sorted per segment, but not globally sorted.
For compression scenarios, that's enough. For early termination, that should
be close to enough, provided you can "abort collection on a per-segment
basis" -- so you'll accumulate the first N docs from each segment and then
return the top-N docs globally. Kind of like how distributed search works.
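
A minimal sketch of that per-segment early termination, not Lucene code: each
segment is modeled as a plain Python list of sort keys for the matching docs,
already in ascending index order (smaller key = better). The function name
top_n_per_segment and this list-based model are assumptions made for
illustration only.

```python
import heapq
import itertools


def top_n_per_segment(segments, n):
    """Read at most n hits from each sorted segment, then merge to a global top n."""
    # Per-segment early termination: only the first n entries of a segment are read.
    candidates = [segment[:n] for segment in segments]
    # Every candidate list is sorted, so a k-way merge yields them in global order.
    return list(itertools.islice(heapq.merge(*candidates), n))


segments = [[1, 4, 9, 12], [2, 3, 20], [0, 7, 8, 30, 31]]
print(top_n_per_segment(segments, 3))
```

The cost is bounded by N docs per segment rather than by segment size, which
is the same trade-off a distributed search makes across shards.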

Now you raise a different scenario - accessing N docs together, which may be
located in different segments ... that's a new thing. If we want to achieve
that, we need a relationship between segments such that seg1 > seg2 >
seg3, so that you potentially visit only one or a few segments when accessing
them. Of course, if we achieve that, then you could also early terminate after
exactly the first N docs, no matter which segment they came from.
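
To illustrate, here is a rough sketch under the assumption that the segments
are totally ordered (every key in one segment sorts before every key in the
next) and each segment is sorted internally. Segments are again plain lists of
sort keys, and first_n_global is a hypothetical name, not a Lucene API.

```python
def first_n_global(ordered_segments, n):
    """Return the first n hits plus the number of segments that had to be visited."""
    hits = []
    visited = 0
    for segment in ordered_segments:
        if len(hits) >= n:
            break  # global early termination: later segments cannot beat these hits
        visited += 1
        hits.extend(segment[: n - len(hits)])
    return hits, visited


ordered_segments = [[0, 1, 2, 3], [4, 5], [6, 7, 8]]
print(first_n_global(ordered_segments, 3))
print(first_n_global(ordered_segments, 5))
```

With N = 3 only the first segment is touched; with N = 5 the second one is
needed as well, and the third is never opened.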

But that's not at all easy. I wonder if there is a good way to achieve that 
without globally sorting the index (i.e. what IndexSorter set out to do in the 
first place). For example, if you sort the docs according to their PageRank 
measure, then doc Y (somewhere near IR.maxDoc()) might be preferred over all 
previously existing documents ... how will you pull it from that segment? 
You'll need to sort the index globally so that it bubbles up to the start of 
the index.

If you sort the documents by date, and assuming docs arrive in a roughly
already-sorted order (i.e. today you'll meet newer docs than you met yesterday,
and older than you'll meet tomorrow), then you could do some tricks to stay in
a globally-sorted (or close to it) state.
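
The difference between the two cases can be shown with a small check, again
modeling segments as lists of sort keys in index order. The helper name
is_globally_sorted and the sample keys are made up for illustration; the rank
keys assume that a smaller key means a better rank.

```python
def is_globally_sorted(segments):
    """True if reading the segments back to back yields non-decreasing keys."""
    previous = None
    for segment in segments:
        for key in segment:
            if previous is not None and key < previous:
                return False
            previous = key
    return True


# Sort by date: a newly flushed segment only holds newer docs, so order survives.
by_date = [[1, 2, 3], [4, 5]]
date_ok = is_globally_sorted(by_date + [[6, 7]])

# Sort by rank: a new doc may outrank everything indexed before it, yet it
# lands in the last segment, near the end of the doc ID space.
by_rank = [[10, 20, 30], [40, 50]]
rank_ok = is_globally_sorted(by_rank + [[5]])

print(date_ok, rank_ok)
```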

I think that in light of that, online global sorting of an index is a
challenging task, one that will need a SortingSegmentMerger / SortingMP or
whatever anyway, so this issue is still valid. Perhaps we do need a
separate issue to discuss how to achieve global sorting (something I'm
trying to achieve in a side-project today, although I have the easy case:
sort-by-date). BTW, to keep an index globally sorted you cannot use TieredMP
(at the very least, you must merge only consecutive segments together), and
you might also need to reverse the order of the segments in the index (as
viewed by the reader), depending on your sort criteria.
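
A toy illustration of both points, under the same list-of-keys model as an
assumption. merge_adjacent and merge_picked are hypothetical helpers, and
where merge_picked places the merged segment (the slot of the first picked
segment) is an assumption of this sketch, not a statement about what
IndexWriter or any merge policy does.

```python
import heapq


def flatten(segments):
    """Keys in the order a reader would see them across all segments."""
    return [key for segment in segments for key in segment]


def merge_adjacent(segments, start, end):
    """Merge the consecutive run segments[start:end] into a single segment."""
    merged = list(heapq.merge(*segments[start:end]))
    return segments[:start] + [merged] + segments[end:]


def merge_picked(segments, picked):
    """Merge an arbitrary set of segments, as a size-based policy might pick them."""
    merged = list(heapq.merge(*(segments[i] for i in picked)))
    rest = [seg for i, seg in enumerate(segments) if i not in picked]
    position = min(picked)
    return rest[:position] + [merged] + rest[position:]


segments = [[1, 2], [3, 4], [5, 6]]
adjacent = flatten(merge_adjacent(segments, 0, 2))   # stays sorted
scattered = flatten(merge_picked(segments, (0, 2)))  # global order is lost

# Newest-first sort: segments are written oldest batch first, each sorted
# descending inside, so the reader has to view the segments in reverse.
write_order = [[3, 2, 1], [6, 5, 4]]
reader_view = flatten(list(reversed(write_order)))

print(adjacent, scattered, reader_view)
```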
                
> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>
> It would be awesome if Lucene could write the documents out in a segment 
> based on a configurable order.  This of course applies to merging segments 
> too. The benefit is increased locality on disk of documents that are likely to 
> be accessed together.  This often applies to documents near each other in 
> time, but also spatially.
