[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606054#comment-13606054
 ] 

Shai Erera commented on LUCENE-4752:
------------------------------------

I think these are not bad numbers. Sorting is costly, no one argues about that. 
We've discussed in this issue a few possible uses for sorted segments: 
compression, accessing "close" documents together, and early termination.

What are your observations regarding compression? I am not sure the way you 
tested it can yield interesting observations, since you sorted by a random 
field (which means the index may effectively be randomly sorted), but perhaps 
if you sorted by Wikipedia's date value or something else? Just curious.

As for search, perhaps we can quickly hack up IndexSearcher to allow 
terminating per-segment and then compare two Collectors, TopFields and 
TopSortedFields (new), where the latter terminates after visiting the first N 
docs in each segment? Hmm, but in order to do that, we must make sure that each 
segment is sorted (segments that were not hit by the MP are still in random 
order), or we somehow mark on each segment whether it's sorted or not. If we 
can hack this comparison, I think it's worth noting the differences here. 
Actually enabling per-segment termination should happen in a separate issue.
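
A minimal sketch of the comparison described above, written in plain Java 
rather than against Lucene's APIs. Everything here is hypothetical: Segment, 
its "sorted" flag, Result and topN are illustration-only names, and smaller 
values are assumed to rank first. It only shows the idea that a segment marked 
as sorted can stop after its first N docs, while an unsorted segment must be 
scanned fully.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these types are Lucene classes.
class EarlyTerminationSketch {

    /** A toy segment: the sort value of each matching doc, in doc-ID order. */
    static final class Segment {
        final long[] values;
        final boolean sorted; // assumed marker: docs are stored in ascending value order

        Segment(long[] values, boolean sorted) {
            this.values = values;
            this.sorted = sorted;
        }
    }

    /** The top values plus how many docs had to be visited to find them. */
    static final class Result {
        final List<Long> top;
        final int visited;

        Result(List<Long> top, int visited) {
            this.top = top;
            this.visited = visited;
        }
    }

    /** Collects the n smallest values across all segments, in ascending order. */
    static Result topN(List<Segment> segments, int n, boolean earlyTerminate) {
        if (n <= 0) {
            return new Result(new ArrayList<>(), 0);
        }
        int visited = 0;
        // Max-heap of the best n values so far; the head is the worst of them.
        PriorityQueue<Long> heap = new PriorityQueue<>((a, b) -> Long.compare(b, a));
        for (Segment seg : segments) {
            int inSegment = 0;
            for (long v : seg.values) {
                // In a sorted segment the first n docs are already its best n.
                if (earlyTerminate && seg.sorted && inSegment == n) {
                    break;
                }
                visited++;
                inSegment++;
                if (heap.size() < n) {
                    heap.add(v);
                } else if (v < heap.peek()) {
                    heap.poll();
                    heap.add(v);
                }
            }
        }
        List<Long> top = new ArrayList<>(heap);
        Collections.sort(top);
        return new Result(top, visited);
    }
}
```

Both modes return the same top N; the difference to report would be the 
number of visited docs (and, in a real benchmark, the query latency).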

Accessing "close" documents together ... we can make an artificial test which 
accesses documents whose sort-by value falls in a specific range. But that's 
too artificial a test; I'm not sure what it will tell us.
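
To illustrate what such a range test would exercise, here is a small sketch 
in plain Java, not Lucene code. It assumes a segment whose docs are stored in 
ascending sort-value order, so that all docs within a value range occupy one 
contiguous doc-ID span. RangeLocalitySketch, lowerBound and docRange are 
hypothetical names.

```java
// Hypothetical sketch: plain arrays stand in for a sorted segment.
class RangeLocalitySketch {

    /** First index i with values[i] >= key, assuming values is ascending. */
    static int lowerBound(long[] values, long key) {
        int lo = 0;
        int hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Doc-ID span [from, to) holding every value in [min, max]. */
    static int[] docRange(long[] sortedValues, long min, long max) {
        int from = lowerBound(sortedValues, min);
        // Guard against overflow of max + 1.
        int to = (max == Long.MAX_VALUE)
                ? sortedValues.length
                : lowerBound(sortedValues, max + 1);
        // An inverted range (min > max) yields an empty span.
        return new int[] {from, Math.max(from, to)};
    }
}
```

In an unsorted segment the same docs could be scattered over the whole doc-ID 
space, which is the locality difference the test would try to measure.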
                
> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>         Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
> LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, 
> sorting_10M_ingestion.log
>
>
> It would be awesome if Lucene could write the documents out in a segment 
> based on a configurable order.  This of course applies to merging segments 
> too. The benefit is increased locality on disk of documents that are likely to 
> be accessed together.  This often applies to documents near each other in 
> time, but also spatially.
