[
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606054#comment-13606054
]
Shai Erera commented on LUCENE-4752:
------------------------------------
I think these are not bad numbers. Sorting is costly, no one argues about that.
We've discussed in this issue a few possible uses for sorted segments:
compression, accessing "close" documents together, and early termination.
What are your observations regarding compression? I am not sure the way you
tested it can yield interesting observations, since you sort by a random field
(which means the index may be randomly sorted), but perhaps if you sorted by
Wikipedia's date value or something else? Just curious.
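To illustrate why the choice of sort key matters for this question, here is a
toy sketch. It is not Lucene's codec and not code from the patch: it simply
delta-encodes one numeric value per document with zigzag varints and compares
the encoded size for sorted versus shuffled document order. The class name, the
synthetic "date" values and the encoding are all assumptions made for
illustration.

```java
import java.util.Random;

// Toy model only: DeltaSizeSketch and its methods are hypothetical names,
// and this encoding is an assumption, not what Lucene actually writes.
class DeltaSizeSketch {

    /** Maps a signed value to an unsigned one so small negatives stay small. */
    static long zigzag(long v) {
        return (v << 1) ^ (v >> 63);
    }

    /** Number of bytes needed to write v as a 7-bits-per-byte varint. */
    static int varintSize(long v) {
        int bytes = 1;
        while ((v & ~0x7FL) != 0) {
            v >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Total bytes when each value is stored as a delta from the previous one. */
    static long deltaEncodedSize(long[] values) {
        long total = 0;
        long prev = 0;
        for (long v : values) {
            total += varintSize(zigzag(v - prev));
            prev = v;
        }
        return total;
    }

    /** Returns a Fisher-Yates shuffled copy, seeded so runs are repeatable. */
    static long[] shuffledCopy(long[] values, long seed) {
        long[] copy = values.clone();
        Random rnd = new Random(seed);
        for (int i = copy.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            long tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return copy;
    }

    public static void main(String[] args) {
        // Synthetic "date" values: one document every 1000 time units.
        long[] sorted = new long[1000];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = i * 1000L;
        }
        long[] shuffled = shuffledCopy(sorted, 42L);
        System.out.println("sorted order:   " + deltaEncodedSize(sorted) + " bytes");
        System.out.println("shuffled order: " + deltaEncodedSize(shuffled) + " bytes");
    }
}
```

In this toy model the sorted order produces small, uniform deltas, while a
random order produces large ones, which is why an index sorted by a random
field is unlikely to show any compression benefit.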
As for search, perhaps we can quickly hack up IndexSearcher to allow
terminating per-segment and then compare two Collectors, TopFields and
TopSortedFields (new), where the latter terminates after visiting the first N
docs in each segment? Hmm, but in order to do that, we must make sure that each
segment is sorted (as it stands, those that are not hit by MP are still in
random order), or we somehow mark on each segment whether it's sorted or not.
If we can hack this comparison, I think it's worth noting the differences here.
Actually enabling per-segment termination should happen on a separate issue.
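A minimal sketch of the comparison described above, written in plain Java
rather than against the Lucene API. Segment, Result and topN are hypothetical
names, every document is assumed to match the query, and each segment carries
a "sorted" flag as suggested above. A sorted segment is abandoned after its
first N docs; an unsorted one is scanned in full.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: none of these types exist in Lucene.
class EarlyTerminationSketch {

    /** Hypothetical stand-in for a segment. */
    static final class Segment {
        final long[] sortValues; // sortValues[doc] is the sort key of that doc
        final boolean sorted;    // true if docs are stored in ascending key order

        Segment(boolean sorted, long... sortValues) {
            this.sorted = sorted;
            this.sortValues = sortValues;
        }
    }

    /** Top-N keys (ascending) plus the number of docs that were visited. */
    static final class Result {
        final long[] top;
        final int visited;

        Result(long[] top, int visited) {
            this.top = top;
            this.visited = visited;
        }
    }

    /** Collects the n smallest keys, optionally stopping early in sorted segments. */
    static Result topN(List<Segment> segments, int n, boolean earlyTerminate) {
        // Max-heap, so the largest of the current n best keys is evicted first.
        PriorityQueue<Long> heap = new PriorityQueue<>(Comparator.reverseOrder());
        int visited = 0;
        for (Segment seg : segments) {
            int limit = seg.sortValues.length;
            if (earlyTerminate && seg.sorted) {
                // In a sorted segment the first n docs are its n best.
                limit = Math.min(limit, n);
            }
            for (int doc = 0; doc < limit; doc++) {
                visited++;
                heap.offer(seg.sortValues[doc]);
                if (heap.size() > n) {
                    heap.poll();
                }
            }
        }
        // Drain the max-heap from the back to get ascending order.
        long[] top = new long[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll();
        }
        return new Result(top, visited);
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment(true, 1, 4, 7, 9, 12),
                new Segment(false, 8, 2, 6, 3), // e.g. not yet merged, so unsorted
                new Segment(true, 0, 5, 10));
        Result full = topN(segments, 2, false);
        Result early = topN(segments, 2, true);
        System.out.println("full scan:  " + Arrays.toString(full.top)
                + " visited=" + full.visited);
        System.out.println("early term: " + Arrays.toString(early.top)
                + " visited=" + early.visited);
    }
}
```

Both runs return the same top two keys, but on this sample data the
early-terminating run visits 8 documents instead of 12. The unsorted segment
still has to be scanned completely, which is why the per-segment "sorted" mark
matters.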
Accessing "close" documents together ... we can make an artificial test which
accesses documents with sort-by-value in a specific range. But that's too
artificial a test; I'm not sure what it will tell us.
> Merge segments to sort them
> ---------------------------
>
> Key: LUCENE-4752
> URL: https://issues.apache.org/jira/browse/LUCENE-4752
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: David Smiley
> Assignee: Adrien Grand
> Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch,
> LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log,
> sorting_10M_ingestion.log
>
>
> It would be awesome if Lucene could write the documents out in a segment
> based on a configurable order. This of course applies to merging segments
> too. The benefit is increased locality on disk of documents that are likely to
> be accessed together. This often applies to documents near each other in
> time, but also spatially.