[ https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551429#comment-17551429 ]
Adrien Grand commented on LUCENE-10396: --------------------------------------- Another potential use-case we are interested in for Elasticsearch would be to have the ability to visit a single document per field value for a field that is the primary index sort. This has a few applications, one of them is to compute the number of unique values of this primary sort field for documents that match a query. The collector could implement {{LeafCollector#competitiveIterator}} by using the sparse index to skip all documents that have the same value as the last collected hit. > Automatically create sparse indexes for sort fields > --------------------------------------------------- > > Key: LUCENE-10396 > URL: https://issues.apache.org/jira/browse/LUCENE-10396 > Project: Lucene - Core > Issue Type: Task > Reporter: Adrien Grand > Priority: Minor > Attachments: sorted_conjunction.png > > > On Elasticsearch we're more and more leveraging index sorting not as a way to > be able to early terminate sorted queries, but as a way to cluster doc IDs > that share similar properties so that queries can take advantage of it. For > instance imagine you're maintaining a catalog of cars for sale, by sorting by > car type, then fuel type then price. Then all cars with the same type, fuel > type and similar prices will be stored in a contiguous range of doc IDs. > Without index sorting, conjunctions across these 3 fields would be almost a > worst-case scenario as every clause might match lots of documents while their > intersection might not. With index sorting enabled however, there's only a > very small number of calls to advance() that would lead to doc IDs that do > not match, because these advance() calls that do not lead to a match would > always jump over a large number of doc IDs. I had created that example for > ApacheCon last year that demonstrates the benefits of index sorting on > conjunctions. In both cases, the index is storing the same data, it just gets > different doc ID ordering thanks to index sorting: > !sorted_conjunction.png! > While index sorting can help improve query efficiency out-of-the-box, there > is a lot more we can do by taking advantage of the index sort explicitly. For > instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range > queries on fields that are primary sort fields by performing a binary search > to identify the first and last documents that match the range query. I would > like to introduce [sparse > indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields > that are used for index sorting, with the goal of improving the runtime of > {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive > and making it easier and more efficient to leverage index sorting to filter > on subsequent sort fields. A simple form of a sparse index could consist of > storing every N-th values of the fields that are used for index sorting. > In terms of implementation, sparse indexing should be cheap enough that we > wouldn't need to make it configurable and could enable it automatically as > soon as index sorting is enabled. And it would get its own file format > abstraction. -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org