Jim Ferenczi created LUCENE-9507:
------------------------------------

             Summary: Custom order for leaves in DirectoryReader, IndexWriter 
and searcher
                 Key: LUCENE-9507
                 URL: https://issues.apache.org/jira/browse/LUCENE-9507
             Project: Lucene - Core
          Issue Type: New Feature
            Reporter: Jim Ferenczi


Now that we're able [to skip documents efficiently when sorting by a numeric 
field|https://issues.apache.org/jira/browse/LUCENE-9280], I was wondering if we 
could optimize sorted queries further by also sorting the leaf readers based on 
the primary sort.

For time-based indices in Elasticsearch, we've implemented an optimization that 
does this at query time. If the query is sorted by a numeric doc values field, 
we sort the leaves according to the query sort before searching. When sorting 
by timestamp, this small optimization can have a big impact, since early 
termination is reached much faster if the sort values in the segments don't 
overlap too much. Applying this optimization at query time is challenging, 
though: it has the benefit of working with any numeric sort field and order, 
but it requires a multi-reader that reorganizes the segments. It can also be 
surprising that sorted queries may be slower after a force merge to 1 segment, 
since there are no leaves left to reorder.
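
To make the effect of leaf ordering concrete, here is a minimal, self-contained 
sketch in plain Java. It does not use any Lucene API: Segment, visitedForTopK 
and sortedByMaxDescending are hypothetical names invented for this 
illustration, and a segment is reduced to the list of its timestamps plus a 
precomputed maximum. It counts how many documents a top-k collector has to 
visit when whole segments can be skipped once they can no longer be 
competitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class LeafOrderSketch {
    /** Hypothetical stand-in for a segment: only the timestamps of its documents. */
    static final class Segment {
        final long[] timestamps;
        final long max;

        Segment(long... timestamps) {
            this.timestamps = timestamps;
            this.max = Arrays.stream(timestamps).max().orElse(Long.MIN_VALUE);
        }
    }

    /** Counts the documents visited while collecting the k highest timestamps. */
    static int visitedForTopK(List<Segment> leaves, int k) {
        // Min-heap: the head is the weakest hit currently in the top k.
        PriorityQueue<Long> topK = new PriorityQueue<>();
        int visited = 0;
        for (Segment leaf : leaves) {
            // Once the queue is full, skip a segment whose maximum cannot
            // beat the weakest competitive hit.
            if (topK.size() == k && leaf.max <= topK.peek()) {
                continue;
            }
            for (long ts : leaf.timestamps) {
                visited++;
                if (topK.size() < k) {
                    topK.add(ts);
                } else if (ts > topK.peek()) {
                    topK.poll();
                    topK.add(ts);
                }
            }
        }
        return visited;
    }

    /** Returns a copy of the leaves ordered by maximum timestamp, highest first. */
    static List<Segment> sortedByMaxDescending(List<Segment> leaves) {
        List<Segment> sorted = new ArrayList<>(leaves);
        sorted.sort(Comparator.comparingLong((Segment s) -> s.max).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Three non-overlapping segments, oldest first, as in a time-based index.
        List<Segment> leaves = List.of(
            new Segment(1, 2, 3), new Segment(4, 5, 6), new Segment(7, 8, 9));
        System.out.println("index order:  " + visitedForTopK(leaves, 2));
        System.out.println("sorted order: " + visitedForTopK(sortedByMaxDescending(leaves), 2));
    }
}
```

With non-overlapping segments written oldest first, visiting the leaves in 
index order touches every document, while visiting them newest first fills the 
queue from the first segment and skips the rest. The more the segments' value 
ranges overlap, the smaller the gain.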

So, another option that I'm looking at is to add the ability to provide a leaf 
order directly in the IndexWriter and DirectoryReader. That could be similar to 
an index sort, or even complementary to it, since sorting segments based on the 
index sort could also help at query time. For time-based indices that cannot 
afford index sorting but have lots of sorted queries on timestamp, forcing the 
order of segments could speed up sorted queries significantly.

The advantage of forcing a single leaf sort in the writer/reader is that we can 
also use it to influence merges by putting the segments with the highest values 
first. That would help indices that are merged to a single segment but still 
want fast sorted queries. It would also help the multi-segment case, since big 
segments would be more likely to have the highest values first too.
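
The merge case can be illustrated the same way. The sketch below is again 
plain Java with hypothetical names (merge, highestMaxFirst, prefixHoldingTopK) 
and assumes that a merge simply concatenates segments in the order they are 
given, which is a simplification and not a description of how IndexWriter 
merges. As a rough proxy for query cost, it measures how many leading 
documents of the merged segment must be read before the k highest values have 
all been seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MergeOrderSketch {
    /** Concatenates segments into one merged doc-ID space, in the given order. */
    static long[] merge(List<long[]> segments) {
        return segments.stream().flatMapToLong(s -> Arrays.stream(s)).toArray();
    }

    /** Returns a copy of the segments ordered by maximum value, highest first. */
    static List<long[]> highestMaxFirst(List<long[]> segments) {
        List<long[]> ordered = new ArrayList<>(segments);
        ordered.sort(Comparator.comparingLong(
            (long[] s) -> Arrays.stream(s).max().orElse(Long.MIN_VALUE)).reversed());
        return ordered;
    }

    /**
     * Number of leading docs to read before the k highest values have all
     * been seen. Assumes 1 <= k <= merged.length.
     */
    static int prefixHoldingTopK(long[] merged, int k) {
        long[] sorted = merged.clone();
        Arrays.sort(sorted);
        long threshold = sorted[sorted.length - k]; // k-th highest value
        int seen = 0;
        for (int doc = 0; doc < merged.length; doc++) {
            if (merged[doc] >= threshold && ++seen == k) {
                return doc + 1;
            }
        }
        return merged.length;
    }

    public static void main(String[] args) {
        List<long[]> segments = List.of(
            new long[] {1, 2, 3}, new long[] {4, 5, 6}, new long[] {7, 8, 9});
        System.out.println("index order:       "
            + prefixHoldingTopK(merge(segments), 2));
        System.out.println("highest max first: "
            + prefixHoldingTopK(merge(highestMaxFirst(segments)), 2));
    }
}
```

When the segment with the highest values is merged first, the top hits sit at 
the lowest doc IDs of the merged segment, so a collector that can skip 
non-competitive documents has less to read.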



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
