Atri Sharma created LUCENE-8829:
-----------------------------------

             Summary: TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
                 Key: LUCENE-8829
                 URL: https://issues.apache.org/jira/browse/LUCENE-8829
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Atri Sharma


While investigating LUCENE-8819, I found that the order of results returned by 
TopDocs#merge depends indirectly on the number of collectors involved in the 
merge. This is troubling because 1) the number of collectors involved in a 
merge is cost based and depends directly on the number of slices created in 
the parallel searcher case, and 2) the TopN hits code path invokes merge with a 
single Collector. So running the same TopN query with a single-threaded 
searcher and with a parallel searcher can return results in a different order, 
which breaks an invariant that callers should be able to rely on.

 

This happens because of the subtle way TopDocs#merge sets shardIndex on each 
ScoreDoc while populating the priority queue used for merging. shardIndex is 
essentially set to the ordinal of the collector that generated the hit. This 
means that shardIndex depends on the number of collectors, even for the same 
set of hits.

 

When no sort order is specified, shardIndex is used for tie breaking when 
scores are equal. As a result, the same hits can come back in different orders 
when their shardIndex values differ.
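
The effect described above can be sketched with a small self-contained model. 
This is only an illustration, not Lucene's code: the Hit class, the 
mergeByShardIndex method and the hitIndex field are hypothetical stand-ins, and 
the split of docIDs across two slices is an assumed example. All hits tie on 
score, so the order is decided purely by the collector ordinal.

```java
import java.util.ArrayList;
import java.util.List;

class ShardIndexTieBreakDemo {

    /** Simplified stand-in for a scored hit; this is NOT Lucene's ScoreDoc. */
    static final class Hit {
        final int docId;
        final float score;
        final int shardIndex; // ordinal of the collector that produced the hit
        final int hitIndex;   // position of the hit within that collector's results

        Hit(int docId, float score, int shardIndex, int hitIndex) {
            this.docId = docId;
            this.score = score;
            this.shardIndex = shardIndex;
            this.hitIndex = hitIndex;
        }
    }

    /**
     * Merges per-collector results. Each inner array holds one collector's
     * docIDs, and every hit is given the same score so that only the tie
     * breaker matters: lower shardIndex first, then lower hitIndex.
     */
    static List<Integer> mergeByShardIndex(int[][] docIdsPerCollector, float score) {
        List<Hit> hits = new ArrayList<>();
        for (int shard = 0; shard < docIdsPerCollector.length; shard++) {
            for (int i = 0; i < docIdsPerCollector[shard].length; i++) {
                hits.add(new Hit(docIdsPerCollector[shard][i], score, shard, i));
            }
        }
        hits.sort((a, b) -> {
            if (a.score != b.score) {
                return Float.compare(b.score, a.score); // higher score first
            }
            if (a.shardIndex != b.shardIndex) {
                return Integer.compare(a.shardIndex, b.shardIndex);
            }
            return Integer.compare(a.hitIndex, b.hitIndex);
        });
        List<Integer> order = new ArrayList<>();
        for (Hit h : hits) {
            order.add(h.docId);
        }
        return order;
    }

    public static void main(String[] args) {
        // One collector sees all six documents in docID order.
        int[][] singleCollector = {{0, 1, 2, 3, 4, 5}};
        // Assumed split: the first slice happens to cover the higher docIDs.
        int[][] twoCollectors = {{3, 4, 5}, {0, 1, 2}};

        System.out.println(mergeByShardIndex(singleCollector, 1.0f)); // [0, 1, 2, 3, 4, 5]
        System.out.println(mergeByShardIndex(twoCollectors, 1.0f));   // [3, 4, 5, 0, 1, 2]
    }
}
```

The same six equally scored hits come back in two different orders, and the 
only thing that changed is how they were distributed across collectors.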

 

I propose that we remove shardIndex from the default tie-breaking mechanism and 
replace it with docID. DocID order is the de facto order expected during 
collection, so it might make sense to use the same criterion for tie breaking 
when scores are equal.
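
A minimal sketch of what the proposed tie breaker could look like, again as a 
self-contained model rather than a patch against Lucene: the Hit class and the 
mergeByDocId method are hypothetical names, and the scores and the split of 
docIDs across collectors are made-up example data. Hits are ordered by score 
first and by docID on ties, so the collector ordinal never enters the 
comparison.

```java
import java.util.ArrayList;
import java.util.List;

class DocIdTieBreakDemo {

    /** Simplified stand-in for a scored hit; this is NOT Lucene's ScoreDoc. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Merges per-collector results using the proposed ordering: higher score
     * first, and lower docID first when scores are equal. scoresByDocId is
     * indexed by docID; each inner array of docIdsPerCollector holds the
     * docIDs seen by one collector.
     */
    static List<Integer> mergeByDocId(float[] scoresByDocId, int[][] docIdsPerCollector) {
        List<Hit> hits = new ArrayList<>();
        for (int[] collectorDocs : docIdsPerCollector) {
            for (int docId : collectorDocs) {
                hits.add(new Hit(docId, scoresByDocId[docId]));
            }
        }
        hits.sort((a, b) -> {
            if (a.score != b.score) {
                return Float.compare(b.score, a.score); // higher score first
            }
            return Integer.compare(a.docId, b.docId);   // tie: lower docID first
        });
        List<Integer> order = new ArrayList<>();
        for (Hit h : hits) {
            order.add(h.docId);
        }
        return order;
    }

    public static void main(String[] args) {
        // Document 4 scores higher; all other documents tie.
        float[] scores = {1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 1.0f};
        int[][] singleCollector = {{0, 1, 2, 3, 4, 5}};
        int[][] twoCollectors = {{3, 4, 5}, {0, 1, 2}};

        System.out.println(mergeByDocId(scores, singleCollector)); // [4, 0, 1, 2, 3, 5]
        System.out.println(mergeByDocId(scores, twoCollectors));   // [4, 0, 1, 2, 3, 5]
    }
}
```

In this model the merged order is the same for one collector and for two, 
because nothing in the comparison depends on how the hits were partitioned.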

 

CC: [~ivera]



