ercsonusharma commented on code in PR #4277:
URL: https://github.com/apache/solr/pull/4277#discussion_r3091417611


##########
solr/core/src/java/org/apache/solr/handler/component/CombinedQueryComponent.java:
##########
@@ -296,6 +317,10 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest 
sreq) {
     long approximateTotalHits = 0;
     Map<String, List<ShardDoc>> shardDocMap = new HashMap<>();
     String[] queriesToCombineKeys = 
rb.req.getParams().getParams(CombinerParams.COMBINER_QUERY);
+    // Build per-shard set of doc IDs retained after collapse in simpleCombine.

Review Comment:
   Yes, Sure. Here is the overall flow:
                                                                                
                                                                                
                      
   Phase 1: Shard-level : Each shard runs the combined query request. In 
process():                                                                      
                                                                          
                                                                                
                                                                                
                                              
     1. Per-query execution: Each sub-query executes and applies the 
fq={!collapse} on its own, so each produces its own set of group heads. The 
individual results are added to crb.rsp under "response_per_query" — this is 
what the coordinator reads per-query later.
     2. [Cross-query dedup via 
simpleCombine](https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/component/combine/QueryAndResponseCombiner.java#L57):
 All per-query results are merged and collapse is re-applied across the union.
   
   The result is set as crb.rsp's main "response". This is the shard's 
canonical post-collapse result — it contains only the docs that survived 
collapse across all sub-queries.                                                
              
      
     So the shard response sent to the coordinator contains both:               
                                                                                
                                              
     - "response_per_query" — per-query doc lists (each collapsed individually, 
may have cross-query duplicates)
     - "response" — the simpleCombine output (deduplicated across queries)      
                                                                                
                                              
                                                                          
   Phase 2: [Coordinator-level 
(mergeIds())](https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/component/CombinedQueryComponent.java#L300)
   
   The coordinator iterates per-query responses ("response_per_query") from 
each shard to build shardDocMap (query -> list of ShardDocs), which is then 
passed to the RRF combiner.
   
   The problem is: per-query responses can contain docs that were eliminated by 
simpleCombine on the shard. If we feed them into RRF, those eliminated docs get 
reintroduced in the final result.
   
   Example: On shard-1, `q1` picked `id=1` as group head for `mod3_idv=1`, and 
`q2` picked `id=4`. simpleCombine re-collapsed and kept only one (say id=4). 
But both `id=1` (from q1's per-query response) and `id=4` (from q2's) are still 
in "response_per_query". Without filtering, RRF would see both.
   
   [The fix 
](https://github.com/apache/solr/pull/4277/changes#diff-1ca21c13d967412ea6227fe257616a9cb2ac45012b207373a526d3767a46b588R401-R654):
 combinedDocIdsPerShard extracts the doc IDs from each shard's "response" (the 
simpleCombine output). Then, while iterating per-query docs, any doc not in 
that set is skipped. This ensures RRF/Combiner only operates on docs that 
survived the shard-level cross-query collapse.
   
   Tested in 
[testCollapseWithCombinedQueryProducesDuplicates](https://github.com/apache/solr/pull/4277/changes#diff-d18c0ebaeee21519f08c727c3202b4bc8d47a592e886102b07dc99af0a0b48b4R321)
 - sets up exactly this scenario with two queries scoring the same collapse 
group differently, and asserts no duplicates in the final result.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to