ercsonusharma commented on code in PR #4277:
URL: https://github.com/apache/solr/pull/4277#discussion_r3091417611
##########
solr/core/src/java/org/apache/solr/handler/component/CombinedQueryComponent.java:
##########
@@ -296,6 +317,10 @@ protected void mergeIds(ResponseBuilder rb, ShardRequest
sreq) {
long approximateTotalHits = 0;
Map<String, List<ShardDoc>> shardDocMap = new HashMap<>();
String[] queriesToCombineKeys =
rb.req.getParams().getParams(CombinerParams.COMBINER_QUERY);
+ // Build per-shard set of doc IDs retained after collapse in simpleCombine.
Review Comment:
Yes, Sure. Here is the overall flow:
Phase 1: Shard-level : Each shard runs the combined query request. In
process():
1. Per-query execution: Each sub-query executes and applies the
fq={!collapse} on its own, so each produces its own set of group heads. The
individual results are added to crb.rsp under "response_per_query" — this is
what the coordinator reads per-query later.
2. [Cross-query dedup via
simpleCombine](https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/component/combine/QueryAndResponseCombiner.java#L57):
All per-query results are merged and collapse is re-applied across the union.
The result is set as crb.rsp's main "response". This is the shard's
canonical post-collapse result — it contains only the docs that survived
collapse across all sub-queries.
So the shard response sent to the coordinator contains both:
- "response_per_query" — per-query doc lists (each collapsed individually,
may have cross-query duplicates)
- "response" — the simpleCombine output (deduplicated across queries)
Phase 2: [Coordinator-level
(mergeIds())](https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/component/CombinedQueryComponent.java#L300)
The coordinator iterates per-query responses ("response_per_query") from
each shard to build shardDocMap (query -> list of ShardDocs), which is then
passed to the RRF combiner.
The problem is: per-query responses can contain docs that were eliminated by
simpleCombine on the shard. If we feed them into RRF, those eliminated docs get
reintroduced in the final result.
Example: On shard-1, `q1` picked `id=1` as group head for `mod3_idv=1`, and
`q2` picked `id=4`. simpleCombine re-collapsed and kept only one (say id=4).
But both `id=1` (from q1's per-query response) and `id=4` (from q2's) are still
in "response_per_query". Without filtering, RRF would see both.
[The fix
](https://github.com/apache/solr/pull/4277/changes#diff-1ca21c13d967412ea6227fe257616a9cb2ac45012b207373a526d3767a46b588R401-R654):
combinedDocIdsPerShard extracts the doc IDs from each shard's "response" (the
simpleCombine output). Then, while iterating per-query docs, any doc not in
that set is skipped. This ensures RRF/Combiner only operates on docs that
survived the shard-level cross-query collapse.
Tested in
[testCollapseWithCombinedQueryProducesDuplicates](https://github.com/apache/solr/pull/4277/changes#diff-d18c0ebaeee21519f08c727c3202b4bc8d47a592e886102b07dc99af0a0b48b4R321)
- sets up exactly this scenario with two queries scoring the same collapse
group differently, and asserts no duplicates in the final result.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]