jaepil commented on PR #16106:
URL: https://github.com/apache/lucene/pull/16106#issuecomment-4576068603
Confirmed -- the diagnosis in the issue is accurate and this is the right
fix. A short summary for other reviewers, since the effect is subtle.
**What the dead clause does**
`MatchNoDocsQuery` matches no documents (its `ScorerSupplier` is null), so
it never contributes to `logitSum`, yet it was still counted in `totalClauses`.
In the uniform-weight path:
```
score = sigmoid( (logitSum / N) * N^alpha )
      = sigmoid( logitSum * N^(alpha - 1) )    // N = totalClauses
```
A dead clause raises `N` without changing `logitSum`. For `alpha < 1`
(default 0.5), `N^(alpha - 1)` shrinks as `N` grows, so every matching doc's
score is pulled toward `sigmoid(0) = 0.5` -- exactly the dilution described in
the issue.
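
To make the dilution concrete, here is a small standalone numeric sketch of the uniform-weight formula above. It is not Lucene code: `fused_score` and the sample `logit_sum` are hypothetical names and values chosen for illustration, and only the formula itself comes from this comment.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fused_score(logit_sum: float, n: int, alpha: float = 0.5) -> float:
    """Uniform-weight score: sigmoid(logitSum * N^(alpha - 1))."""
    return sigmoid(logit_sum * n ** (alpha - 1.0))


logit_sum = 2.0                        # illustrative value, not taken from the PR
clean = fused_score(logit_sum, n=2)    # two live clauses
diluted = fused_score(logit_sum, n=3)  # same two clauses plus one dead clause counted in N

# The dead clause leaves logit_sum untouched but moves the score toward 0.5.
print(f"N=2: {clean:.4f}  N=3: {diluted:.4f}")
```

With a positive `logit_sum` the diluted score lands strictly between 0.5 and the clean score; with a negative one it is pulled up toward 0.5 instead.
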
**Ranking vs. calibration (why it is worth fixing)**
This can look like a no-op: within a single query `N` is a constant shared
by every doc, so the ordering is preserved and single-query top-K does not
change. What changes is the absolute score, which matters whenever it is
consumed as a value rather than an order: nested in a larger scoring query
(`BooleanQuery`, another fusion layer, a KNN clause), compared against a
min-score / probability threshold, or compared across queries. Those are the
hybrid-search cases this query exists for, and an inflated `N` silently skews
the score so that it is no longer interpretable as a probability.
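
The ranking-versus-calibration distinction can be sketched numerically. This is a standalone illustration under assumed inputs, not Lucene code: the document names, logit sums, and the 0.8 cutoff are all hypothetical, and `fused_score` simply restates the uniform-weight formula quoted earlier.

```python
import math


def fused_score(logit_sum: float, n: int, alpha: float = 0.5) -> float:
    """Uniform-weight score: sigmoid(logitSum * N^(alpha - 1))."""
    return 1.0 / (1.0 + math.exp(-logit_sum * n ** (alpha - 1.0)))


docs = {"doc_a": 2.0, "doc_b": 1.0}  # hypothetical per-document logit sums
threshold = 0.8                      # hypothetical min-score cutoff


def rank(n: int) -> list[str]:
    """Document ids ordered by descending fused score for a given N."""
    return sorted(docs, key=lambda d: fused_score(docs[d], n), reverse=True)


def passing(n: int) -> list[str]:
    """Document ids whose fused score clears the cutoff for a given N."""
    return [d for d in docs if fused_score(docs[d], n) >= threshold]


# N=2 is the clean query; N=3 counts one dead clause.
print(rank(2), rank(3))        # same ordering either way
print(passing(2), passing(3))  # the set clearing the cutoff can differ
```

The ordering is identical for both values of `N`, but a document that clears the cutoff in the clean query can fall below it once the dead clause is counted.
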
**Weight re-normalization is correct**
Re-normalizing the survivors to sum to 1.0 does more than satisfy the
constructor check: it is semantically right. Without it, the dead clause's
weight mass is simply lost (active weights sum to < 1), under-counting the live
signals. Re-normalizing restores the intended relative reliabilities (e.g.
`{0.4, 0.4}` -> `{0.5, 0.5}`) and the sum-to-1 invariant the scorer assumes.
Returning the bare clause for a single survivor also matches the existing
`clauses.size() == 1` unwrapping. The filtered query reaches a rewrite
fixpoint, and the change mirrors the `MatchNoDocsQuery` filtering already done
in `BooleanQuery` / `DisjunctionMaxQuery`.
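
A minimal sketch of the filter-and-re-normalize step described above, written as plain Python rather than the PR's actual Java. The function name, the string stand-ins for clauses, and the `is_dead` predicate are all hypothetical; the case where every clause is dead is left out because this comment does not describe it.

```python
from typing import Callable, Sequence


def filter_and_renormalize(
    clauses: Sequence[str],
    weights: Sequence[float],
    is_dead: Callable[[str], bool],
):
    """Drop dead clauses, then rescale surviving weights to sum to 1.0.

    Returns the bare clause when exactly one survives, mirroring the
    single-clause unwrapping mentioned above; otherwise returns a
    (clauses, weights) pair.
    """
    live = [(c, w) for c, w in zip(clauses, weights) if not is_dead(c)]
    if not live:
        # Not covered by this sketch; the PR's behavior here is not stated.
        raise ValueError("all clauses are dead")
    if len(live) == 1:
        return live[0][0]
    total = sum(w for _, w in live)
    return [c for c, _ in live], [w / total for _, w in live]


dead = lambda c: c == "match_no_docs"  # stand-in for a MatchNoDocsQuery check

result = filter_and_renormalize(
    ["title", "body", "match_no_docs"], [0.4, 0.4, 0.2], dead
)
single = filter_and_renormalize(["title", "match_no_docs"], [0.7, 0.3], dead)
print(result, single)
```

Running the function a second time on its own output changes nothing, which is the fixpoint property the rewrite relies on.
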
The new tests pin the fix structurally (rewritten query `equals` the clean
N-1 clause version). I would recommend adding a scoring test as well: fuse a
query with a dead clause inside an outer `BooleanQuery` and assert the score
equals the dead-clause-free version. Since single-query ranking is unaffected,
the calibration behavior is the part most likely to regress silently, so a
direct score assertion is worth having.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.