jimczi opened a new pull request, #16177:
URL: https://github.com/apache/lucene/pull/16177
### Why
Doc-values queries (numeric ranges, set membership) are naturally
**two-phase**: a
`SkipBlockRangeIterator` approximation that rides the skip index, plus a
`matches()` that
checks the value. That is the form the rest of Lucene drives: two-phase
confirmation, leap-frog conjunctions, `DenseConjunctionBulkScorer`, and
disjunction approximations.
To evaluate dense ranges in bulk (load matches into a bit set, vectorize the
value check),
#16050 added a *second* strategy for the same predicate:
`BatchDocValuesRangeIterator`, a
plain `DocIdSetIterator` with its own `intoBitSet`, selected by a fork inside
`SortedNumericDocValuesRangeQuery`. So today a doc-values range is a
two-phase iterator in
some code paths and a plain DISI in others — two implementations of one
predicate, with
different contracts, kept in lockstep by hand. An optimization applied to
one (SIMD,
multi-valued support, …) carries no signal that it diverges from, or
regresses, the other.
**The goal of this PR is a single strategy: a doc-values query is always a
two-phase query.**
Bulk evaluation should be something the two-phase iterator can do — not a
reason to leave the
two-phase form or to depend on a specific bulk scorer.
### What
Make bulk evaluation a capability of the two-phase iterator itself:
- **`TwoPhaseIterator.intoBitSet(upTo, bitSet, offset)`** — default confirms
`matches()` per
doc (behavior-preserving); a subclass may override it to load matches in
bulk.
`DocValuesRangeIterator` does the vectorized, `upTo`-bounded block walk,
and fully-matching
runs collect as ranges via `docIDRunEnd`.
- **`DenseConjunctionBulkScorer`** drives every two-phase clause through
this single bit-set
path (the parallel leap-frog path is removed); a survivor-aware step
confirms only surviving
bits when an intersection turns sparse, so dense conjunctions don't
regress.
- `SortedNumericDocValuesRangeQuery` drops its fork and returns the two-phase
`DocValuesRangeIterator`; `BatchDocValuesRangeIterator` is removed.
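
To make the first bullet concrete, here is a minimal sketch in plain Java. It does not use Lucene's classes: `SimpleTwoPhase`, `PerDocRange`, `BulkRange`, and `evaluate` are hypothetical stand-ins, `java.util.BitSet` replaces Lucene's bit set, and the data is a small in-memory array with one value per doc. It only illustrates the shape described above (a default that confirms `matches()` per doc, and an override that fills the same bits in one pass); the skip-index block walk and `docIDRunEnd` ranges of the real `DocValuesRangeIterator` are not modeled.

```java
import java.util.BitSet;

/** Hypothetical stand-ins for illustration only; not Lucene's classes or signatures. */
final class TwoPhaseBulkSketch {

  /** Simplified two-phase iterator: an approximation plus an exact matches() check. */
  abstract static class SimpleTwoPhase {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    abstract int docID();       // current candidate, -1 before the first nextDoc()
    abstract int nextDoc();     // advance the approximation
    abstract boolean matches(); // exact check for the current candidate

    /**
     * Default bulk path: confirm matches() one candidate at a time and set bit
     * (doc - offset) for each confirmed doc below upTo. Assumes the iterator is
     * already positioned on a doc >= offset.
     */
    void intoBitSet(int upTo, BitSet bitSet, int offset) {
      for (int doc = docID(); doc < upTo; doc = nextDoc()) {
        if (matches()) {
          bitSet.set(doc - offset);
        }
      }
    }
  }

  /** Range check over one value per doc (doc id == array index); uses the default path. */
  static class PerDocRange extends SimpleTwoPhase {
    final long[] values;
    final long min;
    final long max;
    int doc = -1;

    PerDocRange(long[] values, long min, long max) {
      this.values = values;
      this.min = min;
      this.max = max;
    }

    @Override int docID() { return doc; }

    @Override int nextDoc() {
      if (doc != NO_MORE_DOCS) {
        doc = doc + 1 < values.length ? doc + 1 : NO_MORE_DOCS;
      }
      return doc;
    }

    @Override boolean matches() { return values[doc] >= min && values[doc] <= max; }
  }

  /** Same predicate, but intoBitSet is overridden to scan the window in one tight loop. */
  static final class BulkRange extends PerDocRange {
    BulkRange(long[] values, long min, long max) { super(values, min, max); }

    @Override
    void intoBitSet(int upTo, BitSet bitSet, int offset) {
      if (doc >= upTo) {
        return; // nothing left in this window (also covers NO_MORE_DOCS)
      }
      int end = Math.min(upTo, values.length);
      for (int d = doc; d < end; d++) {
        if (values[d] >= min && values[d] <= max) {
          bitSet.set(d - offset);
        }
      }
      doc = end < values.length ? end : NO_MORE_DOCS; // first doc >= upTo
    }
  }

  /** Positions the iterator on the first doc >= offset, then loads [offset, upTo). */
  static BitSet evaluate(SimpleTwoPhase it, int offset, int upTo) {
    while (it.docID() < offset) {
      it.nextDoc();
    }
    BitSet bits = new BitSet(Math.max(0, upTo - offset));
    it.intoBitSet(upTo, bits, offset);
    return bits;
  }

  public static void main(String[] args) {
    long[] values = {5, 12, 7, 30, 9, 15, 1, 10};
    BitSet perDoc = evaluate(new PerDocRange(values, 7, 12), 0, values.length);
    BitSet bulk = evaluate(new BulkRange(values, 7, 12), 0, values.length);
    System.out.println("per-doc: " + perDoc + ", bulk: " + bulk);
    System.out.println("same result: " + perDoc.equals(bulk));
  }
}
```

The point of the sketch is that both classes expose the same two-phase contract, so a caller never needs to know which one it holds; the override changes only how the bits are produced.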
One representation, one predicate. Bulk evaluation is an override on the
natural two-phase
object, reachable by the standard scorers — not a second iterator type bound
to a specific
bulk scorer.
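
The survivor-aware step mentioned in the `DenseConjunctionBulkScorer` bullet can be pictured with the following sketch. This is one reading of the description, not the PR's code: `SurvivorAwareSketch`, `intersect`, and the `SPARSE_FRACTION` cut-over are hypothetical, each clause is modeled as a plain `IntPredicate`, and `java.util.BitSet` stands in for Lucene's window bit set. While many docs survive, a clause is loaded for the whole window and AND-ed in; once few survive, only the surviving bits are confirmed.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical illustration of a survivor-aware conjunction over one window. */
final class SurvivorAwareSketch {

  /** Assumed cut-over: below this fraction of the window, confirm survivors one by one. */
  static final double SPARSE_FRACTION = 0.125;

  static BitSet intersect(int windowSize, List<IntPredicate> clauses) {
    BitSet survivors = new BitSet(windowSize);
    survivors.set(0, windowSize); // every doc in the window starts as a candidate

    for (IntPredicate clause : clauses) {
      if (survivors.cardinality() >= windowSize * SPARSE_FRACTION) {
        // Dense: evaluate the clause over the whole window, then AND the bit sets.
        BitSet clauseBits = new BitSet(windowSize);
        for (int doc = 0; doc < windowSize; doc++) {
          if (clause.test(doc)) {
            clauseBits.set(doc);
          }
        }
        survivors.and(clauseBits);
      } else {
        // Sparse: visit only the bits that are still set and clear the non-matches.
        for (int doc = survivors.nextSetBit(0); doc >= 0; doc = survivors.nextSetBit(doc + 1)) {
          if (!clause.test(doc)) {
            survivors.clear(doc);
          }
        }
      }
    }
    return survivors;
  }

  public static void main(String[] args) {
    int[] secondClauseCalls = {0};
    IntPredicate first = doc -> doc % 16 == 0; // leaves 4 of 64 docs
    IntPredicate second = doc -> {
      secondClauseCalls[0]++;
      return doc >= 32;
    };
    BitSet result = intersect(64, List.of(first, second));
    System.out.println("matches: " + result);
    System.out.println("second clause evaluated " + secondClauseCalls[0] + " times");
  }
}
```

In this toy run the second clause is checked only for the docs that survived the first one, which is the effect the bullet describes for intersections that turn sparse.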
### Benchmarks
The cleanup also wins where it matters. `MultiFieldDocValuesRangeBenchmark`,
1M docs,
conjunction of N range clauses (throughput, ops/s), shown against two
references: the
two-phase form *before* #16050 turned single-valued ranges into a plain
`DocIdSetIterator`,
and current `main` (with #16050).
| pattern (N=3 / N=5) | two-phase (pre-#16050) | `main` | this PR | vs two-phase | vs `main` |
| ---------------------------- | ---------------------- | ------------- | ------------- | ----------------- | ----------- |
| random (selective) | 82 / 60 | 76 / 74 | 145 / 135 | **+76% / +124%** | +89% / +83% |
| dense (~70%) | 59 / 43 | 113 / 70 | 118 / 71 | **+98% / +64%** | +4% / +2% |
| clustered (disjoint, ~empty) | 22.8k / 17.0k | 90.4k / 85.4k | 56.1k / 54.2k | **+146% / +220%** | −38% / −37% |
Against the two-phase form, which is what every other doc-values query uses,
this is a uniform win (+64–220%). The only row that trails `main` is the
synthetic disjoint `clustered` case, whose intersection is essentially empty;
there #16050's plain DISI is faster (this PR is 37–38% slower), but even there
this PR is +146–220% over the two-phase baseline.
Single range, phrase conjunctions, and sparse non-skip ranges are neutral
(±4%); a sparse
single skip-indexed range is +41%.
### Notes
- Single-clause and phrase paths are covered by tests and benchmarks; they
are unaffected or improved.
- A guard rejects a wrapped two-phase iterator passed as a plain iterator (so
it can never be silently consumed as a plain DISI); a test asserts that
two-phase runs reach `collectRange`.