jimczi opened a new pull request, #16177:
URL: https://github.com/apache/lucene/pull/16177

   ### Why
   
   Doc-values queries (numeric ranges, set membership) are naturally **two-phase**: a
   `SkipBlockRangeIterator` approximation that rides the skip index, plus a `matches()` that
   checks the value. That is the form the rest of Lucene drives: two-phase confirmation,
   leap-frog conjunctions, `DenseConjunctionBulkScorer`, and disjunction approximations.
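
As a rough illustration of the two-phase shape described above, the sketch below uses simplified stand-ins rather than Lucene's real classes. The `candidates` array plays the role of the skip-index-backed approximation, and the `IntPredicate` plays the role of `matches()`. The class name, method name, and data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Simplified stand-in for a two-phase doc-values range check; not Lucene's API.
class TwoPhaseSketch {

  // Phase 1: `candidates` is a cheap, possibly over-inclusive, sorted list of doc IDs.
  // Phase 2: `matches` reads the actual value and confirms or rejects each candidate.
  static List<Integer> collectMatches(int[] candidates, IntPredicate matches) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : candidates) {
      if (matches.test(doc)) { // only docs proposed by the approximation are checked
        hits.add(doc);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    long[] values = {5, 42, 17, 99, 23, 8}; // hypothetical single-valued field, indexed by doc ID
    long min = 10, max = 50;
    // Pretend the skip index ruled out docs 0 and 5 without reading their values.
    int[] candidates = {1, 2, 3, 4};
    List<Integer> hits =
        collectMatches(candidates, doc -> values[doc] >= min && values[doc] <= max);
    System.out.println(hits); // prints [1, 2, 4]
  }
}
```

The point of the split is that the expensive value read happens only for docs the approximation could not exclude.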
   
   To evaluate dense ranges in bulk (load matches into a bit set, vectorize the value check),
   #16050 added a *second* strategy for the same predicate: `BatchDocValuesRangeIterator`, a
   plain `DocIdSetIterator` with its own `intoBitSet`, selected by a fork inside
   `SortedNumericDocValuesRangeQuery`. So today a doc-values range is a two-phase iterator in
   some code paths and a plain DISI in others: two implementations of one predicate, with
   different contracts, kept in lockstep by hand. An optimization applied to one (SIMD,
   multi-valued support, …) carries no signal that it diverges from, or regresses, the other.
   
   **The goal of this PR is a single strategy: a doc-values query is always a two-phase query.**
   Bulk evaluation should be something the two-phase iterator can do, not a reason to leave the
   two-phase form or to depend on a specific bulk scorer.
   
   ### What
   
   Make bulk evaluation a capability of the two-phase iterator itself:
   
   - **`TwoPhaseIterator.intoBitSet(upTo, bitSet, offset)`**: the default confirms `matches()` per
     doc (behavior-preserving); a subclass may override it to load matches in bulk.
     `DocValuesRangeIterator` does the vectorized, `upTo`-bounded block walk, and fully-matching
     runs collect as ranges via `docIDRunEnd`.
   - **`DenseConjunctionBulkScorer`** drives every two-phase clause through this single bit-set
     path (the parallel leap-frog path is removed); a survivor-aware step confirms only surviving
     bits when an intersection turns sparse, so dense conjunctions don't regress.
   - `SortedNumericDocValuesRangeQuery` drops its fork and returns the two-phase
     `DocValuesRangeIterator`; `BatchDocValuesRangeIterator` is removed.
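
The first bullet can be pictured with a small, self-contained sketch: a per-doc default `intoBitSet` and a subclass that overrides it with a tight loop over the values. These are toy stand-ins using `java.util.BitSet`, not Lucene's `TwoPhaseIterator` or `FixedBitSet`. The `doc - offset` bit mapping, the dense value array, and the class names are assumptions made for illustration.

```java
import java.util.BitSet;

// Toy stand-ins, not Lucene's classes: a per-doc default and a bulk override.
class IntoBitSetSketch {

  static class RangeTwoPhase {
    final long[] values; // hypothetical dense single-valued field, indexed by doc ID
    final long min, max;
    int doc = 0; // current doc; every doc is a candidate in this toy approximation

    RangeTwoPhase(long[] values, long min, long max) {
      this.values = values;
      this.min = min;
      this.max = max;
    }

    boolean matches() {
      return values[doc] >= min && values[doc] <= max;
    }

    // Default strategy: confirm matches() one doc at a time, for docs below upTo.
    void intoBitSet(int upTo, BitSet bitSet, int offset) {
      int end = Math.min(upTo, values.length);
      for (; doc < end; doc++) {
        if (matches()) {
          bitSet.set(doc - offset);
        }
      }
    }
  }

  static class BulkRangeTwoPhase extends RangeTwoPhase {
    BulkRangeTwoPhase(long[] values, long min, long max) {
      super(values, min, max);
    }

    // Override: same bits as the default, computed in one pass over the array
    // without calling matches() per doc.
    @Override
    void intoBitSet(int upTo, BitSet bitSet, int offset) {
      int end = Math.min(upTo, values.length);
      for (int d = doc; d < end; d++) {
        if (values[d] >= min && values[d] <= max) {
          bitSet.set(d - offset);
        }
      }
      doc = Math.max(doc, end); // leave the iterator positioned at the window end
    }
  }

  public static void main(String[] args) {
    long[] values = {5, 42, 17, 99, 23, 8};
    BitSet perDoc = new BitSet();
    new RangeTwoPhase(values, 10, 50).intoBitSet(6, perDoc, 0);
    BitSet bulk = new BitSet();
    new BulkRangeTwoPhase(values, 10, 50).intoBitSet(6, bulk, 0);
    System.out.println(perDoc + " " + bulk); // prints {1, 2, 4} {1, 2, 4}
  }
}
```

Callers see one method with one contract; whether the bits are produced per doc or in bulk is an implementation detail of the iterator.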
   
   One representation, one predicate. Bulk evaluation is an override on the natural two-phase
   object, reachable by the standard scorers, not a second iterator type bound to a specific
   bulk scorer.
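
One way to picture the survivor-aware step from the `DenseConjunctionBulkScorer` bullet is sketched below. It is not the PR's code. The `SPARSE_FRACTION` cutoff, the `applyClause` helper, and the use of `java.util.BitSet` are assumptions, and the actual heuristic in the PR may differ.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical illustration of a survivor-aware conjunction step; not Lucene's code.
class SurvivorAwareSketch {

  // Assumed cutoff: below this fraction of the window, treat the intersection as sparse.
  static final double SPARSE_FRACTION = 1.0 / 8;

  // Narrows `survivors` (bit i stands for doc windowBase + i) to docs that also
  // satisfy `matches`. Returns true if the sparse path was taken.
  static boolean applyClause(
      BitSet survivors, int windowBase, int windowSize, IntPredicate matches) {
    if (survivors.cardinality() < windowSize * SPARSE_FRACTION) {
      // Sparse: confirm only the bits that survived earlier clauses.
      for (int i = survivors.nextSetBit(0); i >= 0; i = survivors.nextSetBit(i + 1)) {
        if (!matches.test(windowBase + i)) {
          survivors.clear(i);
        }
      }
      return true;
    }
    // Dense: evaluate the clause over the whole window, then intersect.
    BitSet clause = new BitSet(windowSize);
    for (int i = 0; i < windowSize; i++) {
      if (matches.test(windowBase + i)) {
        clause.set(i);
      }
    }
    survivors.and(clause);
    return false;
  }

  public static void main(String[] args) {
    BitSet survivors = new BitSet(64);
    survivors.set(3);
    survivors.set(40);
    boolean sparse = applyClause(survivors, 0, 64, doc -> doc % 2 == 0);
    System.out.println(sparse + " " + survivors); // prints true {40}
  }
}
```

Both branches produce the same set of surviving docs; the branch only decides how many value checks are spent getting there.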
   
   ### Benchmarks
   
   The cleanup also wins where it matters. `MultiFieldDocValuesRangeBenchmark`, 1M docs,
   conjunction of N range clauses (throughput, ops/s), shown against two references: the
   two-phase form *before* #16050 turned single-valued ranges into a plain `DocIdSetIterator`,
   and current `main` (with #16050).
   
   | pattern (N=3 / N=5)          | two-phase (pre-#16050) | `main`        | this PR       | vs two-phase      | vs `main`   |
   | ---------------------------- | ---------------------- | ------------- | ------------- | ----------------- | ----------- |
   | random (selective)           | 82 / 60                | 76 / 74       | 145 / 135     | **+76% / +124%**  | +89% / +83% |
   | dense (~70%)                 | 59 / 43                | 113 / 70      | 118 / 71      | **+98% / +64%**   | +4% / +2%   |
   | clustered (disjoint, ~empty) | 22.8k / 17.0k          | 90.4k / 85.4k | 56.1k / 54.2k | **+146% / +220%** | −38% / −37% |
   
   Against the two-phase form, which every other doc-values query uses, this is a uniform win
   (+64–220%). The one row trailing `main` is the synthetic disjoint `clustered` case, whose
   intersection is essentially empty; there #16050's plain DISI is faster, but even there this
   PR is +146–220% over the two-phase baseline.
   
   Single range, phrase conjunctions, and sparse non-skip ranges are neutral (±4%); a sparse
   single skip-indexed range is +41%.
   
   ### Notes
   
   - Single-clause and phrase paths are covered by tests and benchmarks; they are unaffected or
     improved.
   - A guard rejects a wrapped two-phase iterator passed as a plain iterator (so it can never be
     silently consumed as a plain DISI); a test asserts two-phase runs reach `collectRange`.
   

