iprithv commented on PR #15991:
URL: https://github.com/apache/lucene/pull/15991#issuecomment-4441077434
@romseygeek docIDRunEnd() is consumed today by DenseConjunctionBulkScorer,
ReqExclBulkScorer, and SkipBlockRangeIterator. MaxScoreBulkScorer,
ConjunctionBulkScorer, BlockMaxConjunctionBulkScorer and DefaultBulkScorer all
use leap-frog advance() and ignore run-end. (grep docIDRunEnd under
lucene/core/src/java/org/apache/lucene/search/ to verify)
DenseConjunctionBulkScorer is selected from two places:
ConstantScoreScorerSupplier.bulkScorer() and
BooleanScorerSupplier.requiredBulkScorer(). The boolean path is gated on
requiredScoring.isEmpty(). The shape this PR targets (scoring MUST +
primary-sort-aligned FILTER, e.g. BM25 text query with a date/category filter)
puts the MUST scorer in requiredScoring, so that branch is bypassed. We land in
ConjunctionBulkScorer (for COMPLETE / COMPLETE_NO_SCORES) or MaxScoreBulkScorer
(for TOP_SCORES), neither of which currently reads docIDRunEnd().
BooleanQuery.rewriteNoScoring() would flip MUST→FILTER, but it's only invoked
from ConstantScoreQuery#rewrite, not from searcher.search(...,
TopFieldCollectorManager(...)), so even the no-scores case keeps the MUST
in requiredScoring.
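To make the routing above easier to follow, here is a rough sketch of it as a plain decision function. It only restates what this comment describes; the class, enum and method names are hypothetical, and the real checks in BooleanScorerSupplier involve more conditions than this.

```java
/** Hypothetical paraphrase of the routing described in this comment; not Lucene code. */
final class BulkScorerSelectionSketch {
  /** Stand-in for the score modes named in the comment. */
  enum Mode { COMPLETE, COMPLETE_NO_SCORES, TOP_SCORES }

  /**
   * Returns the name of the bulk scorer the comment says is reached.
   * requiredScoringIsEmpty models the requiredScoring.isEmpty() gate.
   */
  static String select(boolean requiredScoringIsEmpty, Mode mode) {
    if (requiredScoringIsEmpty) {
      // Only this branch reaches the scorer that reads docIDRunEnd().
      return "DenseConjunctionBulkScorer";
    }
    // A scoring MUST clause keeps requiredScoring non-empty, so we land here.
    return mode == Mode.TOP_SCORES ? "MaxScoreBulkScorer" : "ConjunctionBulkScorer";
  }

  public static void main(String[] args) {
    // Scoring MUST + FILTER: requiredScoring is non-empty for every mode.
    for (Mode m : Mode.values()) {
      System.out.println(m + " -> " + select(false, m));
    }
  }
}
```

With a scoring MUST present, none of the three modes reaches the dense path in this model.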
For the term-filter case you mentioned specifically: PostingsEnum inherits
the default docIDRunEnd() = docID()+1, so even if all docs in a primary-sort
segment match the term contiguously, today's postings iterator doesn't expose
that run to DenseConjunctionBulkScorer.
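A minimal, standalone sketch of the term-filter point, assuming a simplified iterator model rather than Lucene's actual DocIdSetIterator / PostingsEnum classes. SketchIterator, ArrayPostings and RunAwarePostings are hypothetical names used only for illustration; the one behaviour taken from the discussion is the default run end of docID()+1.

```java
/** Simplified stand-in for a doc ID iterator; NOT Lucene's DocIdSetIterator. */
abstract class SketchIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  abstract int docID();

  abstract int advance(int target);

  /** Default described above: the current doc is reported as a run of length one. */
  int docIDRunEnd() {
    return docID() + 1;
  }
}

/** Postings-like iterator over a sorted array; inherits the default run end. */
class ArrayPostings extends SketchIterator {
  final int[] docs;
  int idx = -1;

  ArrayPostings(int... docs) {
    this.docs = docs;
  }

  @Override
  int docID() {
    if (idx < 0) return -1;
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  @Override
  int advance(int target) {
    if (idx < 0) idx = 0;
    while (idx < docs.length && docs[idx] < target) idx++;
    return docID();
  }
}

/** Hypothetical wrapper that reports the real end of the contiguous run. */
class RunAwarePostings extends ArrayPostings {
  RunAwarePostings(int... docs) {
    super(docs);
  }

  /** Only meaningful while positioned on a real doc (not -1 or NO_MORE_DOCS). */
  @Override
  int docIDRunEnd() {
    int end = idx;
    while (end + 1 < docs.length && docs[end + 1] == docs[end] + 1) end++;
    return docs[end] + 1;
  }
}

final class RunEndDemo {
  public static void main(String[] args) {
    ArrayPostings plain = new ArrayPostings(10, 11, 12, 13, 20);
    RunAwarePostings aware = new RunAwarePostings(10, 11, 12, 13, 20);
    plain.advance(10);
    aware.advance(10);
    System.out.println("default run end:   " + plain.docIDRunEnd());
    System.out.println("run-aware run end: " + aware.docIDRunEnd());
  }
}
```

Positioned on doc 10, the plain iterator reports a run end of 11 even though docs 10 to 13 are contiguous, while the wrapper reports 14. The linear scan in the wrapper is only for illustration; a real implementation would presumably derive the run from block or index metadata.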
Covering this shape with the approach you mentioned would need:
1. Postings (or a wrapper) exposing a real docIDRunEnd() on
primary-sort-aligned terms.
2. MaxScoreBulkScorer and ConjunctionBulkScorer learning to consume
docIDRunEnd() for range-skipping.
3. For COMPLETE_NO_SCORES, a path that converts MUST→FILTER outside of
ConstantScoreQuery#rewrite.
Even with all three, TOP_SCORES would remain uncovered, and that is where this
PR shows the biggest delta (+74% in benchmarkSortedTopScores; the rest are +44%
to +47% across the COMPLETE / COMPLETE_NO_SCORES modes).
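To illustrate what item 2 in the list above would mean in practice, here is a small standalone sketch contrasting leap-frog intersection with a loop that consumes a run end. It is a toy model under stated assumptions, not the logic of ConjunctionBulkScorer or MaxScoreBulkScorer: RangeFilter, leapFrog and runAware are hypothetical names, the lead clause is a sorted int array, and the filter matches one contiguous doc ID range, as a primary-sort-aligned filter would.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy comparison of leap-frog vs run-aware conjunction; not Lucene code. */
final class RunSkippingSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical filter matching every doc in [min, max). */
  static final class RangeFilter {
    final int min;
    final int max;
    int advanceCalls = 0; // counts how often the filter is consulted

    RangeFilter(int min, int max) {
      this.min = min;
      this.max = max;
    }

    int advance(int target) {
      advanceCalls++;
      return target >= max ? NO_MORE_DOCS : Math.max(target, min);
    }

    /** Valid only while positioned on a match: the whole range is one run. */
    int docIDRunEnd() {
      return max;
    }
  }

  /** Classic leap-frog: the filter is advanced once per candidate lead doc. */
  static List<Integer> leapFrog(int[] lead, RangeFilter filter) {
    List<Integer> hits = new ArrayList<>();
    int i = 0;
    while (i < lead.length) {
      int f = filter.advance(lead[i]);
      if (f == NO_MORE_DOCS) break;
      if (f == lead[i]) {
        hits.add(lead[i++]);
      } else {
        while (i < lead.length && lead[i] < f) i++; // catch the lead up
      }
    }
    return hits;
  }

  /** Run-aware: after one match, accept lead docs up to the run end unchecked. */
  static List<Integer> runAware(int[] lead, RangeFilter filter) {
    List<Integer> hits = new ArrayList<>();
    int i = 0;
    while (i < lead.length) {
      int f = filter.advance(lead[i]);
      if (f == NO_MORE_DOCS) break;
      if (f == lead[i]) {
        int runEnd = filter.docIDRunEnd();
        while (i < lead.length && lead[i] < runEnd) hits.add(lead[i++]);
      } else {
        while (i < lead.length && lead[i] < f) i++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int[] lead = {3, 50, 51, 60, 75, 120};
    RangeFilter a = new RangeFilter(50, 100);
    RangeFilter b = new RangeFilter(50, 100);
    System.out.println(leapFrog(lead, a) + " filter advances: " + a.advanceCalls);
    System.out.println(runAware(lead, b) + " filter advances: " + b.advanceCalls);
  }
}
```

Both loops return the same hits; in this example the run-aware loop consults the filter 3 times instead of 6. The sketch ignores scoring and block-max logic entirely, which is where the TOP_SCORES case gets harder.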
I agree that making cost() and docIDRunEnd() the durable primitives and
letting bulk scorers act on them is the right long-term architecture. A more
comprehensive docIDRunEnd() investment would let this PR's wrapper shrink or
disappear over time, so this PR is meant to complement that work rather than
replace it.
Regarding complexity: the scope grew in response to review feedback asking for
the interface to be implemented directly on PointRangeQuery,
SortedNumericDocValuesRangeQuery, and SortedSetDocValuesRangeQuery, in addition
to the original numeric range path.
Thanks!