iprithv commented on PR #15991:
URL: https://github.com/apache/lucene/pull/15991#issuecomment-4441077434

   @romseygeek docIDRunEnd() is consumed today by DenseConjunctionBulkScorer, 
ReqExclBulkScorer, and SkipBlockRangeIterator. MaxScoreBulkScorer, 
ConjunctionBulkScorer, BlockMaxConjunctionBulkScorer and DefaultBulkScorer all 
use leap-frog advance() and ignore run-end. (grep docIDRunEnd under 
lucene/core/src/java/org/apache/lucene/search/ to verify)
   
   DenseConjunctionBulkScorer is selected from two places: 
ConstantScoreScorerSupplier.bulkScorer() and 
BooleanScorerSupplier.requiredBulkScorer(). The boolean path is gated on 
requiredScoring.isEmpty(). The shape this PR targets (scoring MUST + 
primary-sort-aligned FILTER, e.g. a BM25 text query with a date/category 
filter) puts the MUST scorer in requiredScoring, so that branch is bypassed. We 
land in ConjunctionBulkScorer (for COMPLETE / COMPLETE_NO_SCORES) or 
MaxScoreBulkScorer (for TOP_SCORES), neither of which currently reads 
docIDRunEnd(). BooleanQuery.rewriteNoScoring() would flip MUST→FILTER, but it's 
only invoked from ConstantScoreQuery#rewrite, not from searcher.search(..., 
TopFieldCollectorManager(...)), so even the COMPLETE_NO_SCORES case still keeps 
the MUST in requiredScoring.
   
   For the term-filter case you mentioned specifically: PostingsEnum inherits 
the default docIDRunEnd() = docID()+1, so even if all docs in a primary-sort 
segment match the term contiguously, today's postings iterator doesn't expose 
that run to DenseConjunctionBulkScorer.
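
   To make the gap concrete, here is a minimal standalone sketch. It does not 
use Lucene's classes: SimpleIterator, ArrayIterator and RunAwareIterator are 
hypothetical stand-ins that only mirror the docID() / docIDRunEnd() contract 
described above, and the postings array is made up for illustration.

```java
public class RunEndSketch {

  /** Hypothetical stand-in for the iterator contract; this is not Lucene's class. */
  interface SimpleIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    int docID();

    int advance(int target);

    /** Mirrors the default described above: every doc is a run of length one. */
    default int docIDRunEnd() {
      return docID() + 1;
    }
  }

  /** Walks a sorted array of doc IDs and keeps the default run end. */
  static class ArrayIterator implements SimpleIterator {
    final int[] docs;
    int idx = -1;

    ArrayIterator(int[] docs) {
      this.docs = docs;
    }

    @Override
    public int docID() {
      if (idx < 0) return -1;
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    @Override
    public int advance(int target) {
      if (idx < 0) idx = 0;
      while (idx < docs.length && docs[idx] < target) idx++;
      return docID();
    }
  }

  /** Same postings, but reports the exclusive end of the contiguous run. */
  static class RunAwareIterator extends ArrayIterator {
    RunAwareIterator(int[] docs) {
      super(docs);
    }

    @Override
    public int docIDRunEnd() {
      if (idx < 0 || idx >= docs.length) return super.docIDRunEnd();
      int j = idx;
      // Extend the run while the next posting is the next consecutive doc ID.
      while (j + 1 < docs.length && docs[j + 1] == docs[j] + 1) j++;
      return docs[j] + 1;
    }
  }

  public static void main(String[] args) {
    int[] postings = {5, 6, 7, 8, 20};
    ArrayIterator plain = new ArrayIterator(postings);
    RunAwareIterator aware = new RunAwareIterator(postings);
    plain.advance(5);
    aware.advance(5);
    System.out.println("default run end:   " + plain.docIDRunEnd()); // 6
    System.out.println("run-aware run end: " + aware.docIDRunEnd()); // 9
  }
}
```

   With the default, a consumer positioned on doc 5 only learns that doc 5 
matches; the run-aware variant tells it that docs 5 through 8 all match.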
   
   Covering this shape with the approach you mentioned would need all of the 
following: 
   1. Postings (or a wrapper) exposing a real docIDRunEnd() on 
primary-sort-aligned terms.
   2. MaxScoreBulkScorer and ConjunctionBulkScorer learning to consume 
docIDRunEnd() for range-skipping.
   3. For COMPLETE_NO_SCORES, a path that converts MUST→FILTER outside of 
ConstantScoreQuery#rewrite. 
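
   As a rough illustration of what item 2 means, the sketch below compares a 
plain leap-frog conjunction with one that trusts the filter's run end. It is a 
toy model, not Lucene code: RangeFilter, intersect and the advanceCalls counter 
are hypothetical names, and the filter is assumed to match one contiguous doc 
ID range, as it would on a primary-sort-aligned segment.

```java
import java.util.ArrayList;
import java.util.List;

public class RunEndConjunctionSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical filter matching the contiguous doc ID range [min, max). */
  static class RangeFilter {
    final int min;
    final int max;
    int doc = -1;
    int advanceCalls = 0; // counts how often the conjunction had to consult the filter

    RangeFilter(int min, int max) {
      this.min = min;
      this.max = max;
    }

    int advance(int target) {
      advanceCalls++;
      if (target < min) doc = min;
      else if (target < max) doc = target;
      else doc = NO_MORE_DOCS;
      return doc;
    }

    /** Exclusive end of the matching run; only meaningful once positioned on a doc. */
    int docIDRunEnd() {
      return doc == NO_MORE_DOCS ? NO_MORE_DOCS : max;
    }
  }

  /** Intersects sorted lead docs (the scoring clause) with the filter. */
  static List<Integer> intersect(int[] leadDocs, RangeFilter filter, boolean useRunEnd) {
    List<Integer> hits = new ArrayList<>();
    int runEnd = 0; // exclusive end of the filter run known to match; empty at first
    int i = 0;
    while (i < leadDocs.length) {
      int doc = leadDocs[i];
      if (useRunEnd && doc < runEnd) {
        // Inside a known run: accept without touching the filter.
        hits.add(doc);
        i++;
        continue;
      }
      int f = filter.doc;
      if (f < doc) f = filter.advance(doc);
      if (f == NO_MORE_DOCS) break;
      if (f == doc) {
        hits.add(doc);
        i++;
        if (useRunEnd) runEnd = filter.docIDRunEnd();
      } else {
        // Filter is ahead: move the lead forward to the filter's doc.
        while (i < leadDocs.length && leadDocs[i] < f) i++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int[] lead = {3, 10, 12, 15, 18, 40};
    RangeFilter a = new RangeFilter(10, 20);
    RangeFilter b = new RangeFilter(10, 20);
    System.out.println(intersect(lead, a, false) + " advance calls: " + a.advanceCalls);
    System.out.println(intersect(lead, b, true) + " advance calls: " + b.advanceCalls);
    // prints:
    // [10, 12, 15, 18] advance calls: 5
    // [10, 12, 15, 18] advance calls: 2
  }
}
```

   Both variants return the same hits; the run-aware one simply stops asking 
the filter about docs it already knows fall inside the matching run.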
    
   Even with all three in place, TOP_SCORES would still be uncovered, and that 
is where this PR shows the biggest delta (+74% in benchmarkSortedTopScores; the 
rest are +44% to +47% across the COMPLETE / COMPLETE_NO_SCORES modes).
   
   I agree that making cost() and docIDRunEnd() the durable primitives and 
letting bulk scorers act on them is the right long-term architecture. A more 
comprehensive docIDRunEnd() investment would let this PR's wrapper shrink or 
disappear over time. This PR is meant to be complementary to that work rather 
than a replacement for it.
   
   Regarding complexity: the scope grew in response to review feedback asking 
to implement the interface directly on PointRangeQuery, 
SortedNumericDocValuesRangeQuery, and SortedSetDocValuesRangeQuery in addition 
to the original numeric range path. 
Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

