iprithv opened a new pull request, #15971:
URL: https://github.com/apache/lucene/pull/15971

   ## Description
   
   In `MaxScoreBulkScorer.scoreInnerWindowMultipleEssentialClauses()`, the 
`cardinality()` call was used solely to pre-size the `docAndScoreAccBuffer` 
before extracting matches from the bitset via `forEach()`. This resulted in two 
full passes over the bitset's 64 longs (for `INNER_WINDOW_SIZE=4096`): one for 
counting, one for extraction.
   
   This change replaces `growNoCopy(windowMatches.cardinality(0, 
innerWindowSize))` with `growNoCopy(INNER_WINDOW_SIZE)`, eliminating the 
counting pass entirely. The buffer is reused across inner windows, so the 
one-time over-allocation (~48KB for `int[] + double[]`) is negligible.
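
A minimal sketch of the before/after shapes described above, assuming a plain `long[]` as a stand-in for the window bitset and a static `int[]` as a stand-in for the reusable buffer. The class `WindowCollectSketch`, its methods, and its fields are hypothetical names for illustration, not Lucene's API; only the idea (count then extract versus pre-size then extract) comes from the description.

```java
/** Hypothetical stand-in for the window bitset and reusable buffer; not Lucene's classes. */
final class WindowCollectSketch {
  static final int INNER_WINDOW_SIZE = 4096;

  // Reusable buffer shared across windows (assumption: contents need not survive a grow).
  static int[] docs = new int[0];

  /** Grows the buffer to at least minSize without copying old contents. */
  static void growNoCopy(int minSize) {
    if (docs.length < minSize) {
      docs = new int[minSize];
    }
  }

  /** Before: one pass to count set bits, then a second pass to extract them. */
  static int collectWithCardinality(long[] bits, int base) {
    int cardinality = 0;
    for (long word : bits) {
      cardinality += Long.bitCount(word);
    }
    growNoCopy(cardinality);
    return extract(bits, base);
  }

  /** After: size for the worst case (a no-op once grown), then a single extraction pass. */
  static int collectPreSized(long[] bits, int base) {
    growNoCopy(INNER_WINDOW_SIZE);
    return extract(bits, base);
  }

  /** Writes base + bitIndex for every set bit into docs and returns the match count. */
  static int extract(long[] bits, int base) {
    int size = 0;
    for (int i = 0; i < bits.length; i++) {
      long word = bits[i];
      while (word != 0) {
        int bit = Long.numberOfTrailingZeros(word);
        docs[size++] = base + (i << 6) + bit;
        word &= word - 1; // clear the lowest set bit
      }
    }
    return size;
  }
}
```

Both methods fill the buffer with the same doc IDs; the second simply skips the counting loop and relies on the buffer already being large enough for any window.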
   
   ## Benchmark Results
   
   JMH benchmark on JDK 25, Apple M-series (higher is better):
   
   ```
   Benchmark                         (matchCount)   Mode  Cnt  Score  Units
   oldCardinalityForEach (before)              50  thrpt    3  6.809  ops/us
   newForEachNoCardinality (after)             50  thrpt    3  7.686  ops/us  → +12.9% faster

   oldCardinalityForEach (before)             128  thrpt    3  3.044  ops/us
   newForEachNoCardinality (after)            128  thrpt    3  3.170  ops/us  → +4.1% faster

   oldCardinalityForEach (before)             500  thrpt    3  0.466  ops/us
   newForEachNoCardinality (after)            500  thrpt    3  0.502  ops/us  → +7.7% faster

   oldCardinalityForEach (before)            1000  thrpt    3  0.242  ops/us
   newForEachNoCardinality (after)           1000  thrpt    3  0.234  ops/us  → ~same
   ```
   
   **4-13% improvement** at typical match densities (50-500 docs per 
window), which is the common range for multi-term BooleanQuery workloads. At 
1000 docs per window, throughput is roughly unchanged (0.234 vs. 0.242 ops/us).
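
The percentages can be rechecked from the raw throughput scores in the table. The helper below is a hypothetical illustration (the class and method names are not part of the benchmark), computing relative change as `(after / before - 1) * 100`.

```java
/** Recomputes relative throughput change from two JMH scores; illustrative helper only. */
final class SpeedupCheck {
  /** Returns the percent change from before to after (positive means faster for thrpt mode). */
  static double percentChange(double before, double after) {
    return (after / before - 1.0) * 100.0;
  }

  public static void main(String[] args) {
    // Scores copied from the table above, in ops/us.
    double[][] rows = {{50, 6.809, 7.686}, {128, 3.044, 3.170}, {500, 0.466, 0.502}, {1000, 0.242, 0.234}};
    for (double[] row : rows) {
      System.out.printf("matchCount=%d change=%.1f%%%n", (int) row[0], percentChange(row[1], row[2]));
    }
  }
}
```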
   
   ## Context
   
   This method is on the hot path for multi-clause BooleanQuery scoring:
   `IndexSearcher.search()` → `MaxScoreBulkScorer.score()` → 
`scoreInnerWindowMultipleEssentialClauses()`
   
   It is invoked for every 4096-doc inner window when a query has 2+ essential 
clauses. The `cardinality()` call was iterating all 64 longs of the bitset 
purely to determine a buffer size — work that can be avoided by pre-allocating 
to the maximum possible size.
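
The sizes quoted in this description follow from simple arithmetic: a 4096-bit window is backed by 4096 / 64 = 64 longs, and a worst-case buffer of one `int[]` plus one `double[]` holds 4096 × (4 + 8) = 49,152 bytes, about 48 KB of payload. A small sketch of that calculation, with a hypothetical class name and ignoring array object headers:

```java
/** Sizing arithmetic only; constants mirror the values quoted in the description. */
final class WindowSizingSketch {
  static final int INNER_WINDOW_SIZE = 4096;

  /** Number of 64-bit words backing a window-sized bitset. */
  static int bitsetWords() {
    return INNER_WINDOW_SIZE / Long.SIZE;
  }

  /** Payload bytes for one int[] plus one double[] of window size (array headers ignored). */
  static long bufferPayloadBytes() {
    return (long) INNER_WINDOW_SIZE * (Integer.BYTES + Double.BYTES);
  }
}
```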
   
   An `intoArray()`-based approach was also evaluated but was slower for 
sparse windows (10-128 matches) because it scans empty words. Of the approaches 
tried, `forEach()` with pre-allocation performed best overall.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

