heng-kuang-777 opened a new issue, #17809:
URL: https://github.com/apache/pinot/issues/17809

   ### Problem
   On consuming segments, Lucene text indexes are near-realtime: recently 
ingested documents may not be visible to the `IndexSearcher` until the next 
`SearcherManager` refresh. When `NOT(TEXT_MATCH(...))` is evaluated, these 
not-yet-searchable documents incorrectly appear in the result as false positives.
   Example: Segment has 1000 docs, Lucene refreshed up to doc 950, matches for 
'error' = {10, 42, 300}.
   Expected: [0, 950) - matches = 947 docs (only docs Lucene has evaluated)
   Actual: [0, 1000) - matches = 997 docs (docs 950–999 are false positives)
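
   The arithmetic in this example can be reproduced with a small standalone sketch. It uses `java.util.BitSet` as a stand-in for the bitmaps Pinot uses internally, and the class and method names are hypothetical, chosen only for illustration:

```java
import java.util.BitSet;

// Hypothetical helper, not part of Pinot: shows how the choice of
// inversion universe changes the result of NOT(TEXT_MATCH(...)).
final class NotTextMatchUniverse {

    /** Returns the doc IDs in [0, universe) that are not set in matches. */
    static BitSet invert(BitSet matches, int universe) {
        BitSet result = new BitSet(universe);
        result.set(0, universe);   // start with every doc in the universe
        result.andNot(matches);    // drop the docs that matched
        return result;
    }

    /** The matches from the example above: docs 10, 42 and 300. */
    static BitSet exampleMatches() {
        BitSet matches = new BitSet();
        matches.set(10);
        matches.set(42);
        matches.set(300);
        return matches;
    }

    public static void main(String[] args) {
        int numDocs = 1000;        // docs in the consuming segment
        int searchableDocs = 950;  // docs visible to Lucene at the last refresh
        BitSet matches = exampleMatches();

        // Inverting over the full segment counts docs 950..999 as non-matches.
        System.out.println(invert(matches, numDocs).cardinality());        // 997
        // Inverting over the searchable range counts only evaluated docs.
        System.out.println(invert(matches, searchableDocs).cardinality()); // 947
    }
}
```

   The 50 extra docs in the first result are exactly the IDs in [950, 1000) that Lucene never evaluated.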
   This affects both execution paths:
   * Row materialization (`SELECT ... WHERE NOT TEXT_MATCH(...)`): 
`NotFilterOperator.getTrues()` calls `TextMatchFilterOperator.getFalses()`, 
which inverts over [0, numDocs).
   * Optimized count (`SELECT COUNT(*) WHERE NOT TEXT_MATCH(...)`): 
`NotFilterOperator.getBitmaps()` uses `numDocs` as the universe for 
`BitmapCollection` inversion.
   
   ### Root Cause
   `TextMatchFilterOperator` reports `numDocs` (the full segment doc count) as 
its inversion universe in both `getFalses()` and `getBitmaps()`. On consuming 
segments, however, Lucene can only search the docs that were present at its 
last refresh. The inversion assumes every doc in [0, numDocs) has been 
evaluated, which does not hold for docs ingested after that refresh.
   
   ### Proposed Fix
   * Introduce a searchable doc fence: add a `getSearchableDocCount()` default 
method to `TextIndexReader` that returns -1, meaning all docs are searchable.
   * `RealtimeLuceneTextIndex` overrides it using the existing 
`RealtimeLuceneRefreshListener._lastRefreshNumDocs`, which already tracks the 
doc count at each Lucene refresh (currently only used for delay metrics).
   * In `TextMatchFilterOperator`:
     1. Store `_searchableDocCount` (resolved from the reader, falling back to 
`numDocs` when the reader returns -1)
     2. Override `getFalses()` to invert only over [0, `_searchableDocCount`) 
instead of [0, numDocs)
     3. Update `getBitmaps()` to use `_searchableDocCount` as the universe for 
`BitmapCollection`
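
   A minimal sketch of how these pieces could fit together follows. It is not the actual Pinot code: the `*Sketch` types are hypothetical stand-ins, `java.util.BitSet` replaces Pinot's bitmap types, and the `getMatchingDocIds` signature and the clamp to `numDocs` are assumptions made to keep the example self-contained. Only `getSearchableDocCount()`, the -1 convention, `_lastRefreshNumDocs` and `_searchableDocCount` come from the proposal.

```java
import java.util.BitSet;

/** Hypothetical stand-in for TextIndexReader; only the proposed method is modeled. */
interface TextIndexReaderSketch {
    // Simplified, assumed signature: returns the doc IDs matching the query.
    BitSet getMatchingDocIds(String query);

    /** Proposed default: -1 means every doc in the segment is searchable. */
    default int getSearchableDocCount() {
        return -1;
    }
}

/** Hypothetical stand-in for a realtime reader that knows its last refresh point. */
final class RealtimeReaderSketch implements TextIndexReaderSketch {
    private final BitSet _matches = new BitSet();
    // In the proposal this value would come from the refresh listener.
    private volatile int _lastRefreshNumDocs;

    RealtimeReaderSketch(int lastRefreshNumDocs, int... matchingDocIds) {
        _lastRefreshNumDocs = lastRefreshNumDocs;
        for (int docId : matchingDocIds) {
            _matches.set(docId);
        }
    }

    @Override
    public BitSet getMatchingDocIds(String query) {
        return (BitSet) _matches.clone();   // the query is ignored in this sketch
    }

    @Override
    public int getSearchableDocCount() {
        return _lastRefreshNumDocs;
    }
}

/** Hypothetical stand-in for TextMatchFilterOperator. */
final class TextMatchFilterSketch {
    private final TextIndexReaderSketch _reader;
    private final String _query;
    private final int _searchableDocCount;

    TextMatchFilterSketch(TextIndexReaderSketch reader, String query, int numDocs) {
        _reader = reader;
        _query = query;
        int fence = reader.getSearchableDocCount();
        // Fall back to numDocs when the reader reports -1. The clamp to
        // numDocs is a defensive assumption, not part of the proposal.
        _searchableDocCount = fence < 0 ? numDocs : Math.min(fence, numDocs);
    }

    BitSet getTrues() {
        return _reader.getMatchingDocIds(_query);
    }

    /** Inverts only over [0, _searchableDocCount) instead of [0, numDocs). */
    BitSet getFalses() {
        BitSet falses = new BitSet(_searchableDocCount);
        falses.set(0, _searchableDocCount);
        falses.andNot(getTrues());
        return falses;
    }
}
```

   With the example from the Problem section, `new TextMatchFilterSketch(new RealtimeReaderSketch(950, 10, 42, 300), "error", 1000).getFalses()` contains 947 docs and none at or above 950. A reader that keeps the default -1 behaves as before and inverts over all 1000 docs.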


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

