heng-kuang-777 opened a new issue, #17809:
URL: https://github.com/apache/pinot/issues/17809
### Problem
On consuming segments, Lucene text indexes are near-realtime: recently
ingested documents are not visible to the `IndexSearcher` until the next
`SearcherManager` refresh. When `NOT(TEXT_MATCH(...))` is evaluated, these
not-yet-searchable documents incorrectly appear in the result as false positives.
Example: a segment has 1000 docs, Lucene has refreshed up to doc 950, and the
matches for 'error' are {10, 42, 300}.
* Expected: [0, 950) minus matches = 947 docs (only docs Lucene has evaluated)
* Actual: [0, 1000) minus matches = 997 docs (docs 950–999 are false positives)
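
The arithmetic in the example can be reproduced with a minimal sketch. It uses `java.util.BitSet` as a stand-in for Pinot's doc-ID bitmaps; the class and method names (`InversionUniverseDemo`, `invert`) are hypothetical and not part of Pinot.

```java
import java.util.BitSet;

// Illustrative only: BitSet stands in for the doc-ID bitmaps used by the
// filter operators. Nothing here is Pinot's actual code.
class InversionUniverseDemo {

    // Returns the doc IDs in [0, universe) that are NOT in matches.
    static BitSet invert(BitSet matches, int universe) {
        BitSet result = new BitSet(universe);
        result.set(0, universe);   // every doc ID in [0, universe)
        result.andNot(matches);    // drop the docs that matched
        return result;
    }

    public static void main(String[] args) {
        int numDocs = 1000;            // docs ingested into the consuming segment
        int searchableDocCount = 950;  // docs visible to Lucene at the last refresh

        BitSet matches = new BitSet();
        matches.set(10);
        matches.set(42);
        matches.set(300);

        // Inverting over the full segment counts docs 950-999 as non-matches.
        System.out.println(invert(matches, numDocs).cardinality());            // 997
        // Inverting over the searchable range counts only evaluated docs.
        System.out.println(invert(matches, searchableDocCount).cardinality()); // 947
    }
}
```

The only difference between the two results is the universe passed to the inversion, which is the quantity the rest of this issue is about.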
This affects both execution paths:
* Row materialization (`SELECT ... WHERE NOT TEXT_MATCH(...)`):
`NotFilterOperator.getTrues()` calls `TextMatchFilterOperator.getFalses()`,
which inverts over [0, numDocs).
* Optimized count (`SELECT COUNT(*) WHERE NOT TEXT_MATCH(...)`):
`NotFilterOperator.getBitmaps()` uses `numDocs` as the universe for
`BitmapCollection` inversion.
### Root Cause
`TextMatchFilterOperator` reports `numDocs` (the full segment doc count) as
its inversion universe in both `getFalses()` and `getBitmaps()`. On consuming
segments, however, Lucene can only search a subset of those docs. The inversion
assumes all docs in [0, numDocs) have been evaluated, which does not hold for
docs ingested since the last Lucene refresh.
### Proposed Fix
* Introduce a searchable doc fence: add a `getSearchableDocCount()` default
method to `TextIndexReader` (returns -1 = all docs searchable).
* `RealtimeLuceneTextIndex` overrides it using the existing
`RealtimeLuceneRefreshListener._lastRefreshNumDocs`, which already tracks the
doc count at each Lucene refresh (currently only used for delay metrics).
* In `TextMatchFilterOperator`:
    1. Store `_searchableDocCount` (resolved from the reader, falling back to
       `numDocs` when -1).
    2. Override `getFalses()` to invert only over [0, searchableDocCount)
       instead of [0, numDocs).
    3. Update `getBitmaps()` to use `searchableDocCount` as the universe for
       `BitmapCollection`.
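
A rough sketch of how the proposed fence could fit together, under simplifying assumptions. The types below (`SketchTextIndexReader`, `SketchRealtimeReader`, `SketchTextMatchFilter`) are hypothetical stand-ins for `TextIndexReader`, `RealtimeLuceneTextIndex`, and `TextMatchFilterOperator`; `java.util.BitSet` replaces Pinot's bitmap types, and the `getMatchingDocIds` method and constructor shapes are invented for illustration only.

```java
import java.util.BitSet;

// Hypothetical, simplified stand-in for TextIndexReader.
interface SketchTextIndexReader {
    BitSet getMatchingDocIds(String query);

    // -1 means "all docs are searchable" (the default for immutable segments).
    default int getSearchableDocCount() {
        return -1;
    }
}

// Hypothetical stand-in for the realtime reader: it reports the doc count
// recorded at the last refresh instead of the default -1.
class SketchRealtimeReader implements SketchTextIndexReader {
    private final BitSet matches;
    private final int lastRefreshNumDocs;

    SketchRealtimeReader(BitSet matches, int lastRefreshNumDocs) {
        this.matches = matches;
        this.lastRefreshNumDocs = lastRefreshNumDocs;
    }

    @Override
    public BitSet getMatchingDocIds(String query) {
        return (BitSet) matches.clone();
    }

    @Override
    public int getSearchableDocCount() {
        return lastRefreshNumDocs;
    }
}

// Hypothetical stand-in for the filter operator.
class SketchTextMatchFilter {
    private final SketchTextIndexReader reader;
    private final String query;
    private final int searchableDocCount;

    SketchTextMatchFilter(SketchTextIndexReader reader, String query, int numDocs) {
        this.reader = reader;
        this.query = query;
        int fence = reader.getSearchableDocCount();
        // Fall back to numDocs when the reader reports -1.
        this.searchableDocCount = fence < 0 ? numDocs : fence;
    }

    BitSet getTrues() {
        return reader.getMatchingDocIds(query);
    }

    // Invert only over [0, searchableDocCount) rather than [0, numDocs).
    BitSet getFalses() {
        BitSet falses = new BitSet(searchableDocCount);
        falses.set(0, searchableDocCount);
        falses.andNot(getTrues());
        return falses;
    }
}
```

With the numbers from the example above (matches {10, 42, 300}, fence at 950, `numDocs` at 1000), `getFalses()` in this sketch yields 947 docs for the realtime reader, while a reader that keeps the default -1 still inverts over all 1000 docs and yields 997.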