[PR] [feature](inverted index) add token-exists Bloom Filter absent-term fast path [doris]

via GitHub Mon, 08 Jun 2026 03:44:59 -0700


airborne12 opened a new pull request, #64229:
URL: https://github.com/apache/doris/pull/64229


   ### What problem does this PR solve?
   
   Issue Number: N/A
   
   Related PR: N/A
   
   Problem Summary:
   
   In storage-compute separation, querying an exact term that does **not** exist in a
   segment still pays the full searcher open -- open the compound reader, materialize
   `.tii` into memory, read `null_bitmap`, probe `.tis` -- only to discover the term has
   no postings. On S3 that is a wasted remote round-trip per segment.
   
   This PR adds an optional, CLucene-compatible (no storage-format-version bump),
   **default-off** token-exists Bloom Filter:
   
   - A self-describing `"tbf"` sub-file inside the compound `.idx` records which analyzed
     tokens exist in the segment's term dictionary, fed from the term dictionary itself
     (no re-tokenization, zero inconsistency). On query, an ABSENT verdict short-circuits
     to an empty bitmap before any searcher-open IO. The BF guarantees no false negatives,
     so absent -> empty is always correct; never-drop-results guardrails (A1 phrase
     position grouping, A2 multi-term-slot OR, A3 analyzer-signature staleness, A4 keyword
     path, A5 empty keyword token) fall back to the normal lookup on any uncertainty.
   - An LRU cache of the parsed BF per (segment, index), so a warm absent query does zero IO.
   - Query-profile observability: headline `InvertedIndexTermBfSkippedLookups` (lookups the
     BF short-circuited) + `InvertedIndexTermBfProbe` (denominator for hit rate) +
     `InvertedIndexTermBfUnavailable` (no usable tbf), plus level-2 diagnostics (cache
     hit/miss, cold load IO, fall-throughs).
   - An env-gated fpp sweep (analysis tool, never runs in CI) used to justify keeping the
     default `fpp = 0.01`.
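
The absent-term fast path described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PR's C++ implementation: `TokenBloomFilter`, `lookup`, and `open_searcher` are hypothetical names, the SHA-256 double-hashing scheme is a stand-in (the PR does not describe how the `tbf` sub-file hashes or lays out its bits), and the empty-term check is a simplified stand-in for the A1-A5 guardrails. The sizing uses the standard Bloom filter formulas, which give roughly 9.6 bits per token and 7 hash functions at `fpp = 0.01`.

```python
import hashlib
import math


class TokenBloomFilter:
    """Toy token-exists Bloom filter (hypothetical; not the `tbf` on-disk layout)."""

    def __init__(self, expected_tokens: int, fpp: float = 0.01):
        n = max(1, expected_tokens)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-n * math.log(fpp) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, token: str):
        # Assumed scheme: double hashing over two 64-bit halves of one digest.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, token: str) -> bool:
        # False means definitely absent; True means "maybe present".
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(token))


def lookup(term, bf, open_searcher):
    """Return postings for `term`, skipping `open_searcher` on an absent verdict."""
    if bf is None or term == "":
        # No usable filter or an uncertain input: fall back to the normal lookup.
        return open_searcher(term)
    if not bf.might_contain(term):
        return set()  # absent verdict: empty result, searcher never opened
    return open_searcher(term)  # maybe present: pay the normal lookup
```

Because every token in the term dictionary is added to the filter, a token that exists can never get an absent verdict, so the early return cannot drop results. A false positive only costs the normal lookup, which then finds no postings.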
   
   Switches: index property `token_bloom_filter` (write) + BE config
   `enable_inverted_index_term_bf` (read, default `false`); BF cache sized by BE config
   `inverted_index_term_bf_cache_limit` (default `1%`).
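
A byte-budgeted LRU cache keyed by (segment, index) could look roughly like the sketch below. It is a minimal illustration, not the BE cache: `TermBfCache`, `get_or_load`, and the `loader` callback are hypothetical names, and how `inverted_index_term_bf_cache_limit` is translated into a byte budget is not specified in this PR, so the capacity is simply passed in as a number of bytes.

```python
from collections import OrderedDict


class TermBfCache:
    """Toy LRU cache of parsed filters keyed by (segment_id, index_id)."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> filter bytes, oldest first
        self.hits = 0
        self.misses = 0

    def get_or_load(self, segment_id, index_id, loader):
        key = (segment_id, index_id)
        if key in self._entries:
            # Warm path: no IO, just refresh recency.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        blob = loader()  # cold load: the only place IO would happen
        self._entries[key] = blob
        self.used_bytes += len(blob)
        # Evict least recently used entries until the budget is respected,
        # but always keep the entry that was just loaded.
        while self.used_bytes > self.capacity_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        return blob
```

The `hits` and `misses` counters mirror the kind of cache hit/miss diagnostics the PR exposes in the query profile. In this sketch a single filter larger than the whole budget is still kept until the next insert, which is a simplification.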
   
   Measured (instrumented UT, 1M-row segment): absent `MATCH_ALL`/`MATCH_PHRASE` read_at
   8 -> 1; present queries unchanged (84 -> 84); warm absent query 0 sub-file reads.
   
   ### Release note
   
   Add an optional token-exists Bloom Filter for inverted indexes that fast-paths absent
   exact-term queries (skips the searcher open / index IO) under storage-compute
   separation. Opt-in via index property `token_bloom_filter` and BE config
   `enable_inverted_index_term_bf` (default off). No storage format change.
   
   ### Check List (For Author)
   
   - Test
       - [x] Unit Test
   - Behavior changed:
       - [x] No. <!-- opt-in, default-off; existing queries are unaffected when the property/config are not set -->
   - Does this need documentation?
       - [x] Yes. <!-- a doc PR will follow once the feature graduates from default-off; a design doc is kept internally for now -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

