acvictor opened a new pull request, #11711:
URL: https://github.com/apache/incubator-gluten/pull/11711

   ## What changes are proposed in this pull request?
   This PR adds support for pushing `might_contain(bloomFilter, value)` down 
into Velox's subfield filter system via `SparkExprToSubfieldFilterParser`. 
Previously, `might_contain` was evaluated as a post-scan expression. With this 
change, the bloom filter check can be applied at the storage scan level, 
allowing entire row groups to be skipped before data is fully decoded.
   
   Velox has two incompatible bloom filter implementations:
      - BloomFilter: used by `bloom_filter_agg` / `might_contain` 
(groups-of-64-bits, 4 hash functions)
      - SplitBlockBloomFilter: used by the existing 
`BigintValuesUsingBloomFilter` filter class (SIMD split-block)
   
   Since these are not interchangeable, a new `SparkBloomFilter` filter class 
is introduced that wraps the serialized `BloomFilter<>` data and implements 
`testInt64()` using `BloomFilter<>::mayContain()` with 
`folly::hasher<int64_t>()`, which is the same code path used by the JNI 
`mightContainLongOnSerializedBloom`.
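
   As a rough illustration of the idea (not the code in this PR), the sketch 
below shows a scan-level filter that wraps a bloom filter and exposes a 
`testInt64()` check. All names here (`ToyBloomFilter`, `ToyBloomScanFilter`, 
`hashInt64`, `makeFilter`, `scanWithFilter`) are hypothetical, and both the bit 
layout and the hash function are simplified stand-ins, not Velox's 
`BloomFilter<>` or `folly::hasher`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in 64-bit mixer (assumption: NOT folly::hasher<int64_t>).
inline uint64_t hashInt64(int64_t value) {
  uint64_t x = static_cast<uint64_t>(value);
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Toy bloom filter: each value maps to one 64-bit word and sets up to 4 bits
// in it. Illustrative only; this is not Velox's serialized layout.
class ToyBloomFilter {
 public:
  explicit ToyBloomFilter(size_t numWords) : words_(numWords, 0) {
    assert(numWords > 0);
  }

  void insert(uint64_t hash) {
    words_[wordIndex(hash)] |= bitMask(hash);
  }

  // True if all bits for this hash are set (may be a false positive).
  bool mayContain(uint64_t hash) const {
    const uint64_t mask = bitMask(hash);
    return (words_[wordIndex(hash)] & mask) == mask;
  }

 private:
  size_t wordIndex(uint64_t hash) const {
    return static_cast<size_t>((hash >> 24) % words_.size());
  }

  // Four 6-bit slices of the hash pick four bit positions within the word.
  static uint64_t bitMask(uint64_t hash) {
    uint64_t mask = 0;
    for (int i = 0; i < 4; ++i) {
      mask |= 1ULL << ((hash >> (6 * i)) & 63);
    }
    return mask;
  }

  std::vector<uint64_t> words_;
};

// Hypothetical scan-level filter: the reader calls testInt64() per value and
// drops values that the bloom filter rules out.
class ToyBloomScanFilter {
 public:
  explicit ToyBloomScanFilter(ToyBloomFilter bloom) : bloom_(std::move(bloom)) {}

  bool testInt64(int64_t value) const {
    return bloom_.mayContain(hashInt64(value));
  }

 private:
  ToyBloomFilter bloom_;
};

// Builds a filter from the values on the "build" side of the query.
inline ToyBloomScanFilter makeFilter(
    const std::vector<int64_t>& values, size_t numWords = 64) {
  ToyBloomFilter bloom(numWords);
  for (int64_t v : values) {
    bloom.insert(hashInt64(v));
  }
  return ToyBloomScanFilter(std::move(bloom));
}

// Simulates a column scan that applies the filter while reading.
inline std::vector<int64_t> scanWithFilter(
    const std::vector<int64_t>& column, const ToyBloomScanFilter& filter) {
  std::vector<int64_t> kept;
  for (int64_t v : column) {
    if (filter.testInt64(v)) {
      kept.push_back(v);
    }
  }
  return kept;
}
```

   The property that matters for pushdown is that insertion and lookup use 
the same hash and bit layout, so an inserted value is never rejected by 
`testInt64()`; values that were not inserted may still occasionally pass as 
false positives.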
   
   ## How was this patch tested?
   Added a new test suite covering basic filtering, a null bloom filter, 
negation, non-column values, range tests, and clone behavior.
   
   ## Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

