Hi,

I noticed that it finally landed. We actually began tracking that JIRA while initially writing Hudi at Uber. Parquet + Bloom Filters has taken just a few years :)

I think we could switch over to reading the built-in bloom filters as well; it could potentially make the footer reading lighter.
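For context on the dynamic sizing I mention below, here is a minimal sketch of the textbook bloom filter sizing math - not Hudi's actual code or config names, just the standard formulas:

// Sketch only: standard bloom filter sizing formulas, not Hudi's implementation.
object BloomFilterSizing {
  // Optimal number of bits for numEntries records at a target false
  // positive probability (fpp): m = -n * ln(p) / (ln 2)^2
  def optimalNumBits(numEntries: Long, fpp: Double): Long =
    math.ceil(-numEntries * math.log(fpp) / (math.log(2) * math.log(2))).toLong

  // Optimal number of hash functions: k = (m / n) * ln 2
  def optimalNumHashes(numEntries: Long, numBits: Long): Int =
    math.max(1, math.round((numBits.toDouble / numEntries) * math.log(2)).toInt)

  // False positive rate a fixed-size filter actually delivers once it holds
  // numEntries records: (1 - e^(-k*n/m))^k. This is what degrades when a
  // statically sized filter is overfilled, and what dynamic sizing keeps
  // close to the configured fpp target.
  def actualFpp(numEntries: Long, numBits: Long, numHashes: Int): Double =
    math.pow(1 - math.exp(-numHashes.toDouble * numEntries / numBits), numHashes)
}

For example, 1M entries at a 1% fpp target works out to roughly 1.2 MB of bits and 7 hash functions; write 2M entries into that same fixed-size filter and the actual fpp climbs to roughly 15%.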
A few things that Hudi has built on top would be missing:

- Dynamic bloom filter support, where we auto-size the current bloom filters based on the number of records, given an fpp target (the textbook sizing math sketched above).
- Our current DAG that optimizes checking records against bloom filters is still needed on the writer side. Checking bloom filters for a single predicate, e.g. id=19, is much simpler than matching, say, 100k ids against 1000 files. We need to be able to amortize the cost of those 100M (100k x 1000) comparisons.

On the future direction: with 0.11, we are enabling storing of bloom filters and column ranges inside the Hudi metadata table (MDT), what we call multi-modal indexes. This makes the access more resilient to cloud storage throttling and also more performant (we need to read far fewer files). Over time, when this mechanism is stable, we plan to stop writing out bloom filters in parquet and also integrate the Hudi MDT with different query engines for point-ish lookups.

Hope that helps.

Thanks,
Vinoth

On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris <[email protected]> wrote:

> Hi,
>
> spark 3.2 ships parquet 1.12, which provides built-in bloom filters on
> arbitrary columns. I wonder:
>
> - can hudi benefit from them? (likely in 0.11, but not with MOR tables)
> - would it make sense to replace the hudi blooms with them?
> - what would be the advantage of storing our blooms in hfiles (AFAIK
>   this is the future expected implementation) over the parquet built-in?
>
> here is the syntax:
>
> .option("parquet.bloom.filter.enabled#favorite_color", "true")
> .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
>
> and here is some code to illustrate:
>
> https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
>
> thx
