Hi,

I noticed that it finally landed. We actually began tracking that JIRA while initially writing Hudi at Uber. Parquet + Bloom Filters has taken just a few years :)

I think we could switch over to reading the built-in bloom filters as well; it could potentially make the footer reading lighter.
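For context on the dynamic sizing I mention below, here is a minimal sketch of the textbook bloom filter sizing math - not Hudi's actual code or config names, just the standard formulas:

// Sketch only: standard bloom filter sizing formulas, not Hudi's implementation.
object BloomFilterSizing {
  // Optimal number of bits for numEntries records at a target false
  // positive probability (fpp): m = -n * ln(p) / (ln 2)^2
  def optimalNumBits(numEntries: Long, fpp: Double): Long =
    math.ceil(-numEntries * math.log(fpp) / (math.log(2) * math.log(2))).toLong

  // Optimal number of hash functions: k = (m / n) * ln 2
  def optimalNumHashes(numEntries: Long, numBits: Long): Int =
    math.max(1, math.round((numBits.toDouble / numEntries) * math.log(2)).toInt)

  // False positive rate a fixed-size filter actually delivers once it holds
  // numEntries records: (1 - e^(-k*n/m))^k. This is what degrades when a
  // statically sized filter is overfilled, and what dynamic sizing keeps
  // close to the configured fpp target.
  def actualFpp(numEntries: Long, numBits: Long, numHashes: Int): Double =
    math.pow(1 - math.exp(-numHashes.toDouble * numEntries / numBits), numHashes)
}

For example, 1M entries at a 1% fpp target works out to roughly 1.2 MB of bits and 7 hash functions; write 2M entries into that same fixed-size filter and the actual fpp climbs to roughly 15%.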
A few things that Hudi has built on top would be missing:

- Dynamic bloom filter support, where we auto-size the current bloom filters based on the number of records, given an fpp target (the textbook sizing math sketched above).
- Our current DAG that optimizes checking records against bloom filters is still needed on the writer side. Checking bloom filters for a single predicate, e.g. id=19, is much simpler than matching, say, 100k ids against 1000 files. We need to be able to amortize the cost of those 100M (100k x 1000) comparisons.

On the future direction: with 0.11, we are enabling storing of bloom filters and column ranges inside the Hudi metadata table (MDT), what we call multi-modal indexes. This makes the access more resilient to cloud storage throttling and also more performant (we need to read far fewer files). Over time, when this mechanism is stable, we plan to stop writing out bloom filters in parquet and also integrate the Hudi MDT with different query engines for point-ish lookups.

Hope that helps.

Thanks,
Vinoth

On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris <[email protected]> wrote:

> Hi,
>
> spark 3.2 ships parquet 1.12, which provides built-in bloom filters on
> arbitrary columns. I wonder:
>
> - can hudi benefit from them? (likely in 0.11, but not with MOR tables)
> - would it make sense to replace the hudi blooms with them?
> - what would be the advantage of storing our blooms in hfiles (AFAIK
>   this is the future expected implementation) over the parquet built-in?
>
> here is the syntax:
>
> .option("parquet.bloom.filter.enabled#favorite_color", "true")
> .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
>
> and here is some code to illustrate:
>
> https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
>
> thx
