Re: spark 3.2.1 built-in bloom filters

Nicolas Paris Thu, 19 May 2022 08:50:51 -0700

Now that we have Hudi 0.11 with multi-column bloom indexes through
`hoodie.metadata.index.bloom.filter.column.list`, the question is whether
those blooms are used by the query planner for a predicate such as id=19.
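
To make the question concrete, here is a minimal sketch of what "used by the
query planner" would mean for id=19: one bloom filter per data file, consulted
to skip files that cannot contain the key. This is not Hudi's or Spark's
actual code. The class, the function, and the file names are hypothetical, and
the sizing uses the textbook formula for a given expected key count and fpp
target.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter sized from an expected key count and a target FPP."""

    def __init__(self, expected_keys: int, fpp: float = 0.01):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n = max(1, expected_keys)
        self.num_bits = max(8, math.ceil(-n * math.log(fpp) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting a SHA-256 digest with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def prune_files(filters, key):
    """Keep only the files whose filter cannot rule the key out."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


# Hypothetical data files, each holding a disjoint range of ids.
files = {
    "file_a.parquet": range(0, 100),
    "file_b.parquet": range(100, 200),
    "file_c.parquet": range(200, 300),
}
filters = {}
for name, ids in files.items():
    bf = BloomFilter(expected_keys=len(ids), fpp=0.01)
    for record_id in ids:
        bf.add(record_id)
    filters[name] = bf

# Files a planner would still have to scan for the predicate id = 19.
candidates = prune_files(filters, 19)
```

A bloom filter never gives a false negative, so the file that really holds
id=19 is always kept; other files are dropped except for occasional false
positives at roughly the configured fpp.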


The Spark built-in blooms are used in this case; maybe that is also the
purpose of the Hudi multi-bloom? (There is no mention of their use.)
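
For comparison, the two ways of declaring bloom columns could be sketched as
plain option dictionaries. The keys come from this thread; the helper names
are hypothetical, and I am assuming the Hudi column list is comma-separated.

```python
def parquet_bloom_options(columns_ndv):
    """Per-column Parquet bloom options, using the `key#column` syntax quoted below."""
    opts = {}
    for column, ndv in columns_ndv.items():
        opts[f"parquet.bloom.filter.enabled#{column}"] = "true"
        opts[f"parquet.bloom.filter.expected.ndv#{column}"] = str(ndv)
    return opts


def hudi_bloom_index_options(columns):
    """Hudi 0.11 metadata-table bloom index columns (assumed comma-separated)."""
    return {"hoodie.metadata.index.bloom.filter.column.list": ",".join(columns)}


parquet_opts = parquet_bloom_options({"favorite_color": 1000000})
hudi_opts = hudi_bloom_index_options(["id", "favorite_color"])
```

With PySpark, either dictionary could then be passed to the writer via
`df.write.options(**parquet_opts)`.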


thanks




On Wed Mar 30, 2022 at 11:36 PM CEST, Vinoth Chandar wrote:
> Hi,
>
> I noticed that it finally landed. We actually began tracking that JIRA
> while initially writing Hudi at Uber.. Parquet + Bloom Filters has taken
> just a few years :)
> I think we could switch out to reading the built-in bloom filters as well.
> It could potentially make the footer reading lighter.
>
> A few things that Hudi has built on top would be missing:
>
> - Dynamic bloom filter support, where we auto-size current bloom filters
>   based on the number of records, given an fpp target.
> - Our current DAG that optimizes for checking records against bloom
>   filters is still needed on the writer side. Checking bloom filters for a
>   given predicate, e.g. id=19, is much simpler than matching, say, 100k ids
>   against 1000 files. We need to be able to amortize the cost of these
>   100M comparisons.
>
> On the future direction, with 0.11, we are enabling storing of bloom
> filters and column ranges inside the Hudi metadata table (MDT) (what we
> call multi-modal indexes).
> This helps us make the access more resilient towards cloud storage
> throttling and also more performant (we need to read far fewer files).
>
> Over time, when this mechanism is stable, we plan to stop writing out
> bloom filters in parquet and also integrate the Hudi MDT with different
> query engines for point-ish lookups.
>
> Hope that helps
>
> Thanks
> Vinoth
>
>
>
>
> On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris <[email protected]>
> wrote:
>
> > Hi,
> >
> > Spark 3.2 ships parquet 1.12, which provides built-in bloom filters on
> > arbitrary columns. I wonder:
> >
> > - can hudi benefit from them? (likely in 0.11, but not with MOR tables)
> > - would it make sense to replace the hudi blooms with them?
> > - what would be the advantage of storing our blooms in hfiles (AFAIK
> >   this is the expected future implementation) over the parquet built-in?
> >
> >
> > here is the syntax:
> >
> >     .option("parquet.bloom.filter.enabled#favorite_color", "true")
> >     .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
> >
> >
> > and here some code to illustrate :
> >
> >
> > https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
> >
> >
> >
> > thx
> >
