spark 3.2.1 built-in bloom filters

Nicolas Paris Mon, 28 Mar 2022 09:57:41 -0700

Hi,

Spark 3.2 ships Parquet 1.12, which provides built-in bloom filters on
arbitrary columns. I wonder:


- Can Hudi benefit from them? (Likely in 0.11, but not with MOR tables.)
- Would it make sense to replace the Hudi blooms with them?
- What would be the advantage of storing our blooms in HFiles (AFAIK
  this is the expected future implementation) over the Parquet built-in
  ones?


here is the syntax:

    .option("parquet.bloom.filter.enabled#favorite_color", "true")
    .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")


and here is some code to illustrate:

https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
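
For several columns, the per-column option keys can be generated rather than typed by hand. This is only a minimal sketch: the helper name `bloom_filter_options`, the DataFrame `df` and the output path are hypothetical, and only the two option keys come from the syntax quoted in this message.

```python
from typing import Dict, Mapping


def bloom_filter_options(expected_ndv: Mapping[str, int]) -> Dict[str, str]:
    """Build per-column Parquet bloom filter writer options.

    `expected_ndv` maps a column name to its expected number of distinct
    values. Hypothetical helper: only the two option key prefixes are
    taken from the syntax quoted in the message.
    """
    options: Dict[str, str] = {}
    for column, ndv in expected_ndv.items():
        if ndv <= 0:
            raise ValueError(f"expected NDV for {column!r} must be positive")
        # Option keys use the "<property>#<column>" form shown in the message.
        options[f"parquet.bloom.filter.enabled#{column}"] = "true"
        options[f"parquet.bloom.filter.expected.ndv#{column}"] = str(ndv)
    return options


opts = bloom_filter_options({"favorite_color": 1_000_000})

# With a PySpark DataFrame `df` (assumed to exist) the options would be
# handed to the Parquet writer, for example:
#   df.write.options(**opts).parquet("/tmp/users_with_bloom")  # hypothetical path
```

Keeping the expected NDV next to the column name makes it easier to review the sizing hint for each column in one place.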



thx
