Hi,
Spark 3.2 ships Parquet 1.12, which provides built-in bloom filters on
arbitrary columns. I wonder:
- can Hudi benefit from them? (likely in 0.11, but not with MOR tables)
- would it make sense to replace the Hudi bloom filters with them?
- what would be the advantage of storing our bloom filters in HFiles (AFAIK
this is the expected future implementation) over the Parquet built-in ones?
Here is the syntax, set as options on the Spark writer:
.option("parquet.bloom.filter.enabled#favorite_color", "true")
.option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
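For several columns, the same two keys are repeated with a different
"#column" suffix. A minimal sketch of generating them, assuming a
hypothetical helper named bloom_filter_options (it is not part of Spark,
Parquet, or Hudi) and the per-column key layout shown above:

```python
from typing import Dict


def bloom_filter_options(expected_ndv_by_column: Dict[str, int]) -> Dict[str, str]:
    """Build per-column Parquet bloom filter writer options.

    Hypothetical helper for illustration only. It emits the two keys
    shown above for each column: one to enable the filter and one to
    give the expected number of distinct values (NDV).
    """
    options: Dict[str, str] = {}
    for column, ndv in expected_ndv_by_column.items():
        options[f"parquet.bloom.filter.enabled#{column}"] = "true"
        options[f"parquet.bloom.filter.expected.ndv#{column}"] = str(ndv)
    return options


# Same column and NDV as in the syntax above.
opts = bloom_filter_options({"favorite_color": 1_000_000})

# Print the options in the chained .option(...) form.
for key, value in opts.items():
    print(f'.option("{key}", "{value}")')
```

In PySpark the resulting dict could presumably be passed in one call, e.g.
df.write.options(**opts).parquet(path), where df and path are placeholders.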
And here is a Spark test that illustrates it:
https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
thx