parisni commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1936917209
Hi @bk-mz, thanks for your interest in Parquet bloom filters. We have [open documentation](https://github.com/apache/hudi/pull/9056/files) on bloom filters, which states:

> So bloom would be useful in either case (at the parquet file level) :
> - the column has no duplicates
> - the column number of unique values is more than 40k

If your column fits neither case, a Parquet bloom filter would only add overhead and would slow down a given query.

There are also [benchmarks on the Spark side](https://github.com/apache/spark/blob/master/sql/core/benchmarks/BloomFilterBenchmark-results.txt) that could be of interest.
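As a rough illustration of the rule of thumb quoted above, here is a minimal Python sketch that checks whether the values of one column in a single Parquet file meet either condition. The function name `parquet_bloom_worthwhile` and the constant `NDV_THRESHOLD` are hypothetical names chosen for this example and are not part of Hudi, Spark, or Parquet. Treating an empty column as "not worthwhile" is also an assumption.

```python
from typing import Hashable, Iterable

# Threshold taken from the documentation quoted above (40k unique values).
NDV_THRESHOLD = 40_000


def parquet_bloom_worthwhile(
    values: Iterable[Hashable],
    ndv_threshold: int = NDV_THRESHOLD,
) -> bool:
    """Hypothetical helper: apply the rule of thumb to one file's column values.

    Returns True if the column has no duplicates, or if its number of
    distinct values exceeds ndv_threshold.
    """
    values = list(values)
    if not values:
        # Assumption: an empty column gains nothing from a bloom filter.
        return False
    distinct = len(set(values))
    no_duplicates = distinct == len(values)
    return no_duplicates or distinct > ndv_threshold
```

For example, a column of unique identifiers passes the check, while a low-cardinality column with repeated values, such as a status flag, does not. This only restates the heuristic. It does not measure the actual read or write cost.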