parisni commented on issue #10511:
URL: https://github.com/apache/hudi/issues/10511#issuecomment-1936917209

   Hi @bk-mz, thanks for your interest in Parquet bloom filters. There is an 
[open documentation PR](https://github.com/apache/hudi/pull/9056/files) on 
bloom filters, which states:
   
   > So bloom would be useful in either case (at the parquet file level) :
   > - the column has no duplicates
   > - the column number of unique values is more than 40k
   
   If your column matches neither case, a Parquet bloom filter would only add 
overhead and would slow the query down.
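
   As a rough illustration, the two criteria quoted above could be checked for the values of one column in a single Parquet file as sketched below. The helper name `bloom_filter_likely_useful` and the constant `NDV_THRESHOLD` are made up for this example and are not part of Hudi, Spark, or Parquet; the 40k figure is taken from the quoted documentation, and the empty-column case is an assumption.

```python
from typing import Hashable, Iterable

# Threshold taken from the quoted documentation ("more than 40k" unique values).
NDV_THRESHOLD = 40_000


def bloom_filter_likely_useful(
    values: Iterable[Hashable], ndv_threshold: int = NDV_THRESHOLD
) -> bool:
    """Hypothetical helper: apply the two quoted criteria to one file's column values."""
    values = list(values)
    if not values:
        # Assumption: an empty column gains nothing from a bloom filter.
        return False
    distinct = len(set(values))
    no_duplicates = distinct == len(values)
    return no_duplicates or distinct > ndv_threshold


if __name__ == "__main__":
    # A column of unique ids meets the first criterion.
    print(bloom_filter_likely_useful(range(1_000)))
    # A low-cardinality column with repeats meets neither criterion.
    print(bloom_filter_likely_useful(["a", "b", "a"]))
```

   This only restates the rule of thumb in code; whether a bloom filter actually helps should still be confirmed by measuring real queries.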
   
   There are also [benchmarks on the Spark 
side](https://github.com/apache/spark/blob/master/sql/core/benchmarks/BloomFilterBenchmark-results.txt)
 that could be of interest.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.