Thanks, André, for raising this! +1 to adding write.parquet.bloom-filter-ndv.column.<col> to configure the NDV. For the “FPP without NDV” case, let’s defer to the Parquet community (error vs. ignore vs. default NDV); Iceberg will follow their decision. Would you like to start a thread on parquet-dev? Otherwise, I’m happy to do it.
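To make the “FPP without NDV” case concrete, here is a rough Python sketch of the behavior described in this thread. It is only an illustration under stated assumptions: it uses the textbook Bloom filter sizing formula, which is not necessarily what parquet-java uses for its split-block filter, and the function names are hypothetical.

```python
import math


def classic_bloom_bytes(ndv: int, fpp: float) -> int:
    """Textbook Bloom filter size, in bytes, for `ndv` distinct values.

    Assumption: the classic formula m = -n * ln(p) / (ln 2)^2 bits.
    parquet-java's split-block filter may size itself differently.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be > 0.0 and < 1.0")
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


def planned_bloom_bytes(max_bytes: int, fpp: float = 0.01, ndv=None) -> int:
    """Hypothetical model of the sizing decision discussed in the thread.

    Without an NDV the fpp is ignored and max_bytes is used as the size;
    with an NDV the filter is sized from (ndv, fpp), capped at max_bytes.
    """
    if ndv is None:
        return max_bytes  # fpp has no effect here
    return min(classic_bloom_bytes(ndv, fpp), max_bytes)


if __name__ == "__main__":
    one_mb = 1048576  # default of write.parquet.bloom-filter-max-bytes
    print(planned_bloom_bytes(one_mb, fpp=0.01))               # NDV unset
    print(planned_bloom_bytes(one_mb, fpp=0.01, ndv=100_000))  # NDV set
```

In this model, a column with only the fpp set always gets the full max-bytes filter, whereas supplying an NDV (for example through the proposed write.parquet.bloom-filter-ndv.column.<col> property) lets the fpp actually shrink the filter.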
Thanks,
Huaxin

On Wed, Sep 17, 2025 at 3:46 AM André Rosa <[email protected]> wrote:

> Hello everyone,
>
> while working on a parquet writer, I found an issue related to the bloom
> filter table properties.
>
> Currently, the iceberg specification
> <https://iceberg.apache.org/docs/latest/configuration/#write-properties>
> defines 3 table properties for configuring bloom filters:
>
> > write.parquet.bloom-filter-enabled.column.col1
> > (not set)
> > Hint to parquet to write a bloom filter for the column: 'col1'
> >
> > write.parquet.bloom-filter-max-bytes
> > 1048576 (1 MB)
> > The maximum number of bytes for a bloom filter bitset
> >
> > write.parquet.bloom-filter-fpp.column.col1
> > 0.01
> > The false positive probability for a bloom filter applied to 'col1' (must > 0.0 and < 1.0)
>
> Looking at the parquet-java implementation
> <https://github.com/apache/parquet-java/blob/36a5f9cf8c1ce2c19631a0ec376665c5e41ea215/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L179-L192>,
> the fpp value for a given column is ignored if the ndv for that column is
> not specified.
>
> Being that the iceberg spec does not define a property for this and that
> there is no default, the implementation always ignores the fpp property and
> uses the bloom-filter-max-bytes as the exact size instead
> <https://github.com/apache/parquet-java/blob/299b0aea128645312badc329479920ddf8736577/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L205-L217>
> (if the bloom filter is enabled for the column).
>
> My proposal is to define a new table property
> 'write.parquet.bloom-filter-ndv.column.col1' in the spec to enable
> configuring the ndv to use.
>
> In addition, it also should be discussed if not specifying the ndv but
> specifying the fpp should be a config "error" (or simply ignored like
> parquet-java is doing) or if it should use a default ndv instead.
>
> What do you think should be done regarding this?
>
> Best regards,
> André Rosa
