Thanks André for raising this!
+1 to adding write.parquet.bloom-filter-ndv.column.<col> to configure NDV.
For the “FPP without NDV” case, let’s defer to the Parquet community (error
vs ignore vs default NDV); Iceberg will follow their decision. Would you
like to start a thread on parquet-dev? If not, I’m happy to do it.
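
To make the two behaviours concrete, here is a rough Python sketch of the
sizing difference. It uses the textbook Bloom filter formula, not
parquet-java's exact block-split calculation, and it assumes that a
filter sized from an ndv is still capped at bloom-filter-max-bytes. The
function names are made up for illustration.

```python
import math
from typing import Optional

# Default of write.parquet.bloom-filter-max-bytes (1 MB).
DEFAULT_MAX_BYTES = 1048576


def classic_bloom_bytes(ndv: int, fpp: float) -> int:
    """Textbook Bloom filter size in bytes: m = -n * ln(p) / (ln 2)^2 bits.

    This is the classic approximation, not parquet-java's formula.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be > 0.0 and < 1.0")
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


def sketch_filter_bytes(
    fpp: float, ndv: Optional[int], max_bytes: int = DEFAULT_MAX_BYTES
) -> int:
    """Hypothetical sizing rule mirroring the behaviour discussed here.

    Without an ndv the fpp has no effect and the filter takes max_bytes.
    With an ndv the size follows (ndv, fpp), capped at max_bytes
    (the cap is an assumption of this sketch).
    """
    if ndv is None:
        return max_bytes
    return min(classic_bloom_bytes(ndv, fpp), max_bytes)


if __name__ == "__main__":
    print(sketch_filter_bytes(0.01, None))     # fpp ignored, full max size
    print(sketch_filter_bytes(0.01, 100_000))  # sized from ndv and fpp
```

Under these assumptions, fpp=0.01 with ndv=100,000 needs about 120,000
bytes, while leaving the ndv unset always yields the full 1 MB, whatever
the fpp.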

Thanks,
Huaxin

On Wed, Sep 17, 2025 at 3:46 AM André Rosa <[email protected]>
wrote:

> Hello everyone,
> while working on a parquet writer, I found an issue related to the bloom
> filter table properties.
>
> Currently, the iceberg specification
> <https://iceberg.apache.org/docs/latest/configuration/#write-properties>
> defines 3 table properties for configuring bloom filters:
>
> Property:    write.parquet.bloom-filter-enabled.column.col1
> Default:     (not set)
> Description: Hint to parquet to write a bloom filter for the column:
>              'col1'
>
> Property:    write.parquet.bloom-filter-max-bytes
> Default:     1048576 (1 MB)
> Description: The maximum number of bytes for a bloom filter bitset
>
> Property:    write.parquet.bloom-filter-fpp.column.col1
> Default:     0.01
> Description: The false positive probability for a bloom filter applied
>              to 'col1' (must > 0.0 and < 1.0)
>
> Looking at the parquet-java implementation
> <https://github.com/apache/parquet-java/blob/36a5f9cf8c1ce2c19631a0ec376665c5e41ea215/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L179-L192>,
> the fpp value for a given column is ignored if the ndv for that column is
> not specified.
>
> Because the Iceberg spec does not define a property for the ndv and
> there is no default, the implementation always ignores the fpp property
> and instead uses bloom-filter-max-bytes as the exact filter size
> <https://github.com/apache/parquet-java/blob/299b0aea128645312badc329479920ddf8736577/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L205-L217>
> (if the bloom filter is enabled for the column).
>
>
> My proposal is to define a new table property
> 'write.parquet.bloom-filter-ndv.column.col1' in the spec to enable
> configuring the ndv to use.
>
> In addition, we should discuss whether specifying the fpp without the
> ndv should be a config "error", be silently ignored (as parquet-java
> does today), or fall back to a default ndv.
>
> What do you think should be done regarding this?
>
> Best regards,
> André Rosa
>
