laskoviymishka commented on code in PR #878:
URL: https://github.com/apache/iceberg-go/pull/878#discussion_r3074279308


##########
table/internal/parquet_files.go:
##########
@@ -259,8 +259,27 @@ func (parquetFormat) GetWriteProperties(props 
iceberg.Properties) any {
                slog.Warn("unrecognized compression codec, falling back to 
uncompressed", "codec", compression)
        }
 
-       return append(writerProps, parquet.WithCompression(codec),
+       writerProps = append(writerProps, parquet.WithCompression(codec),
                parquet.WithCompressionLevel(compressionLevel))
+
+       // Bloom filter properties.
+       // write.parquet.bloom-filter-max-bytes caps the per-column bloom 
filter size.
+       bloomMaxBytes := props.GetInt(ParquetBloomFilterMaxBytesKey, 
ParquetBloomFilterMaxBytesDefault)
+       writerProps = append(writerProps, 
parquet.WithMaxBloomFilterBytes(int64(bloomMaxBytes)))
+
+       // write.parquet.bloom-filter-enabled.column.<col-name> enables bloom 
filters
+       // for individual columns. Scan all properties for the prefix.
+       prefix := ParquetBloomFilterColumnEnabledKeyPrefix + "."
+       for key, val := range props {
+               colName, ok := strings.CutPrefix(key, prefix)
+               if !ok || colName == "" {
+                       continue
+               }
+               enabled := strings.EqualFold(val, "true")

Review Comment:
   This means "false", "0", "yes", "1", or any typo all silently map to false. 
   
   Matches Java's Boolean.parseBoolean, so it's **correct**. 
   
   But worth a one-line comment since Go developers would expect 
strconv.ParseBool here, and the difference is real (strconv.ParseBool("1") is 
true).



##########
table/internal/parquet_files_test.go:
##########
@@ -691,6 +691,33 @@ func TestGetWritePropertiesPageVersion(t *testing.T) {
        }
 }
 
+func TestGetWritePropertiesBloomFilter(t *testing.T) {
+       format := internal.GetFileFormat(iceberg.ParquetFile)
+
+       t.Run("default max bloom filter bytes", func(t *testing.T) {
+               wp := 
parquet.NewWriterProperties(format.GetWriteProperties(iceberg.Properties{}).([]parquet.WriterProperty)...)
+               assert.Equal(t, 
int64(internal.ParquetBloomFilterMaxBytesDefault), wp.MaxBloomFilterBytes())
+       })
+
+       t.Run("custom max bloom filter bytes", func(t *testing.T) {
+               props := iceberg.Properties{
+                       internal.ParquetBloomFilterMaxBytesKey: "2097152",
+               }
+               wp := 
parquet.NewWriterProperties(format.GetWriteProperties(props).([]parquet.WriterProperty)...)
+               assert.Equal(t, int64(2097152), wp.MaxBloomFilterBytes())
+       })
+
+       t.Run("per-column bloom filter enabled", func(t *testing.T) {
+               props := iceberg.Properties{
+                       internal.ParquetBloomFilterColumnEnabledKeyPrefix + 
".id":   "true",
+                       internal.ParquetBloomFilterColumnEnabledKeyPrefix + 
".name": "false",
+               }
+               wp := 
parquet.NewWriterProperties(format.GetWriteProperties(props).([]parquet.WriterProperty)...)
+               assert.True(t, wp.BloomFilterEnabledFor("id"))

Review Comment:
   Also assert that a column NOT mentioned in properties returns false (the 
parquet default):
   
   assert.False(t, wp.BloomFilterEnabledFor("unmentioned_col"))
   
   This pins the "no property = no bloom filter" contract, which is the 
important default behavior.



##########
table/internal/parquet_files.go:
##########
@@ -259,8 +259,27 @@ func (parquetFormat) GetWriteProperties(props 
iceberg.Properties) any {
                slog.Warn("unrecognized compression codec, falling back to 
uncompressed", "codec", compression)
        }
 
-       return append(writerProps, parquet.WithCompression(codec),
+       writerProps = append(writerProps, parquet.WithCompression(codec),

Review Comment:
   Nit: this is set unconditionally even when no bloom filters are enabled for 
any column. Harmless since parquet ignores it when no bloom filters are 
written, but slightly noisy. Not blocking.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to