laskoviymishka commented on code in PR #878:
URL: https://github.com/apache/iceberg-go/pull/878#discussion_r3074279308
##########
table/internal/parquet_files.go:
##########
@@ -259,8 +259,27 @@ func (parquetFormat) GetWriteProperties(props
iceberg.Properties) any {
slog.Warn("unrecognized compression codec, falling back to
uncompressed", "codec", compression)
}
- return append(writerProps, parquet.WithCompression(codec),
+ writerProps = append(writerProps, parquet.WithCompression(codec),
parquet.WithCompressionLevel(compressionLevel))
+
+ // Bloom filter properties.
+ // write.parquet.bloom-filter-max-bytes caps the per-column bloom
filter size.
+ bloomMaxBytes := props.GetInt(ParquetBloomFilterMaxBytesKey,
ParquetBloomFilterMaxBytesDefault)
+ writerProps = append(writerProps,
parquet.WithMaxBloomFilterBytes(int64(bloomMaxBytes)))
+
+ // write.parquet.bloom-filter-enabled.column.<col-name> enables bloom
filters
+ // for individual columns. Scan all properties for the prefix.
+ prefix := ParquetBloomFilterColumnEnabledKeyPrefix + "."
+ for key, val := range props {
+ colName, ok := strings.CutPrefix(key, prefix)
+ if !ok || colName == "" {
+ continue
+ }
+ enabled := strings.EqualFold(val, "true")
Review Comment:
This means "false", "0", "yes", "1", or any typo all silently map to false.
Matches Java's Boolean.parseBoolean, so it's **correct**.
But worth a one-line comment since Go developers would expect
strconv.ParseBool here, and the difference is real (strconv.ParseBool("1") is
true).
##########
table/internal/parquet_files_test.go:
##########
@@ -691,6 +691,33 @@ func TestGetWritePropertiesPageVersion(t *testing.T) {
}
}
+func TestGetWritePropertiesBloomFilter(t *testing.T) {
+ format := internal.GetFileFormat(iceberg.ParquetFile)
+
+ t.Run("default max bloom filter bytes", func(t *testing.T) {
+ wp :=
parquet.NewWriterProperties(format.GetWriteProperties(iceberg.Properties{}).([]parquet.WriterProperty)...)
+ assert.Equal(t,
int64(internal.ParquetBloomFilterMaxBytesDefault), wp.MaxBloomFilterBytes())
+ })
+
+ t.Run("custom max bloom filter bytes", func(t *testing.T) {
+ props := iceberg.Properties{
+ internal.ParquetBloomFilterMaxBytesKey: "2097152",
+ }
+ wp :=
parquet.NewWriterProperties(format.GetWriteProperties(props).([]parquet.WriterProperty)...)
+ assert.Equal(t, int64(2097152), wp.MaxBloomFilterBytes())
+ })
+
+ t.Run("per-column bloom filter enabled", func(t *testing.T) {
+ props := iceberg.Properties{
+ internal.ParquetBloomFilterColumnEnabledKeyPrefix +
".id": "true",
+ internal.ParquetBloomFilterColumnEnabledKeyPrefix +
".name": "false",
+ }
+ wp :=
parquet.NewWriterProperties(format.GetWriteProperties(props).([]parquet.WriterProperty)...)
+ assert.True(t, wp.BloomFilterEnabledFor("id"))
Review Comment:
Also assert that a column NOT mentioned in properties returns false (the
parquet default):
assert.False(t, wp.BloomFilterEnabledFor("unmentioned_col"))
This pins the "no property = no bloom filter" contract, which is the
important default behavior.
##########
table/internal/parquet_files.go:
##########
@@ -259,8 +259,27 @@ func (parquetFormat) GetWriteProperties(props
iceberg.Properties) any {
slog.Warn("unrecognized compression codec, falling back to
uncompressed", "codec", compression)
}
- return append(writerProps, parquet.WithCompression(codec),
+ writerProps = append(writerProps, parquet.WithCompression(codec),
Review Comment:
Nit: this is set unconditionally even when no bloom filters are enabled for
any column. Harmless since parquet ignores it when no bloom filters are
written, but slightly noisy. Not blocking.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]