hussein-awala opened a new issue, #9898:
URL: https://github.com/apache/iceberg/issues/9898
### Apache Iceberg version
1.4.3 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
I have an Iceberg table and want bloom filters on two columns: a top-level
string column (`a`) and a string column nested in a struct (`b.c`). I set the
table properties `write.parquet.bloom-filter-enabled.column.a` and
`write.parquet.bloom-filter-enabled.column.b.c` to `true`, then checked a
written data file with `parquet-cli`. The bloom filter exists for `a` but not
for `b.c`:
```bash
$ parquet bloom-filter /path/to/file.parquet -c a -v <not existing value>
Row group 0:
--------------------------------------------------------------------------------
value <not existing value> NOT exists.
$ parquet bloom-filter /path/to/file.parquet -c a -v <existing value>
Row group 0:
--------------------------------------------------------------------------------
value <existing value> maybe exists.
$ parquet bloom-filter /path/to/file.parquet -c b.c -v <some value>
Row group 0:
--------------------------------------------------------------------------------
column b.c has no bloom filter
# check if it's an issue with column name parsing:
$ parquet bloom-filter /path/to/file.parquet -c b.d -v <some value>
Argument error: Schema doesn't have column: b.d
```
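For reference, a minimal sketch of how a table with these two properties might be created and populated from `spark-shell`. The property names are the ones quoted above; the catalog and table name `local.db.bloom_test`, the field `d`, and the sample rows are hypothetical and only serve the illustration.

```scala
// Assumes a spark-shell session where `spark` is already defined and an
// Iceberg catalog is configured; `local.db.bloom_test` is a hypothetical name.
spark.sql("""
  CREATE TABLE local.db.bloom_test (
    a STRING,
    b STRUCT<c: STRING, d: STRING>
  )
  USING iceberg
  TBLPROPERTIES (
    'write.parquet.bloom-filter-enabled.column.a' = 'true',
    'write.parquet.bloom-filter-enabled.column.b.c' = 'true'
  )
""")

// Write a few rows so that a data file is produced for inspection.
spark.sql("""
  INSERT INTO local.db.bloom_test VALUES
    ('1', named_struct('c', '100', 'd', 'a')),
    ('2', named_struct('c', '200', 'd', 'b'))
""")
```

The resulting Parquet data file can then be inspected with `parquet bloom-filter`, as in the commands above.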
However, I tried the same thing with plain Spark and Parquet (no Iceberg), and it worked without any issue:
```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._
val schema = StructType(Array(
  StructField("a", StringType, true),
  StructField("b", StringType, true),
  StructField("nested", StructType(Array(
    StructField("c", StringType, true),
    StructField("d", StringType, true)
  )), true)
))
val data = Seq(
  Row("1", "25", Row("100", "a")),
  Row("2", "30", Row("200", "b")),
  Row("3", "35", Row("300", "c")),
  Row("4", "40", Row("400", "d")),
  Row("5", "45", Row("500", "e"))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)
df.write.format("parquet")
  .option("parquet.bloom.filter.enabled#a", "true")
  .option("parquet.bloom.filter.enabled#nested.c", "true")
  .save("bloom_parquet")
```
Checking with `parquet-cli`:
```bash
$ parquet bloom-filter \
    bloom_parquet/part-00002-9fac4c38-7113-45df-8db9-d96c3f6b6a8e-c000.snappy.parquet \
    -c a -v "1"
Row group 0:
--------------------------------------------------------------------------------
value 1 maybe exists.
$ parquet bloom-filter \
    bloom_parquet/part-00002-9fac4c38-7113-45df-8db9-d96c3f6b6a8e-c000.snappy.parquet \
    -c nested.c -v "1"
Row group 0:
--------------------------------------------------------------------------------
value 1 NOT exists.
```