Hello Iceberg Devs,
I've been tracking an issue with predicate pushdown in
Iceberg on complex types. I compared the vanilla Spark reader over Parquet
against the Iceberg format reader. I have an example detailing it here:
https://github.com/apache/incubator-iceberg/issues/99
*Vanilla Spark Parquet reader plan*
== Physical Plan ==
*(1) Project [age#428, name#429, friends#430, location#431]
+- *(1) Filter (isnotnull(friends#430) && (friends#430[Josh] = 10))
+- *(1) FileScan parquet [age#428,name#429,friends#430,location#431]
Batched: false, Format: Parquet, Location:
InMemoryFileIndex[file:/usr/local/spark/test/parquet-people-complex],
PartitionFilters: [], PushedFilters: [IsNotNull(friends)], ReadSchema:
struct<age:int,name:string,friends:map<string,int>,location:struct<lat:int,lon:int>>
*Iceberg Plan*
== Physical Plan ==
*(1) Project [age#33, name#34, friends#35]
+- *(1) Filter ((friends#35[Josh] = 10) && isnotnull(friends#35))
+- *(1) ScanV2 iceberg[age#33, name#34, friends#35] (Filters:
[isnotnull(friends#35)], Options: [path=iceberg-people-complex2,paths=[]])
*Couple of points:*
1) The complex predicate is not pushed down to the Scan level in either
plan. It is deemed "non-translatable" by
*DataSourceStrategy.translateFilter()* [1] when converting the Catalyst
expression to a data source filter. Ryan & Xabriel had a discussion earlier
on this list about Spark not passing expressions to data sources (in
certain cases); this might be related to that. One path forward is to fix
that translation in Spark so that Iceberg's filter conversion has a chance
to handle complex types. Currently the Iceberg reader code is unaware of
that filter.
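To illustrate point 1, here is a toy model of what I believe is happening in the translation step. The class and function names are mine, not Spark's; it only mirrors the shape of the logic: predicates whose child is a plain top-level column reference translate to a data source filter, while a map access like friends['Josh'] = 10 falls through and is never offered to the source.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Simplified stand-ins for Catalyst expressions (names are illustrative):
@dataclass
class AttrRef:
    name: str

@dataclass
class GetMapValue:  # e.g. friends['Josh'] -- a complex-type access
    child: Any
    key: str

@dataclass
class EqualTo:
    left: Any
    right: Any

@dataclass
class IsNotNull:
    child: Any

def translate_filter(expr: Any) -> Optional[tuple]:
    """Return a tuple standing in for a data source filter, or None if
    the expression is non-translatable."""
    if isinstance(expr, IsNotNull) and isinstance(expr.child, AttrRef):
        return ("IsNotNull", expr.child.name)
    if isinstance(expr, EqualTo) and isinstance(expr.left, AttrRef):
        return ("EqualTo", expr.left.name, expr.right)
    # No case handles GetMapValue (or struct field access), so complex-type
    # predicates come back as None -- the "non-translatable" outcome.
    return None
```

With this model, translate_filter(IsNotNull(AttrRef("friends"))) succeeds, matching the PushedFilters: [IsNotNull(friends)] we see in both plans, while translate_filter(EqualTo(GetMapValue(AttrRef("friends"), "Josh"), 10)) returns None, which is why the equality predicate never reaches the scan.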
2) Although both vanilla Spark and Iceberg evaluate complex-type predicates
post scan, the regression is that post-scan filtering returns no results in
the Iceberg case. I think the post-scan filter is unable to evaluate rows
in the form the Iceberg reader returns them. So if 1) is not the way
forward, then the alternative is to fix this in the post-scan filtering.
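To make point 2 concrete, here is a toy model (my names, not Spark internals) of the residual filter: Spark keeps the untranslated predicate above the scan and evaluates it row by row. If the reader hands back the map column in a representation the predicate cannot read (my hypothesis for the Iceberg case, modeled here as None), every row is filtered out:

```python
def residual_predicate(row):
    # Models: isnotnull(friends) && friends['Josh'] = 10
    friends = row.get("friends")
    if not isinstance(friends, dict):
        return False  # null / unexpected representation drops the row
    return friends.get("Josh") == 10

def post_scan_filter(scan_output):
    """Apply the residual predicate to whatever rows the scan produced."""
    return [row for row in scan_output if residual_predicate(row)]
```

For example, a row whose friends column comes back as a proper map with Josh -> 10 survives the filter, while a row where the reader returned the column as None (or any non-map representation) is silently dropped, giving the empty result we observe.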
Looking forward to your guidance on the right approach.
Cheers,
-Gautam.
[1] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L450