psvri commented on PR #15070:
URL: https://github.com/apache/iceberg/pull/15070#issuecomment-3986957475
Hello,
Really sorry for the late reply, things got in the way. The issue happens
when a data file has a non-zero NaN count together with non-null lower and
upper bounds in its metrics. The PySpark snippet below reproduces it.
```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HadoopCatalogExample") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
    .config("spark.sql.catalog.spark_catalog.warehouse",
            "/tmp/iceberg_warehouse") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1") \
    .getOrCreate()

table_name = 'temp'
spark.sql(f"CREATE TABLE {table_name} (id int, data float) "
          "USING iceberg PARTITIONED BY (id)").show()
spark.sql(f"INSERT INTO {table_name} VALUES (2, 2), (2, float('nan'))")
spark.sql(f"select readable_metrics.data.* from "
          f"default.{table_name}.files").show(truncate=False, vertical=True)
-RECORD 0---------------
 column_size      | 49
 value_count      | 2
 null_value_count | 0
 nan_value_count  | 1
 lower_bound      | 2.0
 upper_bound      | 2.0

spark.conf.set("spark.sql.iceberg.aggregate-push-down.enabled", "true")
spark.sql(f"select max(data) from {table_name}").show()
+---------+
|max(data)|
+---------+
|      2.0|
+---------+

spark.conf.set("spark.sql.iceberg.aggregate-push-down.enabled", "false")
spark.sql(f"select max(data) from {table_name}").show()
+---------+
|max(data)|
+---------+
|      NaN|
+---------+
```
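The core of the wrong answer is that Iceberg's column bounds exclude NaN values, so a metrics-only MAX never sees the NaN row. Here is a minimal sketch of that mismatch, assuming hypothetical names (`file_metrics`, `pushed_down_max`) that are illustrative only and not Iceberg's API:

```python
# Illustrative sketch (not Iceberg code): the upper bound excludes NaN,
# so answering MAX from metrics alone drops the NaN result.
import math

# Per-file readable metrics for the single data file above.
file_metrics = [{"nan_value_count": 1, "upper_bound": 2.0}]

# Aggregate pushdown answers MAX(data) from the upper bounds alone.
pushed_down_max = max(m["upper_bound"] for m in file_metrics)

# A real scan sees the rows, and Spark SQL treats NaN as larger than
# any other float value when ordering/aggregating.
rows = [2.0, float("nan")]
scanned_max = max(rows, key=lambda v: (math.isnan(v), v))

print(pushed_down_max)  # 2.0 -- NaN is invisible in the bounds
print(scanned_max)      # nan -- what the full scan returns
```

This is why pushdown returns `2.0` while the non-pushdown query returns `NaN`.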
Now, coming to why the bug was not triggered by the existing unit test case: if
we inspect the readable metrics of the data files created in the unit test, we
see the following.
```
-RECORD 0----------------
column_size | 78
value_count | 2
null_value_count | 0
nan_value_count | 2
lower_bound | NULL
upper_bound | NULL
-RECORD 1----------------
column_size | 49
value_count | 2
null_value_count | 0
nan_value_count | 1
lower_bound | 2.0
upper_bound | 2.0
-RECORD 2----------------
column_size | 49
value_count | 2
null_value_count | 0
nan_value_count | 1
lower_bound | 1.0
upper_bound | 1.0
```
So basically, we have one file where the lower bound and upper bound are
null. This means
[MaxAggregate.hasValue](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java#L42)
returns false for that file, which in turn sets `isValid` to false
[here](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java#L151).
Next, in
[SparkScanBuilder](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L149)
we [check whether each aggregate is
valid](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/AggregateEvaluator.java#L90).
Because `isValid` is false, [this
return](https://github.com/apache/iceberg/blob/cf25bed6f41627e6d8d343ef63845d88209b77fd/api/src/main/java/org/apache/iceberg/expressions/BoundAggregate.java#L172)
makes the check fail, and hence aggregate pushdown doesn't happen.
The reason my one-line change in the UT triggers the bug is that we no longer
have a file that contains only NaN values and therefore has no lower bound or
upper bound, as shown below. Because every file has lower and upper bounds,
`isValid` is never set to false, so the check in SparkScanBuilder returns true
and pushdown happens.
```
-RECORD 0----------------
column_size | 83
value_count | 3
null_value_count | 0
nan_value_count | 2
lower_bound | 10.0
upper_bound | 10.0
-RECORD 1----------------
column_size | 49
value_count | 2
null_value_count | 0
nan_value_count | 1
lower_bound | 2.0
upper_bound | 2.0
-RECORD 2----------------
column_size | 49
value_count | 2
null_value_count | 0
nan_value_count | 1
lower_bound | 1.0
upper_bound | 1.0
```
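The validity gate described above can be sketched as follows. The function names here (`has_value`, `can_push_down`) are illustrative stand-ins for `MaxAggregate.hasValue` and the `AggregateEvaluator` check, not the actual Iceberg code:

```python
# Illustrative sketch (not Iceberg code) of the pushdown validity gate:
# pushdown is only taken when every file yields a valid aggregate value.

def has_value(file_metrics):
    # A file whose values are all NaN is written with null bounds,
    # so the max aggregate has no value for it.
    return file_metrics["upper_bound"] is not None

def can_push_down(files):
    # Mirrors the AggregateEvaluator check: all files must be valid.
    return all(has_value(f) for f in files)

# Original UT: one all-NaN file with null bounds -> pushdown disabled,
# so the metrics-based (NaN-ignoring) answer is never used and the bug hides.
original_ut_files = [
    {"upper_bound": None},   # file containing only NaN values
    {"upper_bound": 2.0},
    {"upper_bound": 1.0},
]
print(can_push_down(original_ut_files))  # False

# Modified UT: every file has bounds -> pushdown enabled,
# and the incorrect metrics-based answer surfaces.
modified_ut_files = [
    {"upper_bound": 10.0},
    {"upper_bound": 2.0},
    {"upper_bound": 1.0},
]
print(can_push_down(modified_ut_files))  # True
```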
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]