Robert Joseph Evans created SPARK-40280:
-------------------------------------------

             Summary: Failure to create parquet predicate push down for ints and longs on some valid files
                 Key: SPARK-40280
                 URL: https://issues.apache.org/jira/browse/SPARK-40280
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.4.0
            Reporter: Robert Joseph Evans


The [parquet format|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#signed-integers] specification states:

bq. {{INT(8, true)}}, {{INT(16, true)}}, and {{INT(32, true)}} must 
annotate an {{int32}} primitive type and {{INT(64, true)}} must annotate an 
{{int64}} primitive type. {{INT(32, true)}} and {{INT(64, true)}} are implied 
by the {{int32}} and {{int64}} primitive types if no other annotation is 
present and should be considered optional.

But the code in 
[ParquetFilters.scala|https://github.com/apache/spark/blob/296fe49ec855ac8c15c080e7bab6d519fe504bd3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L125-L126]
 only creates a filter for {{int32}} and {{int64}} columns that carry no 
annotation. If such a column has an explicit annotation and is part of a 
predicate push down, the hard-coded types do not match and the corresponding 
filter ends up being {{None}}.
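
To illustrate the mismatch, here is a minimal, dependency-free sketch in Scala. It is a simplified model and not the actual Spark code: {{SchemaType}}, {{IntAnnotation}}, {{strictFilter}}, {{normalize}} and {{lenientFilter}} are hypothetical names, and filters are represented by plain strings. Dropping the annotations that the spec says are implied before the lookup is only one possible approach, assumed here for illustration.

```scala
// Simplified model of the matching problem; every name below is a
// hypothetical stand-in, not Spark's or parquet-mr's API.
object PushdownSketch {
  // Stand-in for a Parquet INT(bitWidth, signed) logical type annotation.
  final case class IntAnnotation(bitWidth: Int, signed: Boolean)

  // Stand-in for the (annotation, primitive type) pair a filter is keyed on.
  final case class SchemaType(annotation: Option[IntAnnotation], primitive: String)

  // Hard-coded keys that only accept un-annotated int32 / int64 columns.
  private val PlainInt32 = SchemaType(None, "INT32")
  private val PlainInt64 = SchemaType(None, "INT64")

  // Strict lookup: an explicit INT(32, true) or INT(64, true) annotation
  // matches neither case, so no filter is produced.
  def strictFilter(t: SchemaType): Option[String] = t match {
    case PlainInt32 => Some("int32 filter")
    case PlainInt64 => Some("int64 filter")
    case _          => None
  }

  // Drop the annotations that the spec describes as implied by the
  // primitive type; everything else passes through unchanged.
  def normalize(t: SchemaType): SchemaType = t match {
    case SchemaType(Some(IntAnnotation(32, true)), "INT32") => PlainInt32
    case SchemaType(Some(IntAnnotation(64, true)), "INT64") => PlainInt64
    case other                                              => other
  }

  // Lenient lookup: normalize first, then reuse the strict lookup.
  def lenientFilter(t: SchemaType): Option[String] = strictFilter(normalize(t))
}
```

In this model, {{strictFilter(SchemaType(Some(IntAnnotation(32, true)), "INT32"))}} yields {{None}}, while {{lenientFilter}} on the same input yields a filter.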

Because no filter is pushed down for these columns, reading a valid parquet file can incur a huge performance penalty.

I am happy to provide files that show the issue if needed for testing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
