[ https://issues.apache.org/jira/browse/SPARK-40280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-40280.
-----------------------------------
    Fix Version/s: 3.4.0
                   3.3.1
                   3.2.3
         Assignee: Robert Joseph Evans
       Resolution: Fixed

> Failure to create parquet predicate push down for ints and longs on some valid files
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-40280
>                 URL: https://issues.apache.org/jira/browse/SPARK-40280
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Major
>             Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> The [parquet format|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#signed-integers]
> specification states that...
>
> bq. {{INT(8, true)}}, {{INT(16, true)}}, and {{INT(32, true)}} must
> annotate an {{int32}} primitive type and {{INT(64, true)}} must annotate an
> {{int64}} primitive type. {{INT(32, true)}} and {{INT(64, true)}} are implied
> by the {{int32}} and {{int64}} primitive types if no other annotation is
> present and should be considered optional.
>
> But the code inside of
> [ParquetFilters.scala|https://github.com/apache/spark/blob/296fe49ec855ac8c15c080e7bab6d519fe504bd3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L125-L126]
> requires that {{int32}} and {{int64}} columns carry no annotation. If
> such a column does carry an annotation and is part of a predicate
> push down, the hard-coded types do not match and the corresponding filter
> ends up being {{None}}.
>
> Because the filter is then never pushed down, this can be a huge
> performance penalty for a valid parquet file.
>
> I am happy to provide files that show the issue if needed for testing.
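The mismatch described in the issue can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: it does not use Spark or Parquet classes, and `IntAnnotation`, `SchemaType`, `STRICT_BUILDERS`, `normalize`, `strict_filter`, and `tolerant_filter` are hypothetical names invented for illustration. The normalization step shows one possible way to treat the implied annotations as optional; it is not presented as the change that actually resolved SPARK-40280.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntAnnotation:
    """Stand-in for Parquet's INT(bit_width, signed) logical type annotation."""
    bit_width: int
    signed: bool


@dataclass(frozen=True)
class SchemaType:
    """Hypothetical lookup key: (logical annotation, primitive type)."""
    annotation: Optional[IntAnnotation]
    primitive: str  # "int32" or "int64"


# Exact-match table in the style the issue describes: only columns with
# no annotation at all have an equality-filter builder.
STRICT_BUILDERS = {
    SchemaType(None, "int32"): lambda col, v: f"eq(intColumn({col}), {v})",
    SchemaType(None, "int64"): lambda col, v: f"eq(longColumn({col}), {v})",
}

# Annotations the Parquet spec says are implied by the bare primitive type.
IMPLIED = {
    "int32": IntAnnotation(32, True),
    "int64": IntAnnotation(64, True),
}


def normalize(t: SchemaType) -> SchemaType:
    """Drop an annotation that only restates the primitive type's default."""
    if t.annotation is not None and t.annotation == IMPLIED.get(t.primitive):
        return SchemaType(None, t.primitive)
    return t


def strict_filter(t: SchemaType, col: str, v: int) -> Optional[str]:
    """Return a filter description, or None when no builder matches exactly."""
    builder = STRICT_BUILDERS.get(t)
    return builder(col, v) if builder else None


def tolerant_filter(t: SchemaType, col: str, v: int) -> Optional[str]:
    """Same lookup, but after treating implied annotations as absent."""
    return strict_filter(normalize(t), col, v)
```

In this model, `strict_filter(SchemaType(IntAnnotation(32, True), "int32"), "a", 5)` finds no builder and returns `None`, which mirrors the lost push down, while `tolerant_filter` with the same arguments yields a filter. Narrower or unsigned annotations such as `INT(8, true)` are deliberately left unnormalized here, since the quoted spec text only calls `INT(32, true)` and `INT(64, true)` optional.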
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org