[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504500#comment-17504500 ]
Cheng Su commented on SPARK-34960:
----------------------------------

Thanks [~tgraves] and [~ahussein] for commenting. Yes, if any ORC file in the table is missing statistics in its file footer, a Spark query with aggregate push down fails loudly. I agree this is not a good user experience, and we plan to add a runtime fallback that reads the real rows from the ORC file when statistics are absent. For now, if you have any concerns about the feature, feel free to leave it disabled in your environment; that is also why the feature is disabled by default, so it cannot break any existing Spark workload. I will create a PR to add documentation describing the current behavior, i.e. the query fails if any file is missing statistics. The runtime fallback logic will probably land in Spark 3.4 (the next release), since it is too tight to finish for Spark 3.3 (the branch cut is this month); Parquet aggregate push down has the same problem.

> Aggregate (Min/Max/Count) push down for ORC
> -------------------------------------------
>
>                 Key: SPARK-34960
>                 URL: https://issues.apache.org/jira/browse/SPARK-34960
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Cheng Su
>            Assignee: Cheng Su
>            Priority: Minor
>             Fix For: 3.3.0
>
>         Attachments: file_no_stats-orc.tar.gz
>
> Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we
> can also push down certain aggregations into ORC. ORC exposes column
> statistics in the interface `org.apache.orc.Reader`
> (https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118),
> which Spark can use for aggregation push down.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
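[Editor's note] The behavior discussed in the comment above — answering MIN/MAX/COUNT from file-footer statistics, failing loudly when a file has none, versus a runtime fallback that scans the real rows — can be sketched as follows. This is a hypothetical Python simulation, not Spark or ORC code: `pushed_down_min`, the `files` structure, and the `runtime_fallback` flag are all illustrative names invented here. (In Spark 3.3 the feature itself is gated behind the `spark.sql.orc.aggregatePushDown` configuration, which defaults to false as the comment notes.)

```python
# Hypothetical model: each ORC file is a dict with optional footer statistics
# and its raw column values. This is NOT the real Spark/ORC API.
files = [
    {"stats": {"min": 1, "max": 9}, "rows": [1, 5, 9]},
    {"stats": None, "rows": [0, 7]},  # footer statistics missing
]

def pushed_down_min(files, runtime_fallback=False):
    """Compute MIN(col) across files using footer statistics.

    If a file has no footer statistics: either fail loudly (the Spark 3.3
    behavior described above) or, with runtime_fallback=True, fall back to
    scanning that file's real rows (the planned behavior)."""
    per_file_mins = []
    for f in files:
        if f["stats"] is not None:
            per_file_mins.append(f["stats"]["min"])   # answer from the footer
        elif runtime_fallback:
            per_file_mins.append(min(f["rows"]))      # scan the real rows
        else:
            raise RuntimeError(
                "ORC file footer is missing statistics; "
                "aggregate push down cannot answer the query")
    return min(per_file_mins)
```

With the current behavior, `pushed_down_min(files)` raises `RuntimeError` because the second file has no statistics; with `runtime_fallback=True` it scans that file's rows and returns 0.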