[jira] [Commented] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC

Cheng Su (Jira) Thu, 10 Mar 2022 10:17:04 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504500#comment-17504500
 ]


Cheng Su commented on SPARK-34960:
----------------------------------

Thanks [~tgraves] and [~ahussein] for commenting. Yes, if any ORC file of a 
table is missing statistics in its file footer, a Spark query with aggregate 
push down fails loudly. I agree this is not a good user experience, and we 
are planning to work on a runtime fallback that reads the actual rows of the 
ORC file when statistics are missing.
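
To make the two behaviors concrete, here is a minimal sketch in plain Python. It is not Spark or ORC code: `ColumnStats`, `OrcFile`, `MissingStatisticsError` and `aggregate` are hypothetical names invented for illustration, and a single integer column stands in for a table. It shows the current behavior (fail if any file lacks footer statistics) next to the planned per-file fallback to reading rows.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ColumnStats:
    """Hypothetical stand-in for one column's statistics in an ORC file footer."""
    count: int
    minimum: int
    maximum: int


@dataclass
class OrcFile:
    """Hypothetical model of one ORC file: its rows plus optional footer statistics."""
    rows: List[int]
    stats: Optional[ColumnStats] = None


class MissingStatisticsError(RuntimeError):
    """Raised when a file has no footer statistics and fallback is disabled."""


def aggregate(files: List[OrcFile],
              fallback: bool = False) -> Tuple[int, Optional[int], Optional[int]]:
    """Return (count, min, max) over all files, preferring footer statistics.

    fallback=False models the current behavior: fail on the first file
    without statistics. fallback=True models the planned behavior: scan
    the rows of only those files that lack statistics.
    """
    count, lo, hi = 0, None, None
    for index, f in enumerate(files):
        stats = f.stats
        if stats is None:
            if not fallback:
                raise MissingStatisticsError(
                    f"file {index} has no footer statistics")
            if not f.rows:
                continue  # an empty file contributes nothing
            # Fallback: compute the statistics from the actual rows.
            stats = ColumnStats(len(f.rows), min(f.rows), max(f.rows))
        count += stats.count
        lo = stats.minimum if lo is None else min(lo, stats.minimum)
        hi = stats.maximum if hi is None else max(hi, stats.maximum)
    return count, lo, hi


files = [
    OrcFile(rows=[3, 1, 2], stats=ColumnStats(count=3, minimum=1, maximum=3)),
    OrcFile(rows=[10, -5]),  # footer statistics missing
]

print(aggregate(files, fallback=True))  # (5, -5, 10)

try:
    aggregate(files)  # strict mode: the second file has no statistics
except MissingStatisticsError as err:
    print("query failed:", err)
```

The fallback is applied per file, so files that do carry statistics are still answered from their footers and only the incomplete ones are scanned.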

For now, if you have any concerns about the feature, feel free to leave it 
disabled in your environment. That is why the feature is disabled by default: 
to avoid failing any existing Spark workload.
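
As a configuration sketch: the setting is not named in this comment, but I believe the flag in question is the Spark SQL property `spark.sql.orc.aggregatePushdown`, which defaults to `false`. Treat the name as an assumption and check it against the documentation for your Spark version. Leaving it unset, or pinning it explicitly in `spark-defaults.conf`, keeps the feature off:

```properties
# Assumed property name; verify against your Spark version's documentation.
# false (the default) keeps ORC aggregate push down disabled.
spark.sql.orc.aggregatePushdown    false
```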

In the meantime, I will create a PR to document this behavior, i.e. the query 
fails if any file is missing statistics. The runtime fallback logic will 
probably be added in Spark 3.4 (the release after next), as the schedule is 
too tight to work on it for Spark 3.3 (we are cutting the branch this month). 
We have a similar problem with Parquet aggregate push down as well.

> Aggregate (Min/Max/Count) push down for ORC
> -------------------------------------------
>
>                 Key: SPARK-34960
>                 URL: https://issues.apache.org/jira/browse/SPARK-34960
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Cheng Su
>            Assignee: Cheng Su
>            Priority: Minor
>             Fix For: 3.3.0
>
>         Attachments: file_no_stats-orc.tar.gz
>
>
> Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we 
> can also push down certain aggregations into ORC. ORC exposes column 
> statistics in the interface `org.apache.orc.Reader` 
> ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118]
>  ), which Spark can utilize for aggregation push down.


