[ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503221#comment-17503221 ]

Thomas Graves commented on SPARK-34960:
---------------------------------------

If I'm reading the ORC spec right, the ColumnStatistics in the file footer are 
optional in ORC, correct?

I'm assuming that is why the PR says "If the file does not have valid 
statistics, Spark will throw exception and fail query."  I guess the only way 
to know whether the statistics are there is to read the file, so we can't 
really determine that ahead of time?

At the very least, this seems like behavior that should be documented.  I want 
to make sure I'm not missing something here.
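
For reference, a minimal sketch (not from the PR) of what that "read it to find 
out" check looks like against the org.apache.orc.Reader interface linked in the 
description; the file path argument and the assumption of a reasonably recent 
ORC release are placeholders. It just dumps whatever file-level statistics the 
footer carries:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class CheckOrcStats {
  public static void main(String[] args) throws Exception {
    // Placeholder path; the point is that the footer has to be opened
    // before we know what statistics (if any) are present.
    Path path = new Path(args[0]);
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(new Configuration()));

    // getStatistics() deserializes the file-level column statistics from the footer.
    // Column 0 is the root struct; user columns start at index 1.
    ColumnStatistics[] stats = reader.getStatistics();
    for (int i = 0; i < stats.length; i++) {
      System.out.println("column " + i
          + ": values=" + stats[i].getNumberOfValues()
          + ", hasNull=" + stats[i].hasNull()
          + ", type=" + stats[i].getClass().getSimpleName());
    }
  }
}
{code}

In other words, the statistics only materialize once the footer is read; there 
is no cheaper signal available before opening the file.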

> Aggregate (Min/Max/Count) push down for ORC
> -------------------------------------------
>
>                 Key: SPARK-34960
>                 URL: https://issues.apache.org/jira/browse/SPARK-34960
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Cheng Su
>            Assignee: Cheng Su
>            Priority: Minor
>             Fix For: 3.3.0
>
>
> Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we 
> can also push down certain aggregations into ORC. ORC exposes column 
> statistics in the interface `org.apache.orc.Reader` 
> (https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118), 
> which Spark can utilize for aggregation push down.
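
Below is a rough, non-authoritative illustration of the mechanism the 
description refers to, using only the org.apache.orc Java API: COUNT(*) from 
the footer row count, and MIN/MAX from the file-level column statistics. The 
file path and the assumption that column 1 is integer-typed are illustrative, 
not from this issue:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.IntegerColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class AggFromOrcFooter {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path(args[0]),
        OrcFile.readerOptions(new Configuration()));

    // COUNT(*) comes straight from the footer row count; no row scan needed.
    System.out.println("count(*) = " + reader.getNumberOfRows());

    // MIN/MAX come from per-column statistics; column 0 is the root struct,
    // so the first user column is index 1 (assumed here to be integer-typed).
    ColumnStatistics[] stats = reader.getStatistics();
    if (stats.length > 1 && stats[1] instanceof IntegerColumnStatistics) {
      IntegerColumnStatistics c = (IntegerColumnStatistics) stats[1];
      System.out.println("min = " + c.getMinimum() + ", max = " + c.getMaximum());
    } else {
      // This is the case the comment above asks about: no usable statistics,
      // so a pushed-down aggregate cannot be answered from the footer alone.
      System.out.println("no integer statistics for column 1");
    }
  }
}
{code}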


