Count) push down for ORC

Ahmed Hussein (Jira) Thu, 10 Mar 2022 07:56:03 -0800


    [ https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504374#comment-17504374 ]


Ahmed Hussein commented on SPARK-34960:
---------------------------------------

Thanks [~chengsu] for putting up the optimization on pushed aggregates.
I am concerned that the changes introduced in this Jira lead to inconsistent 
behavior in the following scenario:
 * Assume an ORC file with empty column statistics (no_col_stats.orc).
 * Run a read job as {{spark.read.orc(path).selectExpr('count(p)')}} with the 
default configuration. This succeeds.
 * Now, enable {{'spark.sql.orc.aggregatePushdown': 'true'}} and re-run. The job 
throws an exception because the new code assumes that every ORC file has 
file statistics.

In other words, enabling {{spark.sql.orc.aggregatePushdown}} will cause read 
jobs to fail on any ORC file with empty statistics.
This is going to be problematic for users: they would either have to identify 
every ORC file that lacks statistics in advance, or risk job failures at runtime.
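
Until the reader handles this case, a job-level workaround could look roughly like the sketch below. The helper {{count_with_fallback}} is hypothetical and not part of Spark; it assumes {{spark}} is a SparkSession and that any failure with pushdown enabled is caused by missing statistics, which a real job should verify before retrying.

```python
def count_with_fallback(spark, path, column):
    """Count non-null values of `column`, retrying without pushdown on failure.

    Hypothetical helper, not a Spark API. `spark` is expected to be a
    SparkSession (only `conf.set` and `read.orc` are used).
    """
    key = "spark.sql.orc.aggregatePushdown"
    expr = f"count({column})"

    spark.conf.set(key, "true")
    try:
        # First attempt: let Spark answer the count from ORC statistics.
        return spark.read.orc(path).selectExpr(expr).collect()[0][0]
    except Exception:
        # Assumption: the failure is due to missing ORC statistics.
        # Disable pushdown and fall back to a regular scan.
        spark.conf.set(key, "false")
        return spark.read.orc(path).selectExpr(expr).collect()[0][0]
```

This only hides the problem per job; it does not remove the need for a safe fallback inside the ORC reader itself.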

Note that according to the [ORC specification|https://orc.apache.org/specification], the 
statistics are optional, even in the upcoming ORCv2.

I second [~tgraves] that there should be a way to recover safely if those 
fields are missing.
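
To illustrate the kind of safe recovery meant here, the following is a minimal sketch in plain Python, not Spark's actual implementation. The function name and parameters are made up for illustration: {{stats_count}} stands for the value count taken from the file statistics, or {{None}} when the file has none.

```python
from typing import Iterable, Optional


def count_non_null(stats_count: Optional[int],
                   rows: Iterable[Optional[object]]) -> int:
    """Return the non-null count from statistics if present, else scan rows.

    Hypothetical illustration only: `stats_count` models the count recorded
    in the ORC file statistics, and `rows` models the column values.
    """
    if stats_count is not None:
        # Statistics are available: answer without reading the data.
        return stats_count
    # Statistics are missing: fall back to scanning instead of failing.
    return sum(1 for value in rows if value is not None)
```

With this shape, a file without statistics costs a full scan but still returns a correct answer, which matches the behavior when pushdown is disabled.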

> Aggregate (Min/Max/Count) push down for ORC
> -------------------------------------------
>
>                 Key: SPARK-34960
>                 URL: https://issues.apache.org/jira/browse/SPARK-34960
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Cheng Su
>            Assignee: Cheng Su
>            Priority: Minor
>             Fix For: 3.3.0
>
>         Attachments: file_no_stats-orc.tar.gz
>
>
> Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we 
> can also push down certain aggregations into ORC. ORC exposes column 
> statistics in interface `org.apache.orc.Reader` 
> ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118]
>  ), which Spark can utilize for aggregation push down.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

