[jira] [Updated] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

R (JIRA) Tue, 07 Feb 2017 14:56:11 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


R updated SPARK-19503:
----------------------
    Summary: Execution Plan Optimizer: avoid sort or shuffle when it does not 
change end result such as df.sort(...).count()  (was: Dumb Execution Plan)

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19503
>                 URL: https://issues.apache.org/jira/browse/SPARK-19503
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.1.0
>         Environment: Perhaps only a pyspark or databricks AWS issue
>            Reporter: R
>            Priority: Minor
>              Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs shuffle and sort and then count! This is wasteful as sort is not 
> required here and makes me wonder how smart the algebraic optimiser is 
> indeed! The data may be partitioned by known count (such as parquet files) 
> and we should not shuffle to just perform count.
> This may look trivial, but if optimiser fails to recognise this, I wonder 
> what else is it missing especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

Reply via email to