[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

Kazuaki Ishizaki (JIRA) Wed, 01 Mar 2017 22:39:59 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891699#comment-15891699
 ]


Kazuaki Ishizaki commented on SPARK-19503:
------------------------------------------

If it is good to leave sort intact for now, do we prune only sort for local 
(i.e. {{Sort(_, false, _)}})?

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19503
>                 URL: https://issues.apache.org/jira/browse/SPARK-19503
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.1.0
>         Environment: Perhaps only a pyspark or databricks AWS issue
>            Reporter: R
>            Priority: Minor
>              Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs shuffle and sort and then count! This is wasteful as sort is not 
> required here and makes me wonder how smart the algebraic optimiser is 
> indeed! The data may be partitioned by known count (such as parquet files) 
> and we should not shuffle to just perform count.
> This may look trivial, but if optimiser fails to recognise this, I wonder 
> what else is it missing especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

Reply via email to