[ https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
R updated SPARK-19503:
----------------------
    Summary: Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()  (was: Dumb Execution Plan)

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19503
>                 URL: https://issues.apache.org/jira/browse/SPARK-19503
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.1.0
>         Environment: Perhaps only a pyspark or databricks AWS issue
>            Reporter: R
>            Priority: Minor
>              Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs a shuffle and a sort and then counts! This is wasteful, as the sort
> is not required here, and it makes me wonder how smart the algebraic
> optimiser really is. The data may already be partitioned with known counts
> (such as parquet files), and we should not shuffle just to perform a count.
> This may look trivial, but if the optimiser fails to recognise this, I wonder
> what else it is missing, especially in more complex operations.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
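
The rewrite requested in the report can be illustrated with a toy logical plan. The sketch below is not Spark's Catalyst API: the node classes `Scan`, `Sort`, `Limit`, `Count` and the functions `strip_sorts` and `optimize` are hypothetical names invented for illustration. It assumes only that a count does not depend on row order, so any sort feeding a count directly can be dropped, while a sort underneath an order-sensitive node such as a limit is conservatively kept.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical plan nodes for illustration only; these are NOT Spark classes.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Sort:
    child: "Plan"
    key: str


@dataclass(frozen=True)
class Limit:
    child: "Plan"
    n: int


@dataclass(frozen=True)
class Count:
    child: "Plan"


Plan = Union[Scan, Sort, Limit, Count]


def strip_sorts(plan: Plan) -> Plan:
    """Remove Sort nodes from the top of a plan until a non-Sort node is reached."""
    while isinstance(plan, Sort):
        plan = plan.child
    return plan


def optimize(plan: Plan) -> Plan:
    """Drop sorts that feed a Count directly; leave every other node unchanged."""
    if isinstance(plan, Count):
        # Counting rows is order-insensitive, so a sort directly below is wasted work.
        return Count(optimize(strip_sorts(plan.child)))
    if isinstance(plan, Sort):
        return Sort(optimize(plan.child), plan.key)
    if isinstance(plan, Limit):
        # The sort below a limit decides which rows survive, so this sketch keeps it.
        return Limit(optimize(plan.child), plan.n)
    return plan


if __name__ == "__main__":
    # Toy analogue of df.sort("ts").count() on a hypothetical table "events".
    print(optimize(Count(Sort(Scan("events"), "ts"))))
    # prints: Count(child=Scan(table='events'))
```

The rule is deliberately narrow: it removes only sorts that sit immediately under the count, which is the `df.sort(...).count()` shape described in the report, and it makes no claim about how a real optimizer would handle shuffles or partition metadata.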