[ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-23957.
-----------------------------
       Resolution: Fixed
         Assignee: Henry Robinson
    Fix Version/s: 2.4.0

> Sorts in subqueries are redundant and can be removed
> ----------------------------------------------------
>
>                 Key: SPARK-23957
>                 URL: https://issues.apache.org/jira/browse/SPARK-23957
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Henry Robinson
>            Assignee: Henry Robinson
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned
> and optimized subqueries should have any sort operators (since the result of
> the subquery is an unordered collection of tuples).
>
> For example:
>
> {{SELECT count(1) FROM (select id FROM dft ORDER by id)}}
>
> has the following plan:
>
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(2) Project
>          +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
>             +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>                +- *(1) Project [id#0L]
>                   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
>
> ... but the sort operator is redundant.
>
> Less intuitively, the sort is also redundant in selections from an ordered
> subquery:
>
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
>
> has plan:
>
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>    +- *(1) Project [id#0L]
>       +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
>
> ... but again, since the subquery returns a bag of tuples, the sort is
> unnecessary.
>
> We should consider adding an optimizer rule that removes a sort inside a
> subquery. SPARK-23375 is related, but removes sorts that are functionally
> redundant because they perform the same ordering.
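
To make the proposed rule concrete, here is a minimal sketch on a toy plan tree. It is not Spark's Catalyst API and not the rule that was merged for this issue; the node classes (Scan, Project, Sort, Limit, Aggregate, Subquery) and the function remove_subquery_sorts are hypothetical names invented for illustration. The sketch assumes a sort inside a subquery matters only when a LIMIT consumes its output, directly or through projections, and it conservatively keeps the sort when the LIMIT sits just outside the subquery.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical toy plan nodes; these are NOT Spark's logical plan classes.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Sort:
    keys: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Limit:
    n: int
    child: "Plan"


@dataclass(frozen=True)
class Aggregate:
    functions: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Subquery:
    child: "Plan"


Plan = Union[Scan, Project, Sort, Limit, Aggregate, Subquery]


def remove_subquery_sorts(plan, in_subquery=False, order_matters=False):
    """Drop Sort nodes inside subqueries unless a Limit depends on their order.

    in_subquery:   True once we have descended below a Subquery node.
    order_matters: True while a Limit above still observes the row order.
    """
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Subquery):
        # Assumption: keep order_matters as-is, so a Limit placed directly
        # above the subquery still protects the sort (conservative choice).
        return Subquery(remove_subquery_sorts(plan.child, True, order_matters))
    if isinstance(plan, Limit):
        # A Limit makes the order of its input observable.
        return Limit(plan.n, remove_subquery_sorts(plan.child, in_subquery, True))
    if isinstance(plan, Project):
        # A projection passes row order through unchanged.
        child = remove_subquery_sorts(plan.child, in_subquery, order_matters)
        return Project(plan.columns, child)
    if isinstance(plan, Aggregate):
        # This toy model treats every aggregate as order-insensitive.
        child = remove_subquery_sorts(plan.child, in_subquery, False)
        return Aggregate(plan.functions, child)
    if isinstance(plan, Sort):
        # Anything below this sort is re-ordered by it, so order stops mattering.
        child = remove_subquery_sorts(plan.child, in_subquery, False)
        if in_subquery and not order_matters:
            return child  # redundant: the subquery yields an unordered bag
        return Sort(plan.keys, child)
    raise TypeError(f"unknown plan node: {plan!r}")


# Toy version of: SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
ordered = Sort(("id",), Project(("id",), Scan("dft")))
count_plan = Aggregate(("count(1)",), Subquery(ordered))
optimized_count = remove_subquery_sorts(count_plan)

# Toy version of: SELECT * FROM (SELECT id FROM dft ORDER BY id LIMIT 10)
limited_plan = Project(("*",), Subquery(Limit(10, ordered)))
optimized_limited = remove_subquery_sorts(limited_plan)
```

In this model the sort disappears from optimized_count, while optimized_limited is returned unchanged because the LIMIT makes the ordering observable. A sort at the top level of the outer query is never touched, since only nodes below a Subquery are candidates for removal.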