Repository: spark
Updated Branches:
  refs/heads/master 849b465bb -> 277b1924b
[SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages

## What changes were proposed in this pull request?

Adding a global limit on top of the distinct values before sorting and collecting reduces the overall work when the pivot column has more distinct values than the allowed maximum. We also eagerly perform a collect rather than a take, because we know there are at most (maxValues + 1) rows.

## How was this patch tested?

Existing tests cover sorted order.

Author: Patrick Woody <pwo...@palantir.com>

Closes #19629 from pwoody/SPARK-22408.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/277b1924
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/277b1924
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/277b1924

Branch: refs/heads/master
Commit: 277b1924b46a70ab25414f5670eb784906dbbfdf
Parents: 849b465
Author: Patrick Woody <pwo...@palantir.com>
Authored: Thu Nov 2 14:19:21 2017 +0100
Committer: Reynold Xin <r...@databricks.com>
Committed: Thu Nov 2 14:19:21 2017 +0100

----------------------------------------------------------------------
 .../scala/org/apache/spark/sql/RelationalGroupedDataset.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/277b1924/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
index 21e94fa..3e4edd4 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
@@ -321,10 +321,10 @@ class RelationalGroupedDataset protected[sql](
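For context, the pipeline this change produces can be sketched as follows. This is a sketch only: `df`, `pivotColumn`, and `maxValues` are assumed to be the DataFrame, pivot column expression, and configured maximum available inside `RelationalGroupedDataset.pivot`, and a working `SparkSession` is assumed.

```scala
// Sketch of the distinct-pivot-value computation after this patch.
// The limit() caps the number of rows *before* the sort, so Spark never
// sorts or shuffles more than maxValues + 1 distinct values; collect()
// is then safe because the result is known to be small.
val values = df.select(pivotColumn)
  .distinct()
  .limit(maxValues + 1)  // bound the work up front
  .sort(pivotColumn)     // ensure a consistent logical column order
  .collect()             // at most maxValues + 1 rows, so collect eagerly
  .map(_.get(0))
  .toSeq
```

Previously the code converted to an RDD and used `take(maxValues + 1)` after a full sort, which could launch extra incremental-take stages; pushing the limit into the logical plan avoids that.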
     // Get the distinct values of the column and sort them so its consistent
     val values = df.select(pivotColumn)
       .distinct()
+      .limit(maxValues + 1)
       .sort(pivotColumn)  // ensure that the output columns are in a consistent logical order
-      .rdd
+      .collect()
       .map(_.get(0))
-      .take(maxValues + 1)
       .toSeq
 
     if (values.length > maxValues) {