Repository: spark
Updated Branches:
  refs/heads/master 849b465bb -> 277b1924b
[SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages

## What changes were proposed in this pull request?

Adding a global limit on top of the distinct values before sorting and collecting reduces the overall work when the pivot column has more distinct values than the allowed maximum. We also eagerly perform a collect rather than a take, because we know there are at most (maxValues + 1) rows.

## How was this patch tested?

Existing tests cover sorted order.

Author: Patrick Woody <pwo...@palantir.com>

Closes #19629 from pwoody/SPARK-22408.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/277b1924
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/277b1924
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/277b1924

Branch: refs/heads/master
Commit: 277b1924b46a70ab25414f5670eb784906dbbfdf
Parents: 849b465
Author: Patrick Woody <pwo...@palantir.com>
Authored: Thu Nov 2 14:19:21 2017 +0100
Committer: Reynold Xin <r...@databricks.com>
Committed: Thu Nov 2 14:19:21 2017 +0100

----------------------------------------------------------------------
 .../scala/org/apache/spark/sql/RelationalGroupedDataset.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/277b1924/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
index 21e94fa..3e4edd4 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
@@ -321,10 +321,10 @@ class RelationalGroupedDataset protected[sql](
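For context, the pipeline this change produces can be sketched as follows. This is a sketch only: `df`, `pivotColumn`, and `maxValues` are assumed to be the DataFrame, pivot column expression, and configured maximum available inside `RelationalGroupedDataset.pivot`, and a working `SparkSession` is assumed.

```scala
// Sketch of the distinct-pivot-value computation after this patch.
// The limit() caps the number of rows *before* the sort, so Spark never
// sorts or shuffles more than maxValues + 1 distinct values; collect()
// is then safe because the result is known to be small.
val values = df.select(pivotColumn)
  .distinct()
  .limit(maxValues + 1)  // bound the work up front
  .sort(pivotColumn)     // ensure a consistent logical column order
  .collect()             // at most maxValues + 1 rows, so collect eagerly
  .map(_.get(0))
  .toSeq
```

Previously the code converted to an RDD and used `take(maxValues + 1)` after a full sort, which could launch extra incremental-take stages; pushing the limit into the logical plan avoids that.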
     // Get the distinct values of the column and sort them so its consistent
     val values = df.select(pivotColumn)
       .distinct()
+      .limit(maxValues + 1)
       .sort(pivotColumn)  // ensure that the output columns are in a consistent logical order
-      .rdd
+      .collect()
       .map(_.get(0))
-      .take(maxValues + 1)
       .toSeq
 
     if (values.length > maxValues) {