GitHub user pwoody opened a pull request: https://github.com/apache/spark/pull/19629
[SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages

## What changes were proposed in this pull request?

Adding a global limit on top of the distinct values, before sorting and collecting, reduces the overall work when there are more distinct values than maxValues. We also eagerly perform a collect rather than a take, because we know the result has at most (maxValues + 1) rows.

## How was this patch tested?

Existing tests cover sorted order.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pwoody/spark SPARK-22408

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19629

----
commit aa809e39baf222e698315a5efb2d583cab99aad7
Author: Patrick Woody <pwo...@palantir.com>
Date:   2017-11-01T15:44:51Z

    SPARK-22408: reduce work of calculating pivot distinct values

----
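
The ordering described in this pull request (distinct, then a limit of maxValues + 1, then sort and collect) can be illustrated with a minimal plain-Python sketch. This is not the Spark code from the patch: the function name `distinct_pivot_values`, the `max_values` parameter, and the `ValueError` are hypothetical stand-ins chosen for illustration, and the in-memory list stands in for a distributed dataset.

```python
from itertools import islice


def distinct_pivot_values(values, max_values):
    """Return the sorted distinct values, or fail if there are too many.

    Hypothetical illustration only: mirrors the distinct -> limit -> sort ->
    collect ordering described in the PR text, using plain Python iterables.
    """
    # Distinct: dict.fromkeys keeps the first occurrence of each value.
    distinct = dict.fromkeys(values)

    # Limit: keep at most max_values + 1 distinct values. One extra value is
    # enough to prove that the cap has been exceeded.
    limited = list(islice(distinct, max_values + 1))

    # Sort: the sort now sees at most max_values + 1 items, no matter how
    # many distinct values the input contained.
    collected = sorted(limited)

    # The extra (max_values + 1)-th value signals that the cap was exceeded.
    if len(collected) > max_values:
        raise ValueError(
            f"more than {max_values} distinct pivot values; "
            "specify the pivot values explicitly"
        )
    return collected
```

With the limit applied before the sort, the sorted result is bounded in size, which is why collecting it eagerly is safe under this sketch's assumptions. When the cap is not exceeded, the limit drops nothing, so the sorted order of the result is unchanged.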