[GitHub] spark pull request #19629: [SPARK-22408][SQL] RelationalGroupedDataset's dis...

pwoody Wed, 01 Nov 2017 09:10:33 -0700

GitHub user pwoody opened a pull request:

    https://github.com/apache/spark/pull/19629


    [SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value 
calculation launches unnecessary stages

    ## What changes were proposed in this pull request?
    
    Add a global limit on top of the distinct values before sorting and 
collecting. This reduces the overall work when there are more distinct values 
than maxValues, since at most (maxValues + 1) rows reach the sort. We also 
eagerly perform a collect rather than a take, because we know the result has at 
most (maxValues + 1) rows.
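
    As a rough illustration only (not the code in this patch), the idea of 
limiting before sorting can be modeled with plain Python iterators instead of 
Spark's Dataset API. The names `pivot_values` and `max_values`, and the error 
raised, are hypothetical and chosen for this sketch.

```python
from itertools import islice


def pivot_values(rows, max_values):
    """Toy model of: distinct -> limit(max_values + 1) -> sort -> collect.

    Hypothetical helper for illustration; it is not Spark's implementation.
    """

    def distinct(values):
        # Yield each value the first time it is seen.
        seen = set()
        for value in values:
            if value not in seen:
                seen.add(value)
                yield value

    # The limit is applied before sorting, so at most max_values + 1
    # distinct values are ever materialized and sorted.
    limited = list(islice(distinct(rows), max_values + 1))

    # One extra row is enough to detect that the cap was exceeded.
    if len(limited) > max_values:
        raise ValueError(f"more than {max_values} distinct pivot values")

    # Under the cap, every distinct value survived the limit, so the
    # sorted result matches sorting all distinct values.
    return sorted(limited)
```

    When the number of distinct values is within the cap, the limit drops 
nothing and the sorted output is unchanged. When it is over the cap, the sketch 
stops after (max_values + 1) distinct values rather than sorting all of them.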
    
    ## How was this patch tested?
    
    Existing tests already cover the sorted order of the pivot values.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pwoody/spark SPARK-22408

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19629
    
----
commit aa809e39baf222e698315a5efb2d583cab99aad7
Author: Patrick Woody <pwo...@palantir.com>
Date:   2017-11-01T15:44:51Z

    SPARK-22408: reduce work of calculating pivot distinct values

----

