[GitHub] spark pull request #19763: [SPARK-22537] Aggregation of map output statistic...

gczsjdy Wed, 15 Nov 2017 20:22:31 -0800

GitHub user gczsjdy opened a pull request:

    https://github.com/apache/spark/pull/19763


    [SPARK-22537] Aggregation of map output statistics on driver faces single 
point bottleneck

    ## What changes were proposed in this pull request?
    
    In adaptive execution, the map output statistics of all mappers are 
aggregated after the previous stage has executed successfully. The driver 
performs this aggregation on a single thread, so it becomes slow when 
`number of mappers * number of shuffle partitions` is large. This PR uses 
multiple threads to remove this single-point bottleneck.
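    
    The idea can be illustrated with a minimal sketch. It is not the code in this patch: the object name `MapStatsAggregation`, the `sizes(mapper)(partition)` layout, and the `parallelism` parameter are assumptions made only for illustration. Each task sums a disjoint range of shuffle partitions across all mappers, so the tasks never write to the same array slot.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical stand-in for the driver-side aggregation; not Spark's actual API.
object MapStatsAggregation {

  // sizes(m)(p) = bytes that mapper m wrote for shuffle partition p (assumed layout).
  // Single-threaded baseline: cost grows with mappers * partitions.
  def aggregateSerial(sizes: Array[Array[Long]], numPartitions: Int): Array[Long] = {
    val totals = new Array[Long](numPartitions)
    for (mapper <- sizes; p <- 0 until numPartitions) totals(p) += mapper(p)
    totals
  }

  // Multi-threaded variant: split the partition index range into contiguous
  // chunks and let each task sum its own chunk across all mappers.
  def aggregateParallel(
      sizes: Array[Array[Long]],
      numPartitions: Int,
      parallelism: Int): Array[Long] = {
    require(parallelism > 0, "parallelism must be positive")
    val totals = new Array[Long](numPartitions)
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Ceiling division so that at most `parallelism` chunks are created.
      val chunk = math.max(1, (numPartitions + parallelism - 1) / parallelism)
      val tasks = (0 until numPartitions by chunk).map { start =>
        Future {
          val end = math.min(start + chunk, numPartitions)
          // Each task writes only to totals(start until end), so no locking is needed.
          for (mapper <- sizes; p <- start until end) totals(p) += mapper(p)
        }
      }
      // Block until every chunk is done; this also makes the writes visible here.
      Await.result(Future.sequence(tasks), Duration.Inf)
    } finally {
      pool.shutdown()
    }
    totals
  }
}
```

    Splitting by partition range rather than by mapper avoids contention on the result array, at the cost of every task scanning every mapper's statistics. Whether the thread pool pays off depends on how large `mappers * partitions` is, so a real implementation would likely fall back to the serial path below some threshold.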
    
    ## How was this patch tested?
    
    Test cases are in `MapOutputTrackerSuite.scala`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gczsjdy/spark single_point_mapstatistics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19763
    
----
commit 5dd04872e983de861a301c22a124dd8923ccc8c6
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-16T02:58:22Z

    Use multi-thread to solve single point bottleneck

commit 819774fc7087c51a4b7b03213bfb330331d6f108
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-16T03:01:21Z

    Add test case

commit da028258bd172b6d3ff89504097fb6651f5c05c0
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-16T03:24:47Z

    Style

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
