GitHub user gczsjdy opened a pull request: https://github.com/apache/spark/pull/19763

    [SPARK-22537] Aggregation of map output statistics on driver faces single point bottleneck

    ## What changes were proposed in this pull request?

    In adaptive execution, the map output statistics of all mappers are aggregated after the previous stage has executed successfully. The driver performs this aggregation, and it becomes slow when `mapper * shuffle partitions` is large, because the driver uses only a single thread for the computation. This PR uses multiple threads to relieve this single-point bottleneck.

    ## How was this patch tested?

    Test cases are in `MapOutputTrackerSuite.scala`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gczsjdy/spark single_point_mapstatistics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19763

----
commit 5dd04872e983de861a301c22a124dd8923ccc8c6
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-16T02:58:22Z

    Use multi-thread to solve single point bottleneck

commit 819774fc7087c51a4b7b03213bfb330331d6f108
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-16T03:01:21Z

    Add test case

commit da028258bd172b6d3ff89504097fb6651f5c05c0
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-16T03:24:47Z

    Style

----
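
For readers who want a concrete picture of the idea described in this pull request, the following is a minimal, self-contained sketch of summing per-partition output sizes across mappers with several threads. It is not the code from the patch: the object name `ParallelMapStats`, the `aggregate` method, its `parallelism` parameter, and the use of a plain `Array[Array[Long]]` (one row per mapper, one column per shuffle partition) in place of Spark's map status objects are all assumptions made for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical illustration only; names and data layout are not taken from Spark.
object ParallelMapStats {

  /**
   * Sums, for every shuffle partition, the bytes reported by all mappers.
   * sizes(m)(p) is assumed to hold the output size of mapper m for partition p.
   * The partition range is split into contiguous chunks, one task per chunk.
   */
  def aggregate(sizes: Array[Array[Long]], numPartitions: Int, parallelism: Int): Array[Long] = {
    require(parallelism > 0, "parallelism must be positive")
    val totals = new Array[Long](numPartitions)
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Chunk size chosen so that at most `parallelism` tasks are created.
      val chunk = math.max(1, math.ceil(numPartitions.toDouble / parallelism).toInt)
      val tasks = (0 until numPartitions).grouped(chunk).map { range =>
        Future {
          // Each task writes only to its own slice of `totals`,
          // so no locking is needed between tasks.
          for (p <- range; mapper <- sizes) totals(p) += mapper(p)
        }
      }.toList
      // Block until every chunk has been summed.
      Await.result(Future.sequence(tasks), Duration.Inf)
    } finally {
      pool.shutdown()
    }
    totals
  }
}
```

Splitting by partition rather than by mapper means each thread owns a disjoint range of the result array, which avoids synchronization on the totals. Whether the actual patch divides the work the same way should be checked against the diff linked above.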