GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1110
[WIP][SPARK-2174][MLLIB] treeReduce and treeAggregate In `reduce` and `aggregate`, the driver node spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big. SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be right way to go for Spark. Using binary tree may introduce some overhead in communication, because the driver still need to coordinate on data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing. I left `treeReduce` and `treeAggregate` public for easy testing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark tree Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1110.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1110 ---- commit fe42a5e8f5d002d22bd53a4cbcb81607efa10ab1 Author: Xiangrui Meng <m...@databricks.com> Date: 2014-06-17T08:16:01Z add treeAggregate commit eb71c330973fe3392a08882788553fcba28e7541 Author: Xiangrui Meng <m...@databricks.com> Date: 2014-06-17T08:40:03Z add treeReduce commit 0f944908cb4b5ce8b91456d103d913bfbf764687 Author: Xiangrui Meng <m...@databricks.com> Date: 2014-06-17T08:52:20Z add docs commit be6a88a9ddebb26111b2df339f8e2217eec73033 Author: Xiangrui Meng <m...@databricks.com> Date: 2014-06-17T09:08:46Z use treeAggregate in mllib ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---