GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/1110

    [WIP][SPARK-2174][MLLIB] treeReduce and treeAggregate

    In `reduce` and `aggregate`, the driver node spends linear time on the 
number of partitions. It becomes a bottleneck when there are many partitions 
and the data from each partition is big.
    
    SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I 
did several implementations including butterfly, reduce + broadcast, and 
treeReduce + broadcast. treeReduce + BT broadcast seems to be right way to go 
for Spark. Using binary tree may introduce some overhead in communication, 
because the driver still need to coordinate on data shuffling. In my 
experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is 
why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.
    
    I left `treeReduce` and `treeAggregate` public for easy testing.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1110
    
----
commit fe42a5e8f5d002d22bd53a4cbcb81607efa10ab1
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-06-17T08:16:01Z

    add treeAggregate

commit eb71c330973fe3392a08882788553fcba28e7541
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-06-17T08:40:03Z

    add treeReduce

commit 0f944908cb4b5ce8b91456d103d913bfbf764687
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-06-17T08:52:20Z

    add docs

commit be6a88a9ddebb26111b2df339f8e2217eec73033
Author: Xiangrui Meng <m...@databricks.com>
Date:   2014-06-17T09:08:46Z

    use treeAggregate in mllib

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to