[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-07-01 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47683286
  
We benchmarked treeReduce in our random forest implementation. Because 
the trees generated from each partition are fairly large (more than 100 MB), 
treeReduce significantly reduced the shuffle time, from 6 minutes to 
2 minutes. Nice work! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-07-01 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-47686100
  
@dbtsai Thanks for testing it! I'm going to move `treeReduce` and 
`treeAggregate` to `mllib.rdd.RDDFunctions`. For normal data processing, people 
generally use more partitions than the number of cores. In those cases, the 
driver can collect task results while other tasks are still running. This is 
not the optimal case for machine learning algorithms, so I think we can keep 
`treeReduce` and `treeAggregate` in mllib for now.




[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-06-17 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/1110

[WIP][SPARK-2174][MLLIB] treeReduce and treeAggregate

In `reduce` and `aggregate`, the driver node spends time linear in the 
number of partitions. This becomes a bottleneck when there are many partitions 
and the data from each partition is large.

SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I 
tried several implementations, including butterfly, reduce + broadcast, and 
treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to 
go for Spark. Using a binary tree may introduce some communication overhead, 
because the driver still needs to coordinate the data shuffling. In my 
experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is 
why I set depth = 2 in MLlib algorithms. But it certainly needs more testing.
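
To illustrate the n -> sqrt(n) -> 1 idea, here is a minimal pure-Python 
simulation. It is only a sketch and not the code in this pull request: the 
function name `tree_reduce`, its return value, and the round-robin grouping 
are assumptions made for illustration. Each list element stands for one 
partition's result, and the function counts how many merges are left for the 
final (driver-side) step.

```python
import math
from functools import reduce


def tree_reduce(partition_results, op, depth=2):
    """Simulate a multi-level reduce over per-partition results.

    Hypothetical helper for illustration only. Returns a tuple
    (result, driver_merges), where driver_merges is the number of
    merges performed in the last step.
    """
    values = list(partition_results)
    if not values:
        raise ValueError("cannot reduce an empty collection")
    if depth < 1:
        raise ValueError("depth must be at least 1")

    n = len(values)
    # Fan-in per level, chosen so that `depth` levels shrink n values to 1.
    # With depth = 2 this is roughly sqrt(n).
    scale = max(2, math.ceil(n ** (1.0 / depth)))

    # Intermediate levels: groups of partial results are combined
    # (on executors, in a real cluster) before anything reaches the driver.
    while len(values) > scale:
        num_groups = math.ceil(len(values) / scale)
        groups = [[] for _ in range(num_groups)]
        for i, value in enumerate(values):
            groups[i % num_groups].append(value)
        values = [reduce(op, group) for group in groups]

    # Final level: whatever is left is merged in one place.
    driver_merges = len(values) - 1
    return reduce(op, values), driver_merges


def add(a, b):
    return a + b


# 100 partitions, two levels: 100 -> 10 -> 1
print(tree_reduce(range(100), add, depth=2))  # (4950, 9)
# depth = 1 behaves like a flat reduce: 100 -> 1
print(tree_reduce(range(100), add, depth=1))  # (4950, 99)
```

With 100 simulated partitions, the two-level version leaves 9 merges for the 
final step instead of 99, which is the effect the paragraph above describes.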

I left `treeReduce` and `treeAggregate` public for easy testing.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1110


commit fe42a5e8f5d002d22bd53a4cbcb81607efa10ab1
Author: Xiangrui Meng m...@databricks.com
Date:   2014-06-17T08:16:01Z

add treeAggregate

commit eb71c330973fe3392a08882788553fcba28e7541
Author: Xiangrui Meng m...@databricks.com
Date:   2014-06-17T08:40:03Z

add treeReduce

commit 0f944908cb4b5ce8b91456d103d913bfbf764687
Author: Xiangrui Meng m...@databricks.com
Date:   2014-06-17T08:52:20Z

add docs

commit be6a88a9ddebb26111b2df339f8e2217eec73033
Author: Xiangrui Meng m...@databricks.com
Date:   2014-06-17T09:08:46Z

use treeAggregate in mllib






[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-46380502
  
Merged build triggered.




[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-46380509
  
Merged build started. 




[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-46382961
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1110#issuecomment-46382962
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15860/

