[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1110#issuecomment-47683286 We benchmarked treeReduce in our random forest implementation, and since the trees generated from each partition are fairly large (more than 100MB), we found that treeReduce can significantly reduce the shuffle time from 6mins to 2mins. Nice work! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1110#issuecomment-47686100 @dbtsai Thanks for testing it! I'm going to move `treeReduce` and `treeAggregate` to `mllib.rdd.RDDFunctions`. For normal data processing, people generally use more partitions than number of cores. In those cases, the driver can collect task result while other tasks are running. This is not the optimal case for machine learning algorithms. So I think we can keep `treeReduce` and `treeAggregate` in mllib for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1110 [WIP][SPARK-2174][MLLIB] treeReduce and treeAggregate In `reduce` and `aggregate`, the driver node spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big. SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be right way to go for Spark. Using binary tree may introduce some overhead in communication, because the driver still need to coordinate on data shuffling. In my experiments, n - sqrt(n) - 1 gives the best performance in general, which is why I set depth = 2 in MLlib algorithms. But it certainly needs more testing. I left `treeReduce` and `treeAggregate` public for easy testing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark tree Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1110.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1110 commit fe42a5e8f5d002d22bd53a4cbcb81607efa10ab1 Author: Xiangrui Meng m...@databricks.com Date: 2014-06-17T08:16:01Z add treeAggregate commit eb71c330973fe3392a08882788553fcba28e7541 Author: Xiangrui Meng m...@databricks.com Date: 2014-06-17T08:40:03Z add treeReduce commit 0f944908cb4b5ce8b91456d103d913bfbf764687 Author: Xiangrui Meng m...@databricks.com Date: 2014-06-17T08:52:20Z add docs commit be6a88a9ddebb26111b2df339f8e2217eec73033 Author: Xiangrui Meng m...@databricks.com Date: 2014-06-17T09:08:46Z use treeAggregate in mllib --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1110#issuecomment-46380502 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1110#issuecomment-46380509 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1110#issuecomment-46382961 Merged build finished. All automated tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][SPARK-2174][MLLIB] treeReduce and treeAg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1110#issuecomment-46382962 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15860/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---