GitHub user coderxiang opened a pull request: https://github.com/apache/incubator-spark/pull/635
allCollect functions for RDD

Two methods (`allCollect`, `allCollectBroadcast`) are added to `RDD[T]`. Each returns a new `RDD[Array[T]]` in which every partition contains all of the records of the original RDD, stored in a single `Array[T]` instance (the same content that `RDD.collect` returns). This functionality can be useful in machine learning tasks that require sharing updated parameters across partitions.

Method `allCollect` creates a new `AllCollectedRDD`, while method `allCollectBroadcast` uses a broadcast variable. Both need to collect the data first and should therefore deliver similar performance.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/coderxiang/incubator-spark allCollect

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/635.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #635

----
commit 74e7817d73bf635f9d27149d609b4da1c1e78f51
Author: lebesgue <lebes...@lebesgue.net>
Date:   2014-02-21T03:46:30Z

    add a simple implementation of allCollect using AllCollectedRDD

commit b619cbd2642502a80f53bf6f4b753301cb157956
Author: lebesgue <lebes...@lebesgue.net>
Date:   2014-02-21T03:51:07Z

    code reorganization

commit f727cd936dfad513238d8a127fdc15507a4025b0
Author: lebesgue <lebes...@lebesgue.net>
Date:   2014-02-21T06:21:09Z

    add the implementation of allCollect using a broadcast variable

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
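
For readers who want a concrete picture of the broadcast-based variant described in this pull request, here is a minimal sketch. It is not the code from the patch: the object name `AllCollectSketch` and the standalone helper signature are assumptions made for illustration, and the real change adds the methods to `RDD[T]` itself. The sketch uses only the existing `collect`, `broadcast`, and `mapPartitions` calls and assumes Spark is on the classpath.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AllCollectSketch {
  // Hypothetical standalone helper; the pull request adds a method on RDD[T] instead.
  def allCollectBroadcast[T: ClassTag](rdd: RDD[T]): RDD[Array[T]] = {
    // Gather every record on the driver, then broadcast one read-only copy.
    val all = rdd.sparkContext.broadcast(rdd.collect())
    // Ignore each partition's own records and emit the full array as its only element.
    rdd.mapPartitions(_ => Iterator.single(all.value))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("allCollect-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 6, numSlices = 3)
      val gathered = allCollectBroadcast(data).collect()
      // One array per original partition, each holding all six records.
      assert(gathered.length == 3)
      assert(gathered.forall(_.sameElements(1 to 6)))
    } finally {
      sc.stop()
    }
  }
}
```

Because the whole data set passes through the driver in `collect()`, a sketch like this only suits data that fits in driver and executor memory, such as a small parameter vector.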