[GitHub] incubator-spark pull request: allCollect functions for RDD

coderxiang Sat, 22 Feb 2014 21:31:24 -0800

GitHub user coderxiang opened a pull request:

    https://github.com/apache/incubator-spark/pull/635


    allCollect functions for RDD

    This patch adds two methods, `allCollect` and `allCollectBroadcast`, to
    `RDD[T]`. Each returns a new `RDD[Array[T]]` in which every partition holds a
    single `Array[T]` containing all of the records of the original RDD (the same
    array that `RDD.collect` returns). This can be useful in machine learning
    tasks that need to share updated parameters across partitions.
    
    Method `allCollect` creates a new `AllCollectedRDD`, while method
    `allCollectBroadcast` uses a broadcast variable. Both must first collect the
    data, so they should deliver similar performance.
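
For illustration only, the broadcast-based variant could look roughly like the sketch below. It is not the code from this patch: the object `AllCollectSketch` and the helper `allCollectViaBroadcast` are hypothetical names, and the sketch assumes a Spark build where `SparkContext.broadcast`, `RDD.collect` and `RDD.mapPartitions` are available and a local master can be used.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AllCollectSketch {
  // Hypothetical helper (not the PR's implementation): collect all records,
  // broadcast the resulting array once, then emit that array as the single
  // element of every partition of the original RDD.
  def allCollectViaBroadcast[T: ClassTag](rdd: RDD[T]): RDD[Array[T]] = {
    val all = rdd.sparkContext.broadcast(rdd.collect())
    rdd.mapPartitions(_ => Iterator.single(all.value))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("allCollect-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 6, 3)
      val shared = allCollectViaBroadcast(rdd)
      // One array per partition; each array holds every record of `rdd`.
      shared.collect().foreach(a => println(a.mkString(",")))
    } finally {
      sc.stop()
    }
  }
}
```

The result keeps the partition count of the input RDD, so a later per-partition step can read the full data set locally from the array in its own partition.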

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/coderxiang/incubator-spark allCollect

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/635.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #635
    
----
commit 74e7817d73bf635f9d27149d609b4da1c1e78f51
Author: lebesgue <[email protected]>
Date:   2014-02-21T03:46:30Z

    add a simple implementation of allCollect using AllCollectedRDD

commit b619cbd2642502a80f53bf6f4b753301cb157956
Author: lebesgue <[email protected]>
Date:   2014-02-21T03:51:07Z

    code reorganization

commit f727cd936dfad513238d8a127fdc15507a4025b0
Author: lebesgue <[email protected]>
Date:   2014-02-21T06:21:09Z

    add the implementation of allCollect using a broadcast variable

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
[email protected] or file a JIRA ticket with INFRA.
---
