GitHub user koertkuipers opened a pull request:

    https://github.com/apache/spark/pull/2963

    add foldLeftByKey to PairRDDFunctions for reduce algorithms that by key ...

    ...need to process values in a particular order
    
    see:
    https://issues.apache.org/jira/browse/SPARK-3655
    
    this is the second of 2 competing pull requests that try to address this
    issue. this one does so without making changes to core spark sorting
    routines. it is based on this suggestion by patrick wendell:
    1. Map your RDD[(K, V)] to an RDD[((K, V), null)]
    2. Write a custom partitioner that partitions based only on the K
       component of the key.
    3. Call repartitionAndSortWithinPartitions with your custom partitioner
    4. Map the RDD back into RDD[(K, V)]
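
The four steps above can be sketched roughly as follows. This is only an illustration of the suggested technique, not the code in this pull request: the names `KeyOnlyPartitioner`, `FoldLeftByKeySketch`, and the signature given to `foldLeftByKey` here are hypothetical, and the sketch assumes a Spark version that provides `repartitionAndSortWithinPartitions`, an immutable `zero` value, and an `Ordering[K]` that is consistent with `equals`.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit RDD conversions on older Spark 1.x; harmless otherwise
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical partitioner: routes a composite (K, V) key by its K component only,
// so every value for a given K ends up in the same partition.
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0, "number of partitions must be positive")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = key match {
    case (k, _) =>
      val h = if (k == null) 0 else k.hashCode % partitions
      if (h < 0) h + partitions else h // keep the result non-negative
    case other =>
      throw new IllegalArgumentException(s"expected a (key, value) pair, got: $other")
  }
}

object FoldLeftByKeySketch {

  // Hypothetical signature; assumes `zero` is immutable and serializable.
  def foldLeftByKey[K: Ordering: ClassTag, V: Ordering: ClassTag, U: ClassTag](
      rdd: RDD[(K, V)], zero: U, numPartitions: Int)(f: (U, V) => U): RDD[(K, U)] = {

    // Step 1: move the value into the key, with null as a placeholder value.
    val keyed: RDD[((K, V), Null)] = rdd.map(kv => (kv, null))

    // Steps 2 and 3: partition on K only, sort each partition by the (K, V) tuple ordering.
    val sorted = keyed.repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(numPartitions))

    // Step 4, combined with the fold: walk each sorted partition once and
    // fold consecutive rows that share the same key.
    sorted.mapPartitions { iter =>
      val rows = iter.map(_._1).buffered
      new Iterator[(K, U)] {
        override def hasNext: Boolean = rows.hasNext
        override def next(): (K, U) = {
          val key = rows.head._1
          var acc = zero
          while (rows.hasNext && rows.head._1 == key) {
            acc = f(acc, rows.next()._2)
          }
          (key, acc)
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("foldLeftByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val events = sc.parallelize(Seq(("a", 3), ("b", 1), ("a", 1), ("a", 2), ("b", 2)))
      // Concatenate each key's values in ascending value order.
      val result = foldLeftByKey(events, "", 2)((acc, v) => acc + v).collect().toMap
      assert(result == Map("a" -> "123", "b" -> "12"))
    } finally {
      sc.stop()
    }
  }
}
```

Because each partition is consumed as a sorted stream, the values for a key never need to be held in memory together, which is the point of sorting in the shuffle rather than after grouping.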
    
    the downsides of this approach are:
    1) a little more data goes through the shuffle (one extra object per
       row); i am not sure whether this matters at all
    2) the sorting by value is not generalized
    
    the upside is that it's a much simpler and more self-contained change
    than the other pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata/spark feat-foldleft-pullreq

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2963
    
----
commit 4aa5acf7bb631c7e68190055a445a890a822a0ae
Author: Koert Kuipers <ko...@tresata.com>
Date:   2014-10-27T20:33:47Z

    add foldLeftByKey to PairRDDFunctions for reduce algorithms that by key
    need to process values in a particular order

----



