GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/2963
add foldLeftByKey to PairRDDFunctions for reduce algorithms that by key need to process values in a particular order

See: https://issues.apache.org/jira/browse/SPARK-3655

This is the second of two competing pull requests that try to address this issue. This one does so without making changes to core Spark sorting routines. It is based on this suggestion by Patrick Wendell:

1. Map your RDD[(K, V)] to an RDD[((K, V), null)].
2. Write a custom partitioner that partitions based only on the K component of the key.
3. Call repartitionAndSortWithinPartitions with your custom partitioner.
4. Map the RDD back into RDD[(K, V)].

The downsides of this approach are that (1) a little more data goes through the shuffle (one extra object per row), though I am not sure if this matters at all, and (2) the sorting by value is not generalized. The upside is that it is a much simpler and more self-contained change than the other pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata/spark feat-foldleft-pullreq

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2963.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2963

----

commit 4aa5acf7bb631c7e68190055a445a890a822a0ae
Author: Koert Kuipers <ko...@tresata.com>
Date: 2014-10-27T20:33:47Z

    add foldLeftByKey to PairRDDFunctions for reduce algorithms that by key need to process values in a particular order
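The four steps above can be sketched in plain Scala. This is a local simulation of the semantics only, not the actual Spark API (the real change lives in PairRDDFunctions and uses repartitionAndSortWithinPartitions on a cluster); the object and method names below (FoldLeftByKeySketch, partitionFor, foldLeftByKey) are hypothetical illustrations:

```scala
// Sketch (assumption: a local simulation, not the proposed Spark code) of the
// secondary-sort idea behind foldLeftByKey: use the composite (K, V) as the
// sort key, partition on K only, then fold each key's values in sorted order.
object FoldLeftByKeySketch {

  // Steps 1-2: the custom partitioner would hash only the K component of the
  // composite ((K, V), null) key, so all values of a key land together.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // keep the result non-negative
  }

  // Steps 3-4 simulated on a local Seq: sort by the composite (K, V) key
  // (what repartitionAndSortWithinPartitions achieves per partition), then
  // foldLeft over each key's values, which are now in value order.
  def foldLeftByKey[K: Ordering, V: Ordering, B](
      records: Seq[(K, V)], zero: B)(f: (B, V) => B): Map[K, B] = {
    records
      .sortBy(identity)  // composite sort: by K first, then by V
      .groupBy(_._1)     // values of each key are already in sorted order
      .map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zero)(f) }
  }
}
```

For example, folding `Seq(("a", 2), ("a", 1), ("b", 3))` with string concatenation processes key "a"'s values as 1 then 2, regardless of input order, which is the ordering guarantee the pull request is after.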