GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22112
[WIP][SPARK-23243][Core] Fix RDD.repartition() data correctness issue

## What changes were proposed in this pull request?

An alternative fix for https://github.com/apache/spark/pull/21698

An RDD can take arbitrary user functions, but we make an assumption: a function should produce the same data set for the same input, though the order may vary. The Spark scheduler must take this assumption into account when a fetch failure happens, otherwise we may hit the correctness issue described in the JIRA ticket.

Generally speaking, when a map stage is retried because of a fetch failure, and that map stage is not idempotent (it produces the same data set but in a different order each time), and the shuffle partitioner is sensitive to input order (like a round-robin partitioner), we should retry all of the reduce tasks. A sketch of why the round-robin case is order-sensitive follows the commit listing below.

TODO: document and test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark repartition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22112.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22112

----

commit 1f9f6e5b020038be1e7c11b9923010465da385aa
Author: Wenchen Fan <wenchen@...>
Date: 2018-08-15T18:38:24Z

    fix repartition+shuffle bug

----
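To make the order-sensitivity concrete, here is a minimal, self-contained Scala sketch, not Spark's actual partitioner code (Spark's round-robin in `repartition()` also adds a random starting offset per partition, which this simplification omits). The helper `roundRobinAssign` is hypothetical and exists only for illustration; the point is that the partition a record lands in depends solely on the record's position in the input, so the same data set in a different order is partitioned differently.

```scala
// Hypothetical simplification of round-robin partitioning: a record's
// partition is determined only by its position in the input sequence.
def roundRobinAssign[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
  records.zipWithIndex
    .groupBy { case (_, i) => i % numPartitions } // partition = position mod N
    .map { case (p, rs) => p -> rs.map(_._1) }

// First run of the map stage emits records in one order:
val runA = roundRobinAssign(Seq("a", "b", "c", "d"), numPartitions = 2)
// runA: Map(0 -> Seq(a, c), 1 -> Seq(b, d))

// A retry emits the same data set in a different order:
val runB = roundRobinAssign(Seq("b", "a", "d", "c"), numPartitions = 2)
// runB: Map(0 -> Seq(b, d), 1 -> Seq(a, c))
```

Partition 0 now holds different records than before. If only some reduce tasks are rerun after the fetch failure, reducers that keep their old output and reducers that re-fetch from the retried map stage see inconsistent partitionings, so records can be duplicated or lost; retrying all reduce tasks avoids mixing the two runs.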