GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/22112

    [WIP][SPARK-23243][Core] Fix RDD.repartition() data correctness issue

    ## What changes were proposed in this pull request?
    
    An alternative fix for https://github.com/apache/spark/pull/21698
    
    An RDD can take an arbitrary user function, but we rely on an assumption: the 
function should produce the same data set for the same input, though the output order may vary.
    
    The Spark scheduler must account for this assumption when a fetch failure 
happens; otherwise we may hit the correctness issue described in the JIRA ticket.
    
    Generally speaking, when a map stage gets retried because of a fetch failure, 
and the map stage is not idempotent (it produces the same data set but in a different order 
each time), and the shuffle partitioner is sensitive to the input data 
order (like a round-robin partitioner), we should retry all the reduce tasks.
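    To illustrate why a round-robin partitioner is order-sensitive, here is a 
hypothetical standalone sketch (plain Python, not Spark code; the function name 
and data are made up for illustration). Element i goes to partition 
i % numPartitions, so re-running a non-idempotent map stage that emits the same 
records in a different order silently moves records between reduce partitions:

    ```python
    # Hypothetical sketch of round-robin partitioning (not Spark's implementation).
    # Element i is assigned to partition i % num_partitions, so the assignment
    # depends on the order in which elements arrive, not on their values.
    def round_robin_partition(elements, num_partitions):
        parts = [[] for _ in range(num_partitions)]
        for i, e in enumerate(elements):
            parts[i % num_partitions].append(e)
        return parts

    # Two runs of a map task that produce the same data set in different orders:
    run1 = round_robin_partition([1, 2, 3, 4], 2)
    run2 = round_robin_partition([2, 1, 4, 3], 2)
    # run1 -> [[1, 3], [2, 4]], run2 -> [[2, 4], [1, 3]]
    # Partition 0 holds different records across the two runs. If only some
    # reduce tasks are retried after a fetch failure, records can be lost or
    # duplicated, which is why all reduce tasks must be retried.
    ```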
    
    TODO: document and test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark repartition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22112
    
----
commit 1f9f6e5b020038be1e7c11b9923010465da385aa
Author: Wenchen Fan <wenchen@...>
Date:   2018-08-15T18:38:24Z

    fix repartition+shuffle bug

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
