Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22112
  
    Actually, we can extend the solution later; I've mentioned it in my PR description.
    
    Basically there are 3 kinds of closures (see the sketch after this list):
    1. totally random
    2. always output same data set in a random order
    3. always output same data sequence (same order)
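    To make the 3 kinds concrete, here is a minimal sketch (the data and functions below are made up for illustration, they are not from this PR):

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object ClosureKinds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("closure-kinds"))
    val base = sc.parallelize(1 to 12, 3)

    // 1. totally random: retrying the task can produce a completely different data set
    val kind1 = base.map(_ => Random.nextInt(100))

    // 2. same data set, random order: reduceByKey always produces the same (key, sum)
    //    pairs, but the order a downstream task sees them in depends on the order the
    //    shuffle blocks happen to be fetched
    val kind2 = base.map(x => (x % 4, x)).reduceByKey(_ + _)

    // 3. same data sequence: a pure function of the input, same elements, same order
    val kind3 = base.map(_ * 2)

    println(kind1.collect().toList)
    println(kind2.collect().toList)
    println(kind3.collect().toList)
    sc.stop()
  }
}
```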
    
    Spark is able to handle closure 1; the cost is that whenever a fetch failure happens and a map task gets retried, Spark needs to roll back all the succeeding stages and retry them, because their input has changed. `zip` falls into this category, but due to time constraints, I think it's OK to document it and fix it later.
    
    For closure 2, Spark can treat it as closure 3 if the shuffle partitioner is order insensitive, like the range/hash partitioner. This means that when a map task gets retried, it will produce the same data for the reducers, so we don't need to roll back all the succeeding stages. However, if the shuffle partitioner is order sensitive, like round-robin, Spark has to treat it like closure 1 and roll back all the succeeding stages if a map task gets retried.
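    A rough illustration of the order-sensitivity point (again my own sketch, with made-up data):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerSensitivity {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("partitioner-sketch"))

    // A "closure 2" input: the pairs are always the same, but their order can differ per run.
    val pairs = sc.parallelize(1 to 100, 4).repartition(4).map(x => (x % 10, x))

    // Order-insensitive placement: each record goes to hash(key) % numPartitions, so a
    // retried map task sends the same records to the same reducers regardless of the order
    // in which it sees them.
    val byHash = pairs.partitionBy(new HashPartitioner(4))

    // Order-sensitive placement: `repartition` assigns records round-robin by their position
    // in the input iterator, so if the input order changes on retry, a record can land in a
    // different output partition, and the succeeding stages have to be rolled back.
    val roundRobin = pairs.repartition(4)

    println(byHash.glom().map(_.length).collect().toList)
    println(roundRobin.glom().map(_.length).collect().toList)
    sc.stop()
  }
}
```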
    
    Closure 3 is already handled well by the current Spark.
    
    In this PR, I assume all the RDDs' computing functions are closure 3, so that we don't introduce a performance regression. The only exception is the shuffled RDD, which outputs data in a random order because of remote block fetching.
    
    In the future, we can extend `RDD#isIdempotent` into an enum that indicates the 3 closure types, and change the `FetchFailed` handling logic accordingly.
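    A possible shape for that extension, purely as a sketch (none of the names below exist in this PR):

```scala
// Hypothetical sketch of turning the boolean `RDD#isIdempotent` into a 3-valued property.
object ClosureKind extends Enumeration {
  // 1. totally random output
  val Indeterminate = Value
  // 2. same data set, random order
  val Unordered = Value
  // 3. same data sequence (same order)
  val Determinate = Value
}

// The FetchFailed handling could then branch on the parent RDD's kind:
//  - Determinate:   only rerun the lost map tasks; succeeding stages keep their output.
//  - Unordered:     same as Determinate if the shuffle partitioner ignores input order
//                   (hash/range); otherwise treat it as Indeterminate.
//  - Indeterminate: roll back and rerun all the succeeding stages, their input has changed.
```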

