Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112

Actually we can extend the solution later, and I've mentioned it in my PR description. Basically there are 3 kinds of closures:

1. totally random output
2. always outputs the same data set, but in a random order
3. always outputs the same data sequence (same order)

Spark is able to handle closure 1; the cost is that whenever a fetch failure happens and a map task gets retried, Spark needs to roll back all the succeeding stages and retry them, because their input has changed. `zip` falls into this category, but due to time constraints I think it's OK to document it and fix it later.

For closure 2, Spark can treat it as closure 3 if the shuffle partitioner is order-insensitive, like a range or hash partitioner. This means that when a map task gets retried, it will produce the same data for the reducers, so we don't need to roll back the succeeding stages. However, if the shuffle partitioner is order-sensitive, like round-robin, Spark has to treat it like closure 1 and roll back all the succeeding stages if a map task gets retried.

Closure 3 is already handled well by the current Spark. In this PR, I assume all the RDDs' computing functions are closure 3, so that we don't have a performance regression. The only exception is shuffled RDD, which outputs data in a random order because of remote block fetching. In the future, we can extend `RDD#isIdempotent` into an enum that indicates the 3 closure types, and change the `FetchFailed` handling logic accordingly.
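To make the difference between closure types 2 and 3 concrete, here is a minimal, language-agnostic Python sketch (not Spark code; `attempt1`, `attempt2`, and `hash_partition` are made-up names for illustration). It simulates two attempts of the same map task that emit the same data set in different orders: an order-insensitive hash partitioner sends each element to the same reducer either way, while an order-sensitive operation like `zip` produces different output on retry.

```python
# Simulated outputs of one map task across two attempts: the same
# data set, but emitted in a different order (closure type 2).
attempt1 = [3, 1, 4, 5, 9, 2, 6, 8]
attempt2 = list(reversed(attempt1))

def hash_partition(rows, num_partitions):
    """Assign each element to a partition by hashing it (hypothetical
    helper; mimics an order-insensitive hash partitioner)."""
    parts = [set() for _ in range(num_partitions)]
    for x in rows:
        parts[hash(x) % num_partitions].add(x)
    return parts

# Order-insensitive partitioning: each element lands in the same
# partition regardless of emission order, so every reducer sees the
# same input set on retry and no rollback is needed.
assert hash_partition(attempt1, 3) == hash_partition(attempt2, 3)

# Order-sensitive pairing (like `zip`): the pairing depends on the
# iteration order, so a retried attempt changes the downstream data
# and the succeeding stages would have to be rolled back.
other = list(range(8))
assert list(zip(attempt1, other)) != list(zip(attempt2, other))
```

The same reasoning is why round-robin partitioning cannot get the closure-3 treatment: which partition an element lands in depends on the position it arrives in, so a reordered retry reshuffles elements across reducers.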