Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21698

OK, we can treat it as data loss. However, it's not caused by Spark but by the user. If a user calls `zip`, then uses a custom function to compute keys from the zipped pairs, and finally calls `groupByKey`, Spark can guarantee nothing about the result if the RDDs are unsorted. In this case the user should fix their business logic; Spark is not doing anything wrong here. Even if no task ever fails, the user can still get a different result/cardinality when running the same query multiple times.

`repartition` is different, because there is nothing wrong with the user's business logic: they just want to repartition the data, and Spark should not add/remove/update the existing records.

Anyway, if we do want to "fix" the `zip` problem, I think that should be a different topic: we would need to write all the input data somewhere and make sure a retried task gets exactly the same input, which is very expensive and very different from this approach.
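To make the `zip` nondeterminism concrete, here is a minimal, self-contained sketch of the pattern described above. All names (`ZipGroupByKeySketch`, `scrambled`) are hypothetical, and a local per-partition shuffle stands in for an unsorted shuffle (such as the round-robin one behind `repartition`); this is an illustration, not code from the PR:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipGroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("zip-sketch").setMaster("local[4]"))

    // Simulate unsorted partition contents by shuffling each partition
    // locally. In a real job this ordering instability would come from an
    // unsorted shuffle, e.g. repartition's round-robin distribution.
    def scrambled(range: Range): RDD[Int] =
      sc.parallelize(range, 4)
        .mapPartitions(it => scala.util.Random.shuffle(it.toSeq).iterator)

    val left  = scrambled(1 to 1000)
    val right = scrambled(1001 to 2000)

    // zip pairs elements by position within each partition; with unstable
    // ordering, which element of `left` meets which element of `right`
    // can differ between runs, or after a task retry.
    val pairs = left.zip(right)

    // Keys computed from the nondeterministic pairs make the groupByKey
    // output nondeterministic too, even if no task ever fails.
    val groups = pairs
      .map { case (a, b) => ((a * 31 + b) % 7, (a, b)) }
      .groupByKey()

    groups.mapValues(_.size).collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

Running this twice will typically print different group cardinalities, which is the "different result/cardinality across runs" point: the instability is baked into the user-derived keys, not introduced by Spark.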