Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22112

@tgravescs:

> The shuffle simply transfers the bytes its supposed to. Sparks shuffle of those bytes is not consistent in that the order it fetches from can change and without the sort happening on that data the order can be different on rerun. I guess maybe you mean the ShuffledRDD as a whole or do you mean something else here?

By shuffle, I am referring to the output of the shuffle that is consumed by an RDD with a `ShuffleDependency` as input. More specifically, the output of `SparkEnv.get.shuffleManager.getReader(...).read()`, which RDDs (both user and Spark implementations) use to fetch the output of the shuffle machinery. This output is not just the deserialized shuffle bytes: it also has aggregation applied (if specified) and ordering imposed (if specified). `ShuffledRDD` is one such usage within Spark core, but others exist both within Spark core and in user code.

> All I'm saying is zip is just another variant of this, you could document it as such and do nothing internal to spark to "fix it".

I agree; repartition + shuffle, zip, sample, and the MLlib usages are all variants of the same problem: shuffle output order being inconsistent.

> I guess we can separate out these 2 discussions. I think the point of this pr is to temporarily workaround the data loss/corruption issue with repartition by failing. So if everyone agrees on that lets move the discussion to a jira about what to do with the rest of the operators and fix repartition here. thoughts?

Sounds good to me.
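To make the order-sensitivity concrete, here is a toy sketch (plain Scala, not Spark internals): round-robin repartitioning assigns each record a target partition based on its position in the fetched stream, so if the shuffle fetch order changes between runs, the same record can land in a different output partition. Recomputing only a subset of partitions after a fetch failure can then lose or duplicate records relative to the first run. The function and data below are hypothetical, for illustration only.

```scala
// Toy model of position-based (round-robin) partitioning: a record's target
// partition depends on its position in the input stream, not on its value.
def roundRobinPartition(records: Seq[Int], numPartitions: Int): Map[Int, Seq[Int]] =
  records.zipWithIndex
    .groupBy { case (_, idx) => idx % numPartitions }
    .map { case (p, recs) => p -> recs.map(_._1) }

// The same multiset of shuffle output records, fetched in two different orders
// (as can happen when block fetch order varies across reruns).
val fetchOrderA = Seq(1, 2, 3, 4, 5, 6)
val fetchOrderB = Seq(2, 1, 3, 4, 6, 5)

val runA = roundRobinPartition(fetchOrderA, 2)
val runB = roundRobinPartition(fetchOrderB, 2)

// Both runs partition the same records overall, yet partition 0 holds
// different records in each run. If only partition 0 were recomputed after a
// failure, the combined result would not match either original run.
```

Sorting the fetched data before partitioning (as the proposed fix for repartition does) removes this position dependence, which is why the sort makes the output deterministic.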