Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/21698

I did not go over the PR itself in detail, but the proposal sounds very expensive, particularly given the cascading costs involved.

Also, I am not sure why we are special-casing only coalesce/repartition here: any closure that depends on the ordering of tuples is bound to fail. For example, the RDD.zip* variants, sampling in ML, etc. will suffer from the same issue.

Either we fix shuffle itself to become deterministic (which I am not sure we can do efficiently), or we could simply document this issue for coalesce and the other relevant APIs, so that users do a sort when applicable, that is, when they deem the additional cost worth bearing.

Note that in a lot of cases this is not an issue, for example when reading from external data stores, checkpointed data, persisted data, etc., which are typically the reasons coalesce gets used a lot (to minimize the number of partitions).
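To make the ordering concern concrete, here is a minimal sketch in plain Python, not Spark code. The `round_robin` helper and the two "attempt" lists are hypothetical stand-ins for an order-dependent partitioner and for a shuffle read that returns the same records in a different order on retry. It only illustrates why a sort before the order-dependent step makes the result repeatable.

```python
def round_robin(records, num_partitions):
    """Hypothetical order-dependent partitioner: record i goes to partition i % n."""
    partitions = [[] for _ in range(num_partitions)]
    for position, record in enumerate(records):
        partitions[position % num_partitions].append(record)
    return partitions


# Assumption for illustration: two attempts of the same task fetch the same
# records, but the shuffle delivers them in a different order each time.
attempt1 = [3, 1, 4, 0, 2]
attempt2 = [0, 2, 1, 4, 3]

# Without a sort, the same record can land in a different partition on retry.
unsorted1 = round_robin(attempt1, 2)
unsorted2 = round_robin(attempt2, 2)

# Sorting first makes the assignment depend on the data alone, at the cost
# of the extra sort.
sorted1 = round_robin(sorted(attempt1), 2)
sorted2 = round_robin(sorted(attempt2), 2)

print(unsorted1 == unsorted2)  # the two unsorted attempts disagree
print(sorted1 == sorted2)      # the two sorted attempts agree
```

The same reasoning applies to any order-dependent closure, such as zipping two collections by position: the sort is what restores repeatability, and it is only worth paying for when the input order can actually vary between attempts.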