Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sean Owen
Yes, I think this is another operation that is not deterministic even for the same RDD. If a partition is lost and recalculated, the ordering within the partition can be different. Sorting the RDD makes the ordering deterministic. On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung coded...@cs.stanford.edu
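A minimal sketch of the workaround Sean describes (assuming a local SparkContext and the RDD API; the object and value names are illustrative): sort after the shuffle so that the row order, and anything derived from it such as zipWithIndex, stays reproducible even if a partition is lost and recomputed.

import org.apache.spark.{SparkConf, SparkContext}

object DeterministicOrderingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("deterministic-ordering").setMaster("local[4]"))

    val data = sc.parallelize(1 to 100000)

    // repartition() shuffles rows into new partitions; the order in which a
    // reducer fetches shuffle blocks is not fixed, so a recomputed partition
    // may see its rows in a different order.
    val shuffled = data.repartition(5)

    // Sorting after the shuffle pins a total order, so downstream operators
    // that are sensitive to row order behave the same across recomputations.
    val stable = shuffled.sortBy(identity)
    val indexed = stable.zipWithIndex()

    println(indexed.take(3).mkString(", "))
    sc.stop()
  }
}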

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sung Hwan Chung
Are there a large number of non-deterministic lineage operators? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and Scala. E.g., making sure that there's no randomness whatsoever in RDD transformations seems critical.

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Eric Friedman
+1 Eric Friedman On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Are there a large number of non-deterministic lineage operators? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and

coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Sung Hwan Chung
I noticed that repartition results in non-deterministic lineage because it changes the ordering of rows. So for instance, if you do things like:

val data = read(...)
val k = data.repartition(5)
val h = k.repartition(5)

It seems that this results in different ordering of rows for 'k'
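A runnable sketch of the situation described above (the local master and names are illustrative): it prints the first few rows of each partition of 'k'. Nothing in the lineage fixes that order, so on a cluster a recomputed partition of 'k' may see its rows in a different order, and k.repartition(5) would then consume different input.

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionOrderingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-ordering").setMaster("local[4]"))

    val data = sc.parallelize(1 to 100000, numSlices = 8)
    val k = data.repartition(5)

    // Peek at the first three rows of each output partition. This order
    // depends on the order in which shuffle blocks are fetched, so a lost
    // and recomputed partition can produce a different sequence of rows.
    val heads = k.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, it.take(3).toList))
    }.collect()

    heads.sortBy(_._1).foreach { case (idx, rows) =>
      println(s"partition $idx starts with $rows")
    }
    sc.stop()
  }
}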

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Patrick Wendell
IIRC, the random number generator is seeded with the partition index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general repartition() will not preserve the
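A simplified sketch of the assignment Patrick is describing, not the actual Spark source (the function name and signature here are illustrative): each map task seeds a Random with its partition index and hands out target-partition keys round-robin, so the assignment is repeatable for a given input order but sensitive to that order.

import scala.util.Random

// Illustrative only: a stripped-down version of the round-robin key
// assignment behind repartition / coalesce(shuffle = true).
def distributePartition[T](index: Int, items: Iterator[T],
                           numPartitions: Int): Iterator[(Int, T)] = {
  // Same partition index => same seed => same starting offset.
  var position = new Random(index).nextInt(numPartitions)
  items.map { t =>
    position += 1
    // The key an element receives depends on its place in the iterator, so if
    // the upstream partition is recomputed with its rows in a different order,
    // elements can land in different target partitions.
    (position % numPartitions, t)
  }
}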