Let's say you have some rows in a dataset (say X partitions initially). A B C D E . . . .
You repartition to Y > X, then it seems that any of the following could be valid: partition 1 partition 2 ........................ A B ........................ C E D . ........................ -------------------------- C E A B D . -------------------------- D B C E A etc. etc. I.e., although each partition will have the same unordered set, the rows' orders will change from call to call. Now, because row ordering can change from call to call, if you do any operation that depends on the order of items you saw, then lineage is no longer deterministic. For example, it seems that the repartition call itself is a row-order dependent call, because it creates a random number generator with the partition index as the seed, and then call nextInt as you go through the rows. On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell <pwend...@gmail.com> wrote: > IIRC - the random is seeded with the index, so it will always produce > the same result for the same index. Maybe I don't totally follow > though. Could you give a small example of how this might change the > RDD ordering in a way that you don't expect? In general repartition() > will not preserve the ordering of an RDD. > > On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung > <coded...@cs.stanford.edu> wrote: > > I noticed that repartition will result in non-deterministic lineage > because > > it'll result in changed orders for rows. > > > > So for instance, if you do things like: > > > > val data = read(...) > > val k = data.repartition(5) > > val h = k.repartition(5) > > > > It seems that this results in different ordering of rows for 'k' each > time > > you call it. > > And because of this different ordering, 'h' will result in different > > partitions even, because 'repartition' distributes through a random > number > > generator with the 'index' as the key. >