I noticed that repartition can make lineage non-deterministic, because it changes the order of rows.
So for instance, if you do something like:

    val data = read(...)
    val k = data.repartition(5)
    val h = k.repartition(5)

It seems that 'k' gets a different row ordering each time it is computed. Because of this, 'h' can even end up with different rows in each partition, since 'repartition' distributes rows using a random number generator with the 'index' as the key.
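To make the concern concrete, here is a minimal plain-Scala sketch (no Spark dependency) of the effect described above. It is a simplified model and not Spark's actual code: the `distribute` helper, its seeding by partition index, and the round-robin assignment are assumptions made for illustration. It only shows that when placement depends on row order, the same rows in a different order land in different target partitions.

```scala
import scala.util.Random

object RepartitionOrderSketch {
  // Hypothetical model: pick a starting target from an RNG seeded with the
  // input partition index, then assign rows round-robin from that position.
  def distribute[T](partitionIndex: Int, rows: Seq[T], numPartitions: Int): Map[Int, Seq[T]] = {
    val start = new Random(partitionIndex).nextInt(numPartitions)
    rows.zipWithIndex
      .map { case (row, i) => ((start + i) % numPartitions, row) }
      .groupBy(_._1)
      .map { case (target, pairs) => target -> pairs.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a", "b", "c", "d", "e")
    // Same input partition index, same rows, but in a different order.
    val first = distribute(0, rows, 5)
    val second = distribute(0, rows.reverse, 5)
    println(first)
    println(second)
    println(first == second) // false: placement follows the row order
  }
}
```

In this model the seed is fixed, so the same row order always gives the same placement; only a change in row order changes the result.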