Yes, I think this is another operation that is not deterministic, even
for the same RDD. If a partition is lost and recalculated, the ordering
within that partition can be different. Sorting the RDD makes the
ordering deterministic.
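To make the point concrete, here is a minimal plain-Scala sketch (no Spark required). The two vectors are hypothetical stand-ins for the contents of one partition as computed the first time and again after recalculation; the object and method names are made up for illustration. It assumes the rows are distinct, so that sorting gives a single canonical order.

```scala
object SortBeforeIndexing {
  // Hypothetical contents of one partition: same rows, different arrival order.
  val firstRun: Vector[String] = Vector("b", "a", "c")
  val recomputed: Vector[String] = Vector("c", "b", "a")

  // An order-sensitive step: pair each row with its position in the partition.
  def indexed(rows: Vector[String]): Map[String, Int] = rows.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    // Without sorting, the same rows end up with different positions.
    println(indexed(firstRun) == indexed(recomputed))               // false
    // Sorting first gives both runs the same canonical order.
    println(indexed(firstRun.sorted) == indexed(recomputed.sorted)) // true
  }
}
```

On a real RDD the corresponding step would be something like calling sortBy before the order-sensitive operation, at the cost of an extra shuffle.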
On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
Are there a large number of non-deterministic lineage operators?
This seems like a pretty big caveat, particularly for casual programmers
who expect consistent semantics between Spark and Scala.
E.g., making sure that there's no randomness whatsoever in RDD
transformations seems critical.
+1
Eric Friedman
On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
Are there a large number of non-deterministic lineage operators?
This seems like a pretty big caveat, particularly for casual programmers who
expect consistent semantics between Spark and Scala.
I noticed that repartition will result in non-deterministic lineage because
it changes the order of rows.
So, for instance, if you do something like:
val data = read(...)
val k = data.repartition(5)
val h = k.repartition(5)
It seems that this results in different ordering of rows for 'k'
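A rough plain-Scala sketch of how this can happen, under assumptions: it supposes that repartition hands out the rows of each input partition round-robin, starting from a position drawn from a random generator seeded with the input partition's index, and advancing by one per row in arrival order. The name distribute and the sample rows are hypothetical, and this is a simplified model, not Spark's actual code.

```scala
import scala.util.Random

object RepartitionSketch {
  // Assumed model: the start position depends only on the input partition
  // index, but each row's target also depends on where the row arrives.
  def distribute[T](index: Int, rows: Seq[T], numPartitions: Int): Map[T, Int] = {
    val start = new Random(index).nextInt(numPartitions)
    rows.zipWithIndex.map { case (row, i) =>
      row -> ((start + 1 + i) % numPartitions)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val firstRun = Seq("a", "b", "c")   // hypothetical rows of input partition 0
    val recomputed = Seq("c", "a", "b") // same rows after recalculation, new order

    // Same index and same arrival order: identical assignment.
    println(distribute(0, firstRun, 5) == distribute(0, firstRun, 5))   // true
    // Same index but different arrival order: rows land in other partitions.
    println(distribute(0, firstRun, 5) == distribute(0, recomputed, 5)) // false
  }
}
```

Under this model the seed alone does not pin down the result: the output is reproducible only if the upstream partition also yields its rows in the same order every time.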
IIRC, the random number generator is seeded with the index, so it will always produce
the same result for the same index. Maybe I don't totally follow
though. Could you give a small example of how this might change the
RDD ordering in a way that you don't expect? In general repartition()
will not preserve the