Re: coalesce with shuffle or repartition is not necessarily fault-tolerant
Yes, I think this is another operation that is not deterministic even for the same RDD. If a partition is lost and recalculated, the ordering within the partition can be different. Sorting the RDD makes the ordering deterministic.

On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
> Let's say you have some rows in a dataset (say X partitions initially):
>
>     A B C D E . . . .
>
> You repartition from X to Y partitions; then it seems that any of the following could be valid:
>
>     partition 1    partition 2
>     A              B
>     C              E
>     D              .
>     --
>     C              E
>     A              B
>     D              .
>     --
>     D              B
>     C              E
>     A
>     etc.           etc.
>
> I.e., although each partition will have the same unordered set, the order of its rows will change from call to call.
>
> Now, because row ordering can change from call to call, if you do any operation that depends on the order of the items you saw, then the lineage is no longer deterministic. For example, it seems that the repartition call itself is row-order dependent, because it creates a random number generator with the partition index as the seed, and then calls nextInt as it goes through the rows.
>
> On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell pwend...@gmail.com wrote:
>> IIRC, the random number generator is seeded with the index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general, repartition() will not preserve the ordering of an RDD.
>>
>> On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
>>> I noticed that repartition will result in non-deterministic lineage because it changes the order of rows. For instance, if you do something like:
>>>
>>>     val data = read(...)
>>>     val k = data.repartition(5)
>>>     val h = k.repartition(5)
>>>
>>> it seems that this results in a different ordering of rows for 'k' each time you call it. And because of this different ordering, 'h' will even end up with different partitions, because 'repartition' distributes rows through a random number generator with the 'index' as the seed.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
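To illustrate the remark that sorting makes the ordering deterministic, here is a minimal sketch in plain Scala (no Spark involved). The object name and the `tagWithPosition` helper are hypothetical stand-ins for any step whose result depends on the order in which a partition's rows are seen; the two sequences stand for the same partition as first computed and as recomputed after a loss. It assumes the rows have a total order with no ties that matter.

```scala
object SortBeforeOrderDependentOp {
  // Hypothetical order-dependent step: tag each row with its position
  // within the partition.
  def tagWithPosition(rows: Seq[String]): Seq[(String, Int)] =
    rows.zipWithIndex

  def main(args: Array[String]): Unit = {
    val firstRun   = Seq("A", "B", "C")
    val recomputed = Seq("C", "A", "B") // same rows, different arrival order

    // Without sorting, the same row gets a different tag on recomputation.
    println(tagWithPosition(firstRun) == tagWithPosition(recomputed)) // false

    // Sorting first removes the dependence on arrival order.
    println(tagWithPosition(firstRun.sorted) == tagWithPosition(recomputed.sorted)) // true
  }
}
```

The sort is what pins the order down here; whether its cost is acceptable on a real RDD depends on the job.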
Re: coalesce with shuffle or repartition is not necessarily fault-tolerant
Are there a large number of operators with non-deterministic lineage? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and Scala.

E.g., making sure that there's no randomness whatsoever in RDD transformations seems critical. Additionally, shuffling operators would usually change the order of rows, etc.

These are very easy errors to make, and if you tend to cache things, some errors won't be detected until fault-tolerance is triggered.

It would be very helpful for programmers to have a big warning list of not-to-dos within RDD transformations.

On Wed, Oct 8, 2014 at 11:57 PM, Sean Owen so...@cloudera.com wrote:
> Yes, I think this is another operation that is not deterministic even for the same RDD. If a partition is lost and recalculated, the ordering within the partition can be different. Sorting the RDD makes the ordering deterministic.
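As one way to picture the "no randomness in transformations" concern, here is a small plain-Scala sketch (no Spark). The `addWeights` function, the job seed of 42 and the object name are all hypothetical; the idea is only that a generator seeded from a fixed value plus the partition index reproduces the same values when the same partition is recomputed, provided the rows arrive in the same order. An unseeded generator would offer no such guarantee.

```scala
import scala.util.Random

object SeededTransformation {
  // Hypothetical transformation: attach a random weight to every row of a
  // partition. The generator is seeded from a fixed job seed plus the
  // partition index, so recomputing the same partition with its rows in the
  // same order reproduces the same weights.
  def addWeights(jobSeed: Long, partitionIndex: Int, rows: Seq[String]): Seq[(String, Double)] = {
    val rng = new Random(jobSeed + partitionIndex)
    rows.map(row => (row, rng.nextDouble()))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("A", "B", "C")
    val original   = addWeights(42L, 0, rows)
    val recomputed = addWeights(42L, 0, rows) // simulates recovery of a lost partition
    println(original == recomputed) // true
  }
}
```

Note that the seed alone is not enough: if the rows come back in a different order, each row is paired with a different draw, which is the problem discussed in this thread.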
Re: coalesce with shuffle or repartition is not necessarily fault-tolerant
+1

Eric Friedman

On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
> Are there a large number of operators with non-deterministic lineage? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and Scala.
>
> E.g., making sure that there's no randomness whatsoever in RDD transformations seems critical. Additionally, shuffling operators would usually change the order of rows, etc.
>
> These are very easy errors to make, and if you tend to cache things, some errors won't be detected until fault-tolerance is triggered.
>
> It would be very helpful for programmers to have a big warning list of not-to-dos within RDD transformations.
coalesce with shuffle or repartition is not necessarily fault-tolerant
I noticed that repartition will result in non-deterministic lineage because it changes the order of rows. For instance, if you do something like:

    val data = read(...)
    val k = data.repartition(5)
    val h = k.repartition(5)

it seems that this results in a different ordering of rows for 'k' each time you call it. And because of this different ordering, 'h' will even end up with different partitions, because 'repartition' distributes rows through a random number generator with the 'index' as the seed.
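The mechanism described above can be sketched in plain Scala. This is a simplified model of the description in this message, not Spark's actual source: the `assign` function and the object name are hypothetical, and the model simply draws a target partition for each row, in iteration order, from a generator seeded with the input partition's index.

```scala
import scala.util.Random

object RepartitionModel {
  // Simplified model (an assumption, not Spark's code): a generator seeded
  // with the input partition's index picks a target partition for each row,
  // in the order the rows are iterated.
  def assign(index: Int, rows: Seq[String], numPartitions: Int): Seq[(String, Int)] = {
    val rng = new Random(index)
    rows.map(row => (row, rng.nextInt(numPartitions)))
  }

  def main(args: Array[String]): Unit = {
    val firstRun   = Seq("A", "B", "C", "D", "E")
    val recomputed = firstRun.reverse // same rows, different iteration order

    // The sequence of targets depends only on the seed and the position...
    println(assign(0, firstRun, 5).map(_._2) == assign(0, recomputed, 5).map(_._2)) // true

    // ...so which row lands in which target depends on the iteration order.
    println(assign(0, firstRun, 5))
    println(assign(0, recomputed, 5))
  }
}
```

In this model the seed fixes the sequence of targets, but the row that receives each target is whichever row happens to be at that position, so a reordered input can send the same row to a different partition.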
Re: coalesce with shuffle or repartition is not necessarily fault-tolerant
IIRC, the random number generator is seeded with the index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general, repartition() will not preserve the ordering of an RDD.

On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
> I noticed that repartition will result in non-deterministic lineage because it changes the order of rows. For instance, if you do something like:
>
>     val data = read(...)
>     val k = data.repartition(5)
>     val h = k.repartition(5)
>
> it seems that this results in a different ordering of rows for 'k' each time you call it. And because of this different ordering, 'h' will even end up with different partitions, because 'repartition' distributes rows through a random number generator with the 'index' as the seed.