Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sean Owen

Yes, I think this is another operation that is not deterministic even for
the same RDD. If a partition is lost and recalculated, the ordering within
the partition can be different. Sorting the RDD makes the ordering
deterministic.
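
To illustrate the point about sorting, here is a minimal sketch in plain
Scala (no Spark dependency). The object name, the helper tagWithPosition, and
the sample rows are made up for illustration; the two sequences stand in for
the same partition seen in two different arrival orders, for example before
and after a recomputation.

```scala
object SortBeforeOrderDependentOp {
  // An order-dependent operation: tag each row with its position.
  def tagWithPosition(rows: Seq[String]): Seq[(String, Int)] = rows.zipWithIndex

  def main(args: Array[String]): Unit = {
    // Same contents, two arrival orders (hypothetical sample data).
    val firstRun  = Seq("A", "C", "D")
    val recompute = Seq("C", "A", "D")

    // Without sorting, the two runs disagree.
    println(tagWithPosition(firstRun) == tagWithPosition(recompute)) // false

    // Sorting first makes the result independent of arrival order.
    println(tagWithPosition(firstRun.sorted) == tagWithPosition(recompute.sorted)) // true
  }
}
```

The same idea would apply to an RDD: sort (or otherwise impose a total
order) before any step whose result depends on the order of rows.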

On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
 Let's say you have some rows in a dataset (say X partitions initially).

 A
 B
 C
 D
 E
 .
 .
 .
 .


 You repartition to Y  X, then it seems that any of the following could be
 valid:

 partition 1   partition 2
 A             B
 C             E
 D             .
 --
 C             E
 A             B
 D             .
 --
 D             B
 C             E
 A

 etc. etc.

 I.e., although each partition will have the same unordered set, the rows'
 orders will change from call to call.

 Now, because row ordering can change from call to call, any operation that
 depends on the order in which rows are seen makes the lineage
 non-deterministic. For example, it seems that the repartition call itself
 is row-order dependent, because it creates a random number generator
 with the partition index as the seed, and then calls nextInt as it goes
 through the rows.


 On Wed, Oct 8, 2014 at 10:14 PM, Patrick Wendell pwend...@gmail.com wrote:

 IIRC - the random is seeded with the index, so it will always produce
 the same result for the same index. Maybe I don't totally follow
 though. Could you give a small example of how this might change the
 RDD ordering in a way that you don't expect? In general repartition()
 will not preserve the ordering of an RDD.

 On Wed, Oct 8, 2014 at 3:42 PM, Sung Hwan Chung
 coded...@cs.stanford.edu wrote:
  I noticed that repartition results in non-deterministic lineage because
  it changes the order of rows.
 
  So for instance, if you do things like:
 
  val data = read(...)
  val k = data.repartition(5)
  val h = k.repartition(5)
 
  It seems that this results in a different ordering of rows for 'k' each
  time you call it.
  Because of this different ordering, 'h' can even end up with different
  partitions, because 'repartition' distributes rows through a random number
  generator seeded with the 'index'.



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sung Hwan Chung

Are there a large number of non-deterministic lineage operators?

This seems like a pretty big caveat, particularly for casual programmers
who expect consistent semantics between Spark and Scala.

E.g., making sure that there's no randomness whatsoever in RDD
transformations seems critical. Additionally, shuffling operators
usually change the order of rows, etc.

These are very easy errors to make, and if you tend to cache things, some
errors won't be detected until fault-tolerance is triggered. It would be
very helpful for programmers to have a big warning list of not-to-dos
within RDD transformations.
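
As one possible entry for such a list, here is a hedged sketch in plain Scala
of an assignment that depends only on a row's content, not on the order in
which rows are seen or on a random number generator. The object and function
names are hypothetical, and this is not what repartition does; it is closer
in spirit to hash partitioning by key.

```scala
object ContentBasedAssignment {
  // The target depends only on the row itself. floorMod keeps the result
  // non-negative even when hashCode is negative.
  def target(row: String, numPartitions: Int): Int =
    Math.floorMod(row.hashCode, numPartitions)

  def assign(rows: Seq[String], numPartitions: Int): Map[String, Int] =
    rows.map(row => row -> target(row, numPartitions)).toMap

  def main(args: Array[String]): Unit = {
    // Same rows in two different orders (hypothetical sample data).
    val firstRun  = assign(Seq("A", "C", "D"), 2)
    val recompute = assign(Seq("C", "A", "D"), 2)

    // The mapping from row to target is identical in both runs.
    println(firstRun == recompute) // true
  }
}
```

The trade-off is that content-based assignment can be less evenly balanced
than a random or round-robin spread when many rows hash to the same target.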


Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Eric Friedman

+1


Eric Friedman


Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Patrick Wendell

IIRC - the random is seeded with the index, so it will always produce
the same result for the same index. Maybe I don't totally follow
though. Could you give a small example of how this might change the
RDD ordering in a way that you don't expect? In general repartition()
will not preserve the ordering of an RDD.
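
A small sketch in plain Scala of the behaviour under discussion. It is a
simplified model based only on the description in this thread (an RNG seeded
with the partition index, one nextInt call per row), not Spark's actual
implementation; the object name, assign, and the sample rows are made up.

```scala
import scala.util.Random

object SeededAssignment {
  // Simplified model (an assumption, not Spark's code): seed an RNG with
  // the partition index and draw one target partition per row, in the
  // order the rows are iterated.
  def assign(index: Int, rows: Seq[String], numPartitions: Int): Seq[(String, Int)] = {
    val rng = new Random(index)
    rows.map(row => (row, rng.nextInt(numPartitions)))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("A", "B", "C", "D", "E")
    val firstRun  = assign(0, rows, 5)
    val recompute = assign(0, rows.reverse, 5)

    // Same seed, same sequence of drawn targets: this part is deterministic.
    println(firstRun.map(_._2) == recompute.map(_._2)) // true

    // The i-th draw goes to whichever row is iterated i-th, so a row's
    // target can change when the iteration order changes.
    val before = firstRun.toMap
    val moved = recompute.filter { case (row, target) => before(row) != target }
    println(s"rows whose target changed: ${moved.map(_._1).mkString(", ")}")
  }
}
```

So the seed fixes the sequence of draws for a given index, but which row
receives which draw still depends on the order in which rows arrive.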

