Re: random shuffle streaming RDDs?

2014-11-03 Thread Sean Owen
I think the answer will be the same in streaming as in the core. You
want a random permutation of an RDD? In general, RDDs don't have an
ordering at all -- except when you sort them, for example -- so a
permutation doesn't make sense. Do you just want a well-defined but
random ordering of the data? Or do you just want to (re-)assign elements
randomly to partitions?

On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd...@gmail.com wrote:
 Hi,

 Is there a nice or optimal method to randomly shuffle spark streaming RDDs?

 Thanks,
 Josh

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
When I'm writing the RDDs to an external system, I would like them to be
written out in a randomly shuffled order, so that even the order is random.
So far, what I understood is that the RDDs do have a type of order: for
Spark Streaming RDDs it would be the order in which Spark Streaming read the
tuples from the source (e.g. roughly the order in which the producer sent
the tuples, plus any latency).




Re: random shuffle streaming RDDs?

2014-11-03 Thread Jay Vyas
A use case would be helpful.

Batches of RDDs from streams are going to have a temporal ordering, in terms
of when they are processed in a typical application, but maybe you could
shuffle the way batch iterations work.



Re: random shuffle streaming RDDs?

2014-11-03 Thread Sean Owen
If you iterated over an RDD's partitions, I'm not sure that in
practice you would find the order matches the order in which the
elements were received. The receiver is replicating data to another
node or nodes as it goes, and I don't know how much is guaranteed
about that.

If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (Haven't thought this through
much but sounds about right.)
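
A minimal sketch of the salted-hash idea, in plain Python rather than Spark so
that it runs on its own. The names salted_key and pseudo_shuffle are
hypothetical helpers, and the choice of SHA-256 and a 16-byte random salt is an
assumption, not something settled in this thread.

```python
import hashlib
import os


def salted_key(value, salt: bytes) -> bytes:
    """Sort key: hash of the salt followed by the value's repr.

    The hash is only used to scatter values; it is not a security measure.
    """
    return hashlib.sha256(salt + repr(value).encode("utf-8")).digest()


def pseudo_shuffle(values, salt: bytes):
    """Return the values ordered by their salted hash.

    The same salt always yields the same order, so the result is
    well-defined; a different salt yields an unrelated order.
    Caveat: equal values get equal keys, so duplicates end up adjacent.
    """
    return sorted(values, key=lambda v: salted_key(v, salt))


batch = list(range(10))
salt = os.urandom(16)  # assumption: a fresh salt is drawn for each batch
shuffled = pseudo_shuffle(batch, salt)
print(shuffled)
```

In Spark the same key function could presumably be handed to sortBy(), for
example rdd.sortBy(lambda v: salted_key(v, salt)) in PySpark, applied to each
batch through DStream.transform(); that variant is untested here. Because the
order depends only on the salt and the values, recomputing a lost partition
with the same salt should give the same result, which a call to a random
number generator inside the key function would not guarantee.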







Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
 If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (Haven't thought this through
much but sounds about right.)

This sounds promising. Where can I read more about the space (memory and
network overhead) and time complexity of sortBy?


