Re: random shuffle streaming RDDs?
I think the answer will be the same in streaming as in the core. You want a random permutation of an RDD? In general, RDDs don't have an ordering at all (except when you sort, for example), so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Or do you just want to (re-)assign elements randomly to partitions?

On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd...@gmail.com wrote:
> Hi,
> Is there a nice or optimal method to randomly shuffle spark streaming RDDs?
> Thanks,
> Josh
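
To make the second option concrete, here is a minimal sketch in plain Python (no Spark involved) of assigning elements randomly to partitions. The function name `assign_random_partitions` and its parameters are hypothetical, chosen only for illustration. Note that it randomizes which partition each element lands in, but not the order of elements within a partition, which is why it is a different request from a random ordering.

```python
import random


def assign_random_partitions(values, num_partitions, seed=None):
    """Place each value into a randomly chosen partition (a plain list).

    Hypothetical helper: lists stand in for RDD partitions here.
    """
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_partitions)]
    for value in values:
        # Pick a partition index uniformly at random for this value.
        partitions[rng.randrange(num_partitions)].append(value)
    return partitions


if __name__ == "__main__":
    for index, part in enumerate(assign_random_partitions(range(20), 4, seed=1)):
        print(index, part)
```

Within each partition the values keep their input order, so writing the partitions out one after another would still not give a random permutation.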
Re: random shuffle streaming RDDs?
When I output the RDDs to an external source, I would like them to be written in a randomly shuffled order, so that even the order is random. My understanding so far is that the RDDs do have a kind of order: for Spark Streaming RDDs, it would be the order in which Spark Streaming read the tuples from the source (e.g. roughly ordered by when the producer sent each tuple, plus any latency).

On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen so...@cloudera.com wrote:
> I think the answer will be the same in streaming as in the core. You want a random permutation of an RDD? in general RDDs don't have ordering at all -- excepting when you sort for example -- so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Do you just want to (re-)assign elements randomly to partitions?
Re: random shuffle streaming RDDs?
A use case would be helpful. Batches of RDDs from streams are going to have a temporal ordering, in terms of when they are processed, in a typical application ..., but maybe you could shuffle the way batch iterations work.

On Nov 3, 2014, at 11:59 AM, Josh J joshjd...@gmail.com wrote:
> When I'm outputting the RDDs to an external source, I would like the RDDs to be outputted in a random shuffle so that even the order is random. So far what I understood is that the RDDs do have a type of order, in that the order for spark streaming RDDs would be the order in which spark streaming read the tuples from source (e.g. ordered by roughly when the producer sent the tuple in addition to any latency)
Re: random shuffle streaming RDDs?
If you iterated over an RDD's partitions, I'm not sure that in practice you would find that the order matches the order in which the elements were received. The receiver replicates data to another node or nodes as it goes, and I don't know how much is guaranteed about that.

If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (I haven't thought this through much, but it sounds about right.)

On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote:
> When I'm outputting the RDDs to an external source, I would like the RDDs to be outputted in a random shuffle so that even the order is random. So far what I understood is that the RDDs do have a type of order, in that the order for spark streaming RDDs would be the order in which spark streaming read the tuples from source (e.g. ordered by roughly when the producer sent the tuple in addition to any latency)
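
The salted-hash idea can be sketched roughly as follows. This is a plain-Python illustration, not tested against Spark: the built-in `sorted` stands in for `sortBy()` so that it runs without a cluster, and the names `salted_key` and `pseudo_shuffle` are hypothetical. SHA-256 is just one possible choice of "good hash function", and hashing the value's `repr` is an assumption that only works for values with a stable text representation.

```python
import hashlib
import os


def salted_key(value, salt):
    """Return a sort key: the SHA-256 digest of the salt plus the value's repr."""
    data = salt + repr(value).encode("utf-8")
    return hashlib.sha256(data).digest()


def pseudo_shuffle(values, salt):
    """Order values by their salted hash; a different salt gives a different order."""
    return sorted(values, key=lambda value: salted_key(value, salt))


if __name__ == "__main__":
    salt = os.urandom(16)  # assumption: draw a fresh salt for every batch
    print(pseudo_shuffle(range(10), salt))
    # With PySpark, the same key function would presumably be passed as
    # rdd.sortBy(lambda value: salted_key(value, salt)).
```

One caveat of this approach: equal values always produce the same key, so duplicates end up next to each other in the output. If that matters, the key would need something unique per element mixed in, such as an index.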
Re: random shuffle streaming RDDs?
On Mon, Nov 3, 2014 at 10:38 AM, Sean Owen so...@cloudera.com wrote:
> If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.)

This sounds promising. Where can I read more about the space (memory and network overhead) and time complexity of sortBy?