Hi,
Is there a nice or optimal method to randomly shuffle Spark Streaming RDDs?
Thanks,
Josh
I think the answer will be the same in streaming as in the core. You
want a random permutation of an RDD? In general RDDs don't have
ordering at all -- excepting when you sort for example -- so a
permutation doesn't make sense. Do you just want a well-defined but
random ordering of the data? Do
When I'm outputting the RDDs to an external source, I would like them
to be written in a randomly shuffled order, so that even the order is random. So
far what I understood is that the RDDs do have a type of order, in that the
order for Spark Streaming RDDs would be the order in which Spark Streaming
A use case would be helpful.
Batches of RDDs from Streams are going to have temporal ordering in terms of
when they are processed in a typical application ... , but maybe you could
shuffle the way batch iterations work
On Nov 3, 2014, at 11:59 AM, Josh J joshjd...@gmail.com wrote:
When
If you iterated over an RDD's partitions, I'm not sure that in
practice you would find the order matches the order they were
received. The receiver is replicating data to another node or nodes as
it goes and I don't know how much is guaranteed about that.
If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (Haven't thought this through
much but sounds about right.)
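A rough sketch of the salted-hash idea, in plain Python so it runs without a cluster. The helper names (salted_key, shuffle_by_salted_hash) and the choice of SHA-256 are assumptions for illustration, not anything from Spark itself. hashlib is used instead of the built-in hash() because Python randomizes string hashes per process, so workers could disagree on the keys.

```python
import hashlib
import os


def salted_key(value, salt):
    """Map a record to a deterministic, pseudo-random integer sort key.

    Hypothetical helper: hashes the salt followed by repr(value).
    """
    digest = hashlib.sha256(salt + repr(value).encode("utf-8")).digest()
    # The first 8 bytes are plenty for a sort key.
    return int.from_bytes(digest[:8], "big")


def shuffle_by_salted_hash(records, salt):
    """Return the records ordered by their salted hash."""
    return sorted(records, key=lambda v: salted_key(v, salt))


# Draw a fresh salt per batch so each batch gets a different ordering.
salt = os.urandom(16)
batch = list(range(10))
shuffled = shuffle_by_salted_hash(batch, salt)

# On a PySpark RDD the same key function could be passed to sortBy, e.g.
#   rdd.sortBy(lambda v: salted_key(v, salt))
```

Two caveats with this sketch: equal values hash to the same key and so stay adjacent in the output, and the result is only as random as the hash and the salt make it.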
This sounds promising. Where can I read more about the space (memory and
network overhead) and time complexity of sortBy?