>> If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.)
This sounds promising. Where can I read more about the space (memory and network overhead) and time complexity of sortBy?

On Mon, Nov 3, 2014 at 10:38 AM, Sean Owen <so...@cloudera.com> wrote:
> If you iterated over an RDD's partitions, I'm not sure that in
> practice you would find the order matches the order they were
> received. The receiver is replicating data to another node or nodes as
> it goes and I don't know how much is guaranteed about that.
>
> If you want to permute an RDD, how about a sortBy() on a good hash
> function of each value plus some salt? (Haven't thought this through
> much but sounds about right.)
>
> On Mon, Nov 3, 2014 at 4:59 PM, Josh J <joshjd...@gmail.com> wrote:
> > When I'm outputting the RDDs to an external source, I would like the RDDs to
> > be outputted in a random shuffle so that even the order is random. So far
> > what I understood is that the RDDs do have a type of order, in that the
> > order for spark streaming RDDs would be the order in which spark streaming
> > read the tuples from source (e.g. ordered by roughly when the producer sent
> > the tuple in addition to any latency)
> >
> > On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> I think the answer will be the same in streaming as in the core. You
> >> want a random permutation of an RDD? In general RDDs don't have
> >> ordering at all -- excepting when you sort for example -- so a
> >> permutation doesn't make sense. Do you just want a well-defined but
> >> random ordering of the data? Do you just want to (re-)assign elements
> >> randomly to partitions?
> >>
> >> On Mon, Nov 3, 2014 at 4:33 PM, Josh J <joshjd...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Is there a nice or optimal method to randomly shuffle spark streaming
> >> > RDDs?
> >> >
> >> > Thanks,
> >> > Josh
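
For reference, here is a rough sketch of the "sortBy() on a hash of each value plus some salt" idea from the thread. It is plain Python so it runs without a cluster; the helper names `salted_key` and `permute` are made up for illustration and are not part of Spark. The assumption is that the same key function could be handed to PySpark's `RDD.sortBy`, as noted in the trailing comment.

```python
import hashlib
import random


def salted_key(value, salt):
    """Deterministic pseudo-random sort key for `value` under `salt`.

    hashlib is used instead of the built-in hash(), because hash() on
    strings can differ between Python processes, which would give
    inconsistent keys across workers.
    """
    data = (salt + repr(value)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def permute(values, salt):
    """Local stand-in for rdd.sortBy: order values by their salted hash."""
    return sorted(values, key=lambda v: salted_key(v, salt))


# Pick a fresh salt per run (or per streaming batch) to get a new ordering.
salt = "%032x" % random.getrandbits(128)
items = list(range(10))
shuffled = permute(items, salt)

# Assumed PySpark usage (not executed here):
#   shuffled_rdd = rdd.sortBy(lambda v: salted_key(v, salt))
```

Two caveats with this sketch: equal values hash to the same key, so duplicates end up next to each other, and the ordering is only as random as the salt. On the cost question, my understanding is that sortBy is built on sortByKey, which samples the data to pick range boundaries and then does a full shuffle followed by a sort within each partition, so expect roughly the network cost of shuffling the whole RDD plus an n log n sort.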