> If you want to permute an RDD, how about a sortBy() on a good hash
> function of each value plus some salt? (Haven't thought this through
> much but sounds about right.)

This sounds promising. Where can I read more about the space (memory and
network overhead) and time complexity of sortBy?
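
For concreteness, here is a minimal sketch of the salted-hash idea quoted above, in plain Python so it runs without a cluster. The names salted_key and permute are hypothetical, made up for illustration. It assumes the key function would be handed to PySpark's RDD.sortBy(keyfunc) in a real job; the local sorted() call only stands in for that step. hashlib is used instead of the built-in hash() because string hashes from hash() can differ between Python processes.

```python
import hashlib
import random


def salted_key(salt):
    """Return a key function mapping a value to a salted SHA-256 digest.

    Hypothetical helper: the digest serves as a pseudo-random sort key.
    """
    salt_bytes = str(salt).encode("utf-8")

    def key(value):
        # repr() gives a stable text form for simple values (ints, strings).
        return hashlib.sha256(salt_bytes + repr(value).encode("utf-8")).hexdigest()

    return key


def permute(values, salt):
    # Local stand-in for the distributed sort. In a Spark job this would be
    # something like rdd.sortBy(salted_key(salt)) (assumption, not run here).
    return sorted(values, key=salted_key(salt))


# Pick a fresh salt per batch so each batch gets a different ordering.
salt = random.getrandbits(64)
shuffled = permute(range(10), salt)
```

One caveat with this sketch: equal values hash to the same key, so duplicates end up next to each other. If that matters, the key would need some per-element uniqueness mixed in, for example an index.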



On Mon, Nov 3, 2014 at 10:38 AM, Sean Owen <so...@cloudera.com> wrote:

> If you iterated over an RDD's partitions, I'm not sure that in
> practice you would find the order matches the order they were
> received. The receiver is replicating data to another node or nodes as
> it goes and I don't know that much is guaranteed about that.
>
> If you want to permute an RDD, how about a sortBy() on a good hash
> function of each value plus some salt? (Haven't thought this through
> much but sounds about right.)
>
> On Mon, Nov 3, 2014 at 4:59 PM, Josh J <joshjd...@gmail.com> wrote:
> > When I'm outputting the RDDs to an external source, I would like the RDDs to
> > be outputted in a random shuffle so that even the order is random. So far
> > what I understood is that the RDDs do have a type of order, in that the
> > order for spark streaming RDDs would be the order in which spark streaming
> > read the tuples from source (e.g. ordered by roughly when the producer sent
> > the tuple in addition to any latency)
> >
> > On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> I think the answer will be the same in streaming as in the core. You
> >> want a random permutation of an RDD? In general RDDs don't have
> >> ordering at all -- excepting when you sort for example -- so a
> >> permutation doesn't make sense. Do you just want a well-defined but
> >> random ordering of the data? Do you just want to (re-)assign elements
> >> randomly to partitions?
> >>
> >> On Mon, Nov 3, 2014 at 4:33 PM, Josh J <joshjd...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Is there a nice or optimal method to randomly shuffle spark streaming
> >> > RDDs?
> >> >
> >> > Thanks,
> >> > Josh
> >
> >
>
