If you iterated over an RDD's partitions, I'm not sure that in
practice you would find the order matches the order in which the data
were received. The receiver replicates data to another node or nodes as
it goes, and I don't know how much is guaranteed about that ordering.

If you want to permute an RDD, how about a sortBy() on a good hash
function of each value plus some salt? (I haven't thought this through
much, but it sounds about right.)
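A rough sketch of that idea, shown on a plain Scala collection for brevity; on an actual RDD it would be `rdd.sortBy` with the same key function. The object and helper names below are my own, not from the thread:

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Sketch: permute a collection by sorting on a salted hash of each value.
// On an RDD, the equivalent would be:
//   rdd.sortBy(v => MurmurHash3.stringHash(v.toString + salt))
object SaltedShuffle {
  def shuffle[T](xs: Seq[T], salt: Long): Seq[T] =
    // A different salt yields a different (pseudo-random) permutation.
    xs.sortBy(v => MurmurHash3.stringHash(v.toString + salt.toString))

  def main(args: Array[String]): Unit = {
    val data = (1 to 10).toSeq
    val salt = new Random().nextLong()
    val permuted = shuffle(data, salt)
    println(permuted)
    // Same elements survive the shuffle; only the order changes.
    assert(permuted.sorted == data)
  }
}
```

The salt matters: without it, the same input always sorts to the same "random" order, since the hash of each value is fixed.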

On Mon, Nov 3, 2014 at 4:59 PM, Josh J <joshjd...@gmail.com> wrote:
> When I'm writing the RDDs out to an external sink, I would like them to
> be output in a randomly shuffled order, so that even the ordering is random.
> So far what I understand is that RDDs do have a kind of order: for Spark
> Streaming, the RDD order would be the order in which Spark Streaming read
> the tuples from the source (i.e. roughly the order in which the producer
> sent each tuple, plus any latency).
>
> On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think the answer will be the same in streaming as in the core. You
>> want a random permutation of an RDD? In general, RDDs don't have an
>> ordering at all -- except when you sort, for example -- so a
>> permutation doesn't make sense. Do you just want a well-defined but
>> random ordering of the data? Or do you just want to (re-)assign
>> elements randomly to partitions?
>>
>> On Mon, Nov 3, 2014 at 4:33 PM, Josh J <joshjd...@gmail.com> wrote:
>> > Hi,
>> >
>> > Is there a nice or optimal method to randomly shuffle spark streaming
>> > RDDs?
>> >
>> > Thanks,
>> > Josh
>
>
