Re: RDD.subtract doesn't work

Fabrizio Milo aka misto Fri, 13 Sep 2013 10:58:06 -0700

@Jason

I find it interesting, but I am not sure I understand it completely.
I am assuming that the objective is to partition a dataset into two
subsets, where the elements of each subset are randomly selected from
the original dataset.


If I read the code correctly, what it does is:

For each partition, create a random seed.
For each element inside a partition, generate a random number.
Over the resulting (value, randomNumber) pairs, create two filters:
one for the elements whose randomNumber is less than the split value,
and the other for the elements whose randomNumber is greater than the
split value.
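
The steps above can be sketched in plain Python. This is not Spark and
not the code from the thread, only an illustration of my reading of it.
The function name split_partition, the base_seed parameter, and the
sample data are hypothetical. The second filter uses >= rather than a
strict > so that every element lands in exactly one subset, which is an
assumption on my part.

```python
import random

def split_partition(partition_index, elements, p, base_seed=42):
    """Split one partition's elements into two lists using one seeded RNG."""
    # One generator per partition, seeded from a base seed plus the partition index.
    rng = random.Random(base_seed + partition_index)
    # Pair each element with a uniform random number in [0, 1).
    tagged = [(value, rng.random()) for value in elements]
    # Two filters over the same tagged pairs: below p, and at or above p.
    below = [value for value, r in tagged if r < p]
    at_or_above = [value for value, r in tagged if r >= p]
    return below, at_or_above

# Hypothetical data: three "partitions" held as plain lists.
partitions = [list(range(0, 10)), list(range(10, 20)), list(range(20, 30))]
first, second = [], []
for index, elements in enumerate(partitions):
    below, at_or_above = split_partition(index, elements, p=0.3)
    first.extend(below)
    second.extend(at_or_above)
```

Because both filters read the same (value, randomNumber) pairs, the two
subsets are disjoint and together cover the whole input. Each call only
touches the elements of the partition it is given.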

And because the random generator is uniform, if we provide p: Double = 0.3,
roughly 30% of the random numbers will fall below 0.3, so roughly 30%
of the elements end up in the first subset.
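
A quick way to convince oneself of the 30% figure is to draw many
uniform numbers and count how many fall below p. This is a minimal
sketch in plain Python. The seed and the sample size n are arbitrary
choices of mine, not values from the thread.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.3                 # split value
n = 100_000             # number of draws (arbitrary)

# Draw n uniform numbers in [0, 1) and measure the share that falls below p.
draws = [rng.random() for _ in range(n)]
fraction_below = sum(1 for r in draws if r < p) / n
print(f"fraction below {p}: {fraction_below:.3f}")
```

The printed fraction should be close to p, and it gets closer as n
grows. The split is therefore only approximately 30/70 for any finite
dataset.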

Is this done on one node, or does each node handle only the mapping
for the partitions it received? I guess/hope the latter, but I want to
make sure.

Thank you
--------------------------
Luck favors the prepared mind. (Pasteur)
