I believe your understanding is correct. And yes, the processing happens in parallel for each partition.
On Fri, Sep 13, 2013 at 10:56 AM, Fabrizio Milo aka misto <[email protected]> wrote:

> @Jason
>
> I find this interesting, but I am not sure I understand it completely.
> I am assuming that the objective is to partition a dataset into two
> subsets whose elements are randomly selected from the original dataset.
>
> If I read the code correctly, what it does is:
>
> For each partition, create a random seed.
> For each element inside a partition, generate a random number.
> For each (value, randomNumber) pair, create two filters: one for the
> elements whose random number is less than the split value, and the
> other for elements whose random number is greater than or equal to it.
>
> And because the random generator is uniform, if we provide p: Double = 0.3,
> roughly 30% of the random numbers will fall under 0.3.
>
> Is this done on one node, or does each node handle only the mapping on
> the partition it received? I guess/hope the second, but I want to make
> sure.
>
> Thank you
> --------------------------
> Luck favors the prepared mind. (Pasteur)
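For the archives, the scheme described above can be sketched in plain Python (no Spark required). This is only an illustration of the technique, not the actual code under discussion; the function and parameter names (`random_split`, `base_seed`) are made up for this example. Each partition gets its own seeded RNG, every element draws one uniform number, and the element is routed by comparing that draw to the split fraction p:

```python
import random

def random_split(partitions, p, base_seed=42):
    """Split partitioned data into two subsets.

    For each partition, seed an RNG (reproducible and independent per
    partition), draw one uniform number per element, and route the
    element by comparing that draw to the split fraction p. Because the
    draws are uniform on [0, 1), roughly a p fraction of elements lands
    in the first subset.
    """
    left, right = [], []
    for idx, part in enumerate(partitions):
        rng = random.Random(base_seed + idx)  # per-partition seed
        for value in part:
            if rng.random() < p:  # ~p of draws fall below p
                left.append(value)
            else:
                right.append(value)
    return left, right
```

Note that the body of the outer loop touches only its own partition and RNG, which is what makes the per-partition version parallelizable: in Spark, each task would run that inner loop independently on whichever node holds its partition.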
