@Jason I find this interesting, but I am not sure I understand it completely. I am assuming that the objective is to partition a dataset into two subsets, where the elements of each subset are randomly selected from the original dataset.
If I read the code correctly, what it does is: for each partition, it creates a random seed; for each element inside a partition, it generates a random number; then, over the (value, randomNumber) pairs, it creates two filters, one for the elements whose randomNumber is less than the split value and the other for the elements whose randomNumber is greater than it. And because the random generator is uniform, if we provide p: Double = 0.3, roughly 30% of the random numbers will fall under 0.3.

Is this done on one node, or does each node handle only the mapping for the partitions it received? I guess/hope the second, but I want to make sure.

Thank you

--------------------------
Luck favors the prepared mind. (Pasteur)
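
P.S. To make my reading concrete, here is a minimal sketch in plain Scala (no Spark) of what I think the code does. Everything in it is my own assumption, not the original code: the names `SplitSketch`, `tagPartition` and `split` are hypothetical, the partitions are simulated as a `Seq[Seq[A]]`, and deriving the per-partition seed as `seed + index` is just one possible choice. In Spark I would expect the per-partition step to be the function passed to something like `mapPartitionsWithIndex`, so that each executor only tags the partitions it holds, but that is exactly the part I would like confirmed.

```scala
import scala.util.Random

object SplitSketch {
  // Hypothetical helper: pair every element of ONE partition with a
  // uniform random number in [0, 1), using a generator seeded per partition.
  def tagPartition[A](index: Int, elems: Seq[A], seed: Long): Seq[(A, Double)] = {
    val rng = new Random(seed + index) // assumed seeding scheme: one generator per partition
    elems.map(x => (x, rng.nextDouble()))
  }

  // Split into two subsets: elements whose random number is below p,
  // and elements whose random number is at or above p.
  def split[A](partitions: Seq[Seq[A]], p: Double, seed: Long): (Seq[A], Seq[A]) = {
    val tagged = partitions.zipWithIndex.flatMap {
      case (part, i) => tagPartition(i, part, seed)
    }
    val below = tagged.filter(_._2 < p).map(_._1)  // roughly a fraction p of the data
    val above = tagged.filter(_._2 >= p).map(_._1) // roughly a fraction 1 - p
    (below, above)
  }

  def main(args: Array[String]): Unit = {
    // Simulate 4 partitions of 2500 elements each.
    val partitions = (1 to 10000).grouped(2500).map(_.toSeq).toList
    val (a, b) = split(partitions, p = 0.3, seed = 42L)
    println(s"first subset: ${a.size}, second subset: ${b.size}")
  }
}
```

With p = 0.3 the first subset should hold roughly 30% of the elements, and every element lands in exactly one of the two subsets because the two filters are complementary.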
