@Jason: Thank you for your code. It works fine.
@Mark: It's good to know about random number generation. Thanks for the advice.

One question remains: since subtract can be replaced by Jason's code, what is the use case for subtract, given that it is not a good way to partition data?

Thank you.

Hao

On Fri, Sep 13, 2013 at 3:33 AM, Jason Lenderman <[email protected]> wrote:

> Yeah, I realized shortly after I sent that message that my use of map in
> that code was problematic. This is probably a bit better:
>
> def split[T : ClassManifest](data: RDD[T], p: Double,
>     seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
>   val rand = new java.util.Random(seed)
>   val partitionSeeds = data.partitions.map(partition => rand.nextLong)
>   val temp = data.mapPartitionsWithIndex((index, iter) => {
>     val partitionRand = new java.util.Random(partitionSeeds(index))
>     iter.map(x => (x, partitionRand.nextDouble))
>   })
>   (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
> }

-- 
REN Hao

Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
Computer Science

Student at the Université de Technologie de Compiègne (UTC)
Computer Engineering - Data Mining

Tel: +33 06 14 54 57 24 / +41 07 86 47 52 69
