You're welcome. Be sure to use the second version I posted; the first version is problematic and could result in a bad (non-random) split under some circumstances.
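
To make the per-partition seeding idea easier to check locally, here is a minimal sketch that does not use Spark at all. It assumes a nested Seq can stand in for an RDD's partitions; the names SplitSketch, Partitions and tagged are made up for illustration and are not part of any Spark API. The tagging step is deliberately run twice, once per output, on the assumption that an uncached RDD would be recomputed for each filter. Because each partition's generator is rebuilt from a fixed seed, both passes should see the same draws, so the two halves stay complementary.

```scala
import java.util.Random

object SplitSketch {
  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  type Partitions[T] = Seq[Seq[T]]

  // Pairs every element with a draw from a generator seeded per partition.
  // A fresh generator is built on every call, so repeated calls give identical draws.
  def tagged[T](data: Partitions[T], partitionSeeds: Seq[Long]): Partitions[(T, Double)] =
    data.zipWithIndex.map { case (part, index) =>
      val partitionRand = new Random(partitionSeeds(index))
      part.map(x => (x, partitionRand.nextDouble()))
    }

  def split[T](data: Partitions[T], p: Double, seed: Long): (Seq[T], Seq[T]) = {
    // One master generator hands out a seed per partition.
    val rand = new Random(seed)
    val partitionSeeds = data.map(_ => rand.nextLong())
    // Two independent passes over the tagged data, one per output.
    val left  = tagged(data, partitionSeeds).flatten.filter(_._2 <= p).map(_._1)
    val right = tagged(data, partitionSeeds).flatten.filter(_._2 > p).map(_._1)
    (left, right)
  }

  def main(args: Array[String]): Unit = {
    // 100 integers in four "partitions" of 25 elements each.
    val data: Partitions[Int] = (1 to 100).grouped(25).map(_.toVector).toVector
    val (a, b) = split(data, 0.7, seed = 42L)
    // The halves are disjoint and together cover the input.
    assert(a.intersect(b).isEmpty)
    assert((a ++ b).sorted.sameElements(1 to 100))
    println(s"${a.size} / ${b.size}")
  }
}
```

The sizes of the two halves are only approximately in the ratio p, since each element is assigned independently.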

On Fri, Sep 13, 2013 at 8:31 AM, Hao REN wrote:

> @Jason: Thank you for your code. It works fine.
>
> @Mark: It's good to know about random number generation. Thanks for the
> advice.
>
> Still a question:
>
> As subtract can be replaced by Jason's code, what is the use case of
> subtract, knowing that it is not a good way to partition data?
>
> Thank you.
>
> Hao
>
>
> On Fri, Sep 13, 2013 at 3:33 AM, Jason Lenderman wrote:
>
>> Yeah, I realized shortly after I sent that message that my use of map in
>> that code was problematic. This is probably a bit better:
>>
>> def split[T : ClassManifest](data: RDD[T], p: Double, seed: Long =
>>     System.currentTimeMillis): (RDD[T], RDD[T]) = {
>>   val rand = new java.util.Random(seed)
>>   val partitionSeeds = data.partitions.map(partition => rand.nextLong)
>>   val temp = data.mapPartitionsWithIndex((index, iter) => {
>>     val partitionRand = new java.util.Random(partitionSeeds(index))
>>     iter.map(x => (x, partitionRand.nextDouble))
>>   })
>>   (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
>> }
>
> --
> REN Hao
>
> Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
> Computer Science
>
> Student at the Université de Technologie de Compiègne (UTC)
> Computer Engineering - Data Mining
>
> Tel: +33 06 14 54 57 24 / +41 07 86 47 52 69
