Re: RDD.subtract doesn't work

Jason Lenderman Fri, 13 Sep 2013 09:51:29 -0700

You're welcome. Be sure to use the second version I posted: the first
version is problematic and could result in a bad (non-random) split under
some circumstances.
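
For anyone who wants to try the idea without a cluster, here is a minimal
sketch of the per-partition seeding approach in plain Scala. It is not the
Spark code itself: nested Seqs stand in for an RDD's partitions, and the
names SplitSketch and tagged are made up for illustration. As far as I
understand, in the Spark version temp is evaluated once for each filter, so
the fixed per-partition seeds are what keep the two passes consistent. The
sketch imitates that by recomputing the tagged data for each filter.

```scala
import java.util.Random

object SplitSketch {
  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  def split[T](partitions: Seq[Seq[T]], p: Double, seed: Long): (Seq[T], Seq[T]) = {
    val rand = new Random(seed)
    // Draw one seed per partition up front from a single generator.
    val partitionSeeds = partitions.map(_ => rand.nextLong())

    // A def, not a val: the tagging is recomputed on every use, the way an
    // uncached RDD would be. Fixed seeds make each recomputation identical.
    def tagged: Seq[(T, Double)] =
      partitions.zip(partitionSeeds).flatMap { case (part, partitionSeed) =>
        val partitionRand = new Random(partitionSeed)
        part.map(x => (x, partitionRand.nextDouble()))
      }

    (tagged.filter(_._2 <= p).map(_._1), tagged.filter(_._2 > p).map(_._1))
  }

  def main(args: Array[String]): Unit = {
    val data: Seq[Seq[Int]] = Seq(1 to 5, 6 to 10, 11 to 15)
    val (a, b) = split(data, 0.7, seed = 42L)
    val (a2, b2) = split(data, 0.7, seed = 42L)

    // Same seed, same split.
    assert(a == a2 && b == b2)
    // Every element lands in exactly one of the two halves.
    assert((a ++ b).sorted.toList == (1 to 15).toList)
    println(s"sizes: ${a.size} / ${b.size}")
  }
}
```

Passing an explicit seed makes the split reproducible; the default of
System.currentTimeMillis in the version I posted gives a different split on
each run.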



On Fri, Sep 13, 2013 at 8:31 AM, Hao REN <[email protected]> wrote:

> @Jason: Thank you for your code. It works fine.
>
> @Mark: It's good to know about random number generation. Thanks for the
> advice.
>
> One question remains:
>
> Since subtract can be replaced by Jason's code, what is the use case for
> subtract, given that it is not a good way to partition data?
>
> Thank you.
>
> Hao
>
>
> On Fri, Sep 13, 2013 at 3:33 AM, Jason Lenderman <[email protected]> wrote:
>
>>
>> Yeah, I realized shortly after I sent that message that my use of map in
>> that code was problematic. This is probably a bit better:
>>
>>
>>   // Randomly split data into two RDDs; each element goes to the first
>>   // with probability p.
>>   def split[T : ClassManifest](data: RDD[T], p: Double,
>>       seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
>>     val rand = new java.util.Random(seed)
>>     // One seed per partition, generated on the driver.
>>     val partitionSeeds = data.partitions.map(partition => rand.nextLong)
>>     // Tag every element with a draw from its partition's own generator.
>>     val temp = data.mapPartitionsWithIndex((index, iter) => {
>>       val partitionRand = new java.util.Random(partitionSeeds(index))
>>       iter.map(x => (x, partitionRand.nextDouble))
>>     })
>>     (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
>>   }
>>
>>
>>
>>
>
>
> --
> REN Hao
>
> Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
>
> Computer Science
>
> Student at the Université de Technologie de Compiègne (UTC)
>
> Computer Engineering - Data Mining
>
> Tel:  +33 06 14 54 57 24  ／  +41 07 86 47 52 69
>
