Hi Ted,

I am using Spark 1.5.2

Without repartition in the picture, it works exactly as it's supposed to.
With repartition, my guess is that when we call takeOrdered on train, Spark
computes the rdd, including the repartition step, and prints out the
numbers. The next call, takeOrdered on test, computes the rdd again and so
repartitions the data again. Since repartitioning is not guaranteed to
produce the same result each time, and the split depends on how the
elements are laid out within each partition, we see different numbers
because the rdd is effectively different now.

Moreover, if we cache the rdd after repartitioning it, both train and test
produce consistent results.
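
For reference, here is a minimal sketch of the cached variant, assuming a
spark-shell session where `sc` is already defined. The count() call and the
intersection check are my own additions for illustration, not part of the
original test; the weights 0.7/0.3 are equivalent to the 70/30 used earlier.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
val rdd2 = sc.parallelize(1 to 1000000).repartition(64)

// Mark the repartitioned rdd for caching and run one action to materialize
// it, so later actions can read the cached partitions instead of re-running
// the shuffle.
rdd2.cache()
rdd2.count()

val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), 1L)

train.takeOrdered(10)
test.takeOrdered(10)

// Sanity check (illustrative): number of elements present in both splits.
val overlap = train.intersection(test).count()
```

If the cached partitions are reused, overlap should come out as 0. Note that
this relies on the cached blocks staying in memory; if they are evicted, the
repartition may be recomputed.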



Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91 9953294125

On Mon, Dec 28, 2015 at 3:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. the train and test have overlap in the numbers being outputted
>
> Can the call to repartition explain the above ?
>
> Which release of Spark are you using ?
>
> Thanks
>
> On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar <gauravkuma...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I noticed inconsistent behavior when using rdd.randomSplit when the
>> source rdd is repartitioned, but only in YARN mode. It works fine in local
>> mode though.
>>
>> *Code:*
>> val rdd = sc.parallelize(1 to 1000000)
>> val rdd2 = rdd.repartition(64)
>> rdd.partitions.size
>> rdd2.partitions.size
>> val Array(train, test) = *rdd2*.randomSplit(Array(70, 30), 1)
>> train.takeOrdered(10)
>> test.takeOrdered(10)
>>
>> *Master: local*
>> Both the take statements produce consistent results and have no overlap
>> in numbers being outputted.
>>
>>
>> *Master: YARN*
>> However, when these are run in YARN mode, they produce random results
>> every time, and the train and test outputs also overlap in the numbers
>> being outputted.
>> If I use *rdd*.randomSplit, then it works fine even on YARN.
>>
>> So, I conclude that the repartition is being evaluated every time the
>> splitting occurs.
>>
>> Interestingly, if I cache the rdd2 before splitting it, then we can
>> expect consistent behavior since repartition is not evaluated again and
>> again.
>>
>> Best Regards,
>> Gaurav Kumar
>> Big Data • Data Science • Photography • Music
>> +91 9953294125
>>
>
>
