Re: Inconsistent behavior of randomSplit in YARN mode

2015-12-28 Thread Gaurav Kumar
Hi Ted,

I am using Spark 1.5.2

Without repartition in the picture, it works exactly as it's supposed to.
With repartition, I am guessing that when we call takeOrdered on train, it
goes ahead and computes the rdd, which has the repartition in its lineage,
and prints out the numbers. With the next call to takeOrdered on test, it
computes the rdd again and repartitions the data again. Since
repartitioning is not guaranteed to produce the same result each time, we
see different numbers because the rdd is effectively different now.
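
To make that concrete: since every action recomputes the lineage, even two
identical actions can disagree here. A sketch of what I mean (not something
I have re-verified on the cluster):

train.takeOrdered(10)
train.takeOrdered(10)  // may differ from the call above: each action re-runs
                       // the repartition shuffle, so the split samples
                       // different partition contents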

Moreover, if we cache the rdd after repartitioning it, both train and test
produce consistent results.
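
For reference, a minimal sketch of the cached variant (same code as in the
original mail, with the cache added; assuming a spark-shell session where
sc is available):

val rdd = sc.parallelize(1 to 100)
val rdd2 = rdd.repartition(64).cache()  // shuffle output is computed once and reused
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
train.takeOrdered(10)  // both actions now read the same cached partitions,
test.takeOrdered(10)   // so the splits stay disjoint and reproducible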



Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91 9953294125

On Mon, Dec 28, 2015 at 3:04 PM, Ted Yu wrote:

> bq. the train and test sets overlap in the numbers output
>
> Can the call to repartition explain the above?
>
> Which release of Spark are you using?
>
> Thanks
>
> On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar wrote:
>
>> Hi,
>>
>> I noticed inconsistent behavior when using rdd.randomSplit when the
>> source rdd is repartitioned, but only in YARN mode. It works fine in local
>> mode though.
>>
>> *Code:*
>> val rdd = sc.parallelize(1 to 100)
>> val rdd2 = rdd.repartition(64)
>> rdd.partitions.size
>> rdd2.partitions.size
>> val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
>> train.takeOrdered(10)
>> test.takeOrdered(10)
>>
>> *Master: local*
>> Both take statements produce consistent results, with no overlap in the
>> numbers output.
>>
>>
>> *Master: YARN*
>> However, when these are run in YARN mode, they produce different results
>> on every run, and the train and test sets overlap in the numbers output.
>> If I use *rdd*.randomSplit instead, it works fine even on YARN.
>>
>> So it appears that the repartition is being re-evaluated every time the
>> split is computed.
>>
>> Interestingly, if I cache rdd2 before splitting it, we get consistent
>> behavior, since the repartition is not evaluated again and again.
>>
>> Best Regards,
>> Gaurav Kumar
>> Big Data • Data Science • Photography • Music
>> +91 9953294125
>>
>
>


Re: Inconsistent behavior of randomSplit in YARN mode

2015-12-28 Thread Ted Yu
bq. the train and test sets overlap in the numbers output

Can the call to repartition explain the above?

Which release of Spark are you using?

Thanks

On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar wrote:

> Hi,
>
> I noticed inconsistent behavior when using rdd.randomSplit when the
> source rdd is repartitioned, but only in YARN mode. It works fine in local
> mode though.
>
> *Code:*
> val rdd = sc.parallelize(1 to 100)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
>
> *Master: local*
> Both take statements produce consistent results, with no overlap in the
> numbers output.
>
>
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results
> on every run, and the train and test sets overlap in the numbers output.
> If I use *rdd*.randomSplit instead, it works fine even on YARN.
>
> So it appears that the repartition is being re-evaluated every time the
> split is computed.
>
> Interestingly, if I cache rdd2 before splitting it, we get consistent
> behavior, since the repartition is not evaluated again and again.
>
> Best Regards,
> Gaurav Kumar
> Big Data • Data Science • Photography • Music
> +91 9953294125
>


Inconsistent behavior of randomSplit in YARN mode

2015-12-27 Thread Gaurav Kumar
Hi,

I noticed inconsistent behavior when using rdd.randomSplit when the
source rdd is repartitioned, but only in YARN mode. It works fine in local
mode though.

*Code:*
val rdd = sc.parallelize(1 to 100)
val rdd2 = rdd.repartition(64)
rdd.partitions.size
rdd2.partitions.size
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
train.takeOrdered(10)
test.takeOrdered(10)

*Master: local*
Both take statements produce consistent results, with no overlap in the
numbers output.


*Master: YARN*
However, when these are run in YARN mode, they produce different results
on every run, and the train and test sets overlap in the numbers output.
If I use *rdd*.randomSplit instead, it works fine even on YARN.

So it appears that the repartition is being re-evaluated every time the
split is computed.

Interestingly, if I cache rdd2 before splitting it, we get consistent
behavior, since the repartition is not evaluated again and again.
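
For illustration, the caching I have in mind is just the following (a
sketch, assuming a spark-shell session with enough memory; StorageLevel
needs the import shown):

import org.apache.spark.storage.StorageLevel

rdd2.persist(StorageLevel.MEMORY_ONLY)  // cache() is shorthand for this level
rdd2.count()  // optional: force materialization so later actions reuse the blocks
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)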

Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91 9953294125