Hi,

I noticed inconsistent behavior with rdd.randomSplit when the source RDD
is repartitioned, but only in YARN mode. It works fine in local mode.

*Code:*
val rdd = sc.parallelize(1 to 1000000)
val rdd2 = rdd.repartition(64)
rdd.partitions.size
rdd2.partitions.size
val Array(train, test) = *rdd2*.randomSplit(Array(70.0, 30.0), 1)
train.takeOrdered(10)
test.takeOrdered(10)

*Master: local*
Both takeOrdered calls produce consistent results across runs, and there
is no overlap between the numbers output by train and test.


*Master: YARN*
When run on YARN, however, these produce different results every time,
and train and test overlap in the numbers output.
If I call randomSplit on *rdd* instead, it works fine even on YARN.

This suggests that the repartition is being re-evaluated each time the
split is computed.

Interestingly, if I cache rdd2 before splitting it, the behavior becomes
consistent, since the repartition is not re-evaluated on each action.
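To make that concrete, here is a minimal sketch of the workaround: persisting the repartitioned RDD materializes the shuffle output once, so both splits (and any later actions) see the same partition contents. The storage level shown is just one reasonable choice, not something from the original code:

```scala
// Persist the shuffled RDD so randomSplit sees stable partition contents.
val rdd  = sc.parallelize(1 to 1000000)
val rdd2 = rdd.repartition(64).cache() // or persist(StorageLevel.MEMORY_AND_DISK)

// With rdd2 cached, the same seed yields the same, disjoint split every time.
val Array(train, test) = rdd2.randomSplit(Array(70.0, 30.0), seed = 1)
```

Checkpointing rdd2 would work as well; the key point is that the non-deterministic shuffle must not be recomputed between the actions on train and test.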

Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91 9953294125
