Hi, I noticed inconsistent behavior when using rdd.randomSplit on a repartitioned RDD, but only in YARN mode; it works fine in local mode.
*Code:*

```scala
val rdd = sc.parallelize(1 to 1000000)
val rdd2 = rdd.repartition(64)
rdd.partitions.size
rdd2.partitions.size
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
train.takeOrdered(10)
test.takeOrdered(10)
```

*Master: local*

Both take statements produce consistent results, and there is no overlap between the numbers output by train and test.

*Master: YARN*

When the same code is run in YARN mode, the take statements produce different results on every run, and the train and test outputs overlap. If I call randomSplit on *rdd* instead of *rdd2*, it works fine even on YARN. This suggests that the repartition is being re-evaluated every time the split is computed. Interestingly, if I cache rdd2 before splitting it, the behavior becomes consistent, since the repartition is no longer evaluated again and again.

Best Regards,
Gaurav Kumar
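P.S. A minimal sketch of the caching workaround described above, assuming a live SparkContext named `sc` (and using fractional weights, which is what RDD.randomSplit expects):

```scala
val rdd = sc.parallelize(1 to 1000000)

// Persist the shuffled RDD so repartition's random element placement
// is computed once and then reused from the cache.
val rdd2 = rdd.repartition(64).cache()

// With rdd2 cached, the same seed should now yield the same, disjoint
// splits on every action, regardless of cluster mode.
val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), seed = 1)
train.takeOrdered(10)
test.takeOrdered(10)
```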