Hi all,

I have noticed that the “join” operator in PySpark is implemented with the union and groupByKey operators instead of the cogroup operator. This will probably generate an extra shuffle stage. For example:
rdd1 = sc.makeRDD(...).partitionBy(2)
rdd2 = sc.makeRDD(...).partitionBy(2)
rdd3 = rdd1.join(rdd2).collect()

The code above generates 2 shuffles when implemented in Scala, but 3 shuffles with PySpark (a rough sketch of the difference is below my signature). What was the initial design motivation for the join operator in PySpark? Are there any ideas for improving join performance in PySpark?

Andrew
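
For illustration, here is a minimal PySpark sketch contrasting the two styles of join. The sample data, app name, and the pair_up helper are made up, and the union + groupByKey version is only my understanding of that approach, not the actual PySpark source:

from pyspark import SparkContext

sc = SparkContext("local[2]", "join-shuffle-sketch")

# Two pair RDDs, each hash-partitioned into 2 partitions (shuffles 1 and 2).
rdd1 = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(2)
rdd2 = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(2)

# cogroup-based join (Scala style): cogroup the two co-partitioned RDDs,
# then emit one pair per combination of left and right values for each key.
joined_cogroup = rdd1.cogroup(rdd2).flatMapValues(
    lambda vs: [(v, w) for v in vs[0] for w in vs[1]])

# union + groupByKey based join: tag each value with its side, union the
# two RDDs (which drops the partitioner), then group everything by key.
tagged = rdd1.mapValues(lambda v: ("L", v)).union(
    rdd2.mapValues(lambda w: ("R", w)))

def pair_up(tagged_values):
    vals = list(tagged_values)           # materialize so we can scan twice
    left = [v for tag, v in vals if tag == "L"]
    right = [w for tag, w in vals if tag == "R"]
    return [(l, r) for l in left for r in right]

joined_union = tagged.groupByKey().flatMapValues(pair_up)

print(joined_cogroup.collect())
print(joined_union.collect())

As far as I can tell, the cogroup version can reuse the partitioning produced by the two partitionBy calls, while the union loses the partitioner and groupByKey has to shuffle the data a third time, which matches the 2 vs. 3 shuffle counts above.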