We should implement this using cogroup(); it will just require some bookkeeping to map Python partitioners onto dummy Java ones so that Java Spark's cogroup() operator respects Python's partitioning. I'm sure there are other subtleties, too, particularly if we mix datasets that use different serializers / Java object representations.
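For illustration, a cogroup-based join would look roughly like the sketch below (illustrative only, not the actual PySpark source; the function name is made up). Assuming cogroup() respected Python partitioning, which is exactly what the partitioner-mapping fix above would provide, co-partitioned inputs would need no extra shuffle:

    # Hypothetical sketch: join expressed via cogroup(). If rdd and other
    # are partitioned the same way and cogroup() can see that, the join
    # itself adds no shuffle stage.
    def cogroup_join(rdd, other, numPartitions=None):
        def emit_pairs(kv):
            key, (vs, ws) = kv
            # cross product of the matching values from each side
            return ((key, (v, w)) for v in vs for w in ws)
        return rdd.cogroup(other, numPartitions).flatMap(emit_pairs)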
There's a longstanding JIRA to fix this: https://issues.apache.org/jira/browse/SPARK-655

On November 13, 2014 at 5:08:15 AM, 夏俊鸾 (xiajunl...@gmail.com) wrote:

Hi all,

I have noticed that the "join" operator in PySpark is implemented with union and groupByKey instead of cogroup. This change will probably generate more shuffle stages. For example:

    rdd1 = sc.makeRDD(...).partitionBy(2)
    rdd2 = sc.makeRDD(...).partitionBy(2)
    rdd3 = rdd1.join(rdd2).collect()

The code above generates 2 shuffles when written in Scala, but 3 shuffles in Python. What was the initial design motivation for the join operator in PySpark? Any ideas for improving join performance in PySpark?

Andrew
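For reference, the union + groupByKey pattern described above looks roughly like this (an illustrative sketch, not the exact PySpark source; the function name is made up). The third shuffle comes from groupByKey, since the unioned RDD no longer carries either input's partitioner:

    # Illustrative sketch of a union + groupByKey join. Tagging each value
    # with its origin lets a single groupByKey gather both sides, but the
    # union discards partitioner information, so groupByKey must shuffle
    # again even if rdd and other were co-partitioned.
    def union_groupby_join(rdd, other, numPartitions=None):
        vs = rdd.mapValues(lambda v: (1, v))
        ws = other.mapValues(lambda w: (2, w))
        def dispatch(seq):
            vbuf, wbuf = [], []
            for n, x in seq:
                if n == 1:
                    vbuf.append(x)
                else:
                    wbuf.append(x)
            return ((v, w) for v in vbuf for w in wbuf)
        return vs.union(ws).groupByKey(numPartitions).flatMapValues(dispatch)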