We should implement this using cogroup(); it will just require some bookkeeping to
map Python partitioners onto dummy Java ones so that Java Spark's cogroup()
operator respects Python's partitioning. I'm sure there are other subtleties,
particularly if we mix datasets that use different serializers / Java object
representations.
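
For reference, a minimal sketch (using only the public PySpark RDD API; the helper
name join_via_cogroup is mine, not Spark's) of what a cogroup-based inner join
could look like on the Python side:

# Sketch only: express an inner join through cogroup(), so the join reuses
# the shuffle / co-partitioning that cogroup already plans, instead of adding
# a separate groupByKey shuffle.
def join_via_cogroup(rdd1, rdd2, numPartitions=None):
    def emit_pairs(kv):
        key, (left_vals, right_vals) = kv
        left, right = list(left_vals), list(right_vals)
        return [(key, (l, r)) for l in left for r in right]
    return rdd1.cogroup(rdd2, numPartitions).flatMap(emit_pairs)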

There’s a longstanding JIRA to fix this: 
https://issues.apache.org/jira/browse/SPARK-655

On November 13, 2014 at 5:08:15 AM, 夏俊鸾 (xiajunl...@gmail.com) wrote:

Hi all  

I have noticed that the “join” operator is implemented with the union and
groupByKey operators instead of the cogroup operator in PySpark. This change
will probably generate more shuffle stages, for example:

rdd1 = sc.makeRDD(...).partitionBy(2)  
rdd2 = sc.makeRDD(...).partitionBy(2)  
rdd3 = rdd1.join(rdd2).collect()  

The above code implemented in Scala will generate 2 shuffles, but it will
generate 3 shuffles in Python. What was the initial design motivation for the join
operator in PySpark? Any ideas for improving join performance in PySpark?
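
For concreteness, that union + groupByKey pattern looks roughly like this (a
sketch only; the helper name and tag values are illustrative, not PySpark's
actual source):

# Tag each record with its source RDD, union the tagged RDDs, group by key,
# then pair up left and right values. groupByKey adds a shuffle on top of the
# two partitionBy shuffles above, which is where the third shuffle comes from.
def join_via_union_groupByKey(rdd1, rdd2, numPartitions=None):
    tagged = rdd1.mapValues(lambda v: (1, v)).union(
        rdd2.mapValues(lambda v: (2, v)))
    def dispatch(kv):
        key, vals = kv
        left = [v for tag, v in vals if tag == 1]
        right = [v for tag, v in vals if tag == 2]
        return [(key, (l, r)) for l in left for r in right]
    return tagged.groupByKey(numPartitions).flatMap(dispatch)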

Andrew  
