Hi all,

I have noticed that the “join” operator in PySpark is implemented with the union and groupByKey operators instead of the cogroup operator. This will probably generate an extra shuffle stage. For example:
rdd1 = sc.makeRDD(...).partitionBy(2)
rdd2 = sc.makeRDD(...).partitionBy(2)
rdd3 = rdd1.join(rdd2).collect()

The code above generates 2 shuffles when implemented in Scala, but 3 shuffles with PySpark (a rough sketch of the difference is below my signature). What was the initial design motivation for the join operator in PySpark? Are there any ideas for improving join performance in PySpark?

Andrew
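
For illustration, here is a minimal PySpark sketch contrasting the two styles of join. The sample data, app name, and the pair_up helper are made up, and the union + groupByKey version is only my understanding of that approach, not the actual PySpark source:

from pyspark import SparkContext

sc = SparkContext("local[2]", "join-shuffle-sketch")

# Two pair RDDs, each hash-partitioned into 2 partitions (shuffles 1 and 2).
rdd1 = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(2)
rdd2 = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(2)

# cogroup-based join (Scala style): cogroup the two co-partitioned RDDs,
# then emit one pair per combination of left and right values for each key.
joined_cogroup = rdd1.cogroup(rdd2).flatMapValues(
    lambda vs: [(v, w) for v in vs[0] for w in vs[1]])

# union + groupByKey based join: tag each value with its side, union the
# two RDDs (which drops the partitioner), then group everything by key.
tagged = rdd1.mapValues(lambda v: ("L", v)).union(
    rdd2.mapValues(lambda w: ("R", w)))

def pair_up(tagged_values):
    vals = list(tagged_values)           # materialize so we can scan twice
    left = [v for tag, v in vals if tag == "L"]
    right = [w for tag, w in vals if tag == "R"]
    return [(l, r) for l in left for r in right]

joined_union = tagged.groupByKey().flatMapValues(pair_up)

print(joined_cogroup.collect())
print(joined_union.collect())

As far as I can tell, the cogroup version can reuse the partitioning produced by the two partitionBy calls, while the union loses the partitioner and groupByKey has to shuffle the data a third time, which matches the 2 vs. 3 shuffle counts above.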