[ https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-5785.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

Issue resolved by pull request 4629
[https://github.com/apache/spark/pull/4629]

> Pyspark does not support narrow dependencies
> --------------------------------------------
>
>                 Key: SPARK-5785
>                 URL: https://issues.apache.org/jira/browse/SPARK-5785
>             Project: Spark
>          Issue Type: Improvement
>        Components: PySpark
>          Reporter: Imran Rashid
>           Fix For: 1.3.0
>
> Joins (and cogroups, etc.) are always considered to have "wide" dependencies in PySpark; they are never narrow. This can cause unnecessary shuffles. For example, this simple job should shuffle rddA and rddB once each, but it also does a third shuffle of the unioned data:
> {code}
> rddA = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
> rddB = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
> joined = rddA.join(rddB)
> joined.count()
>
> >>> rddA._partitionFunc == rddB._partitionFunc
> True
> {code}
> (Or the docs should somewhere explain that this feature is missing from PySpark.)
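For anyone checking whether their release has this behavior, here is a minimal sketch (not from the ticket) of how one might inspect the lineage of a join of two co-partitioned RDDs. It assumes a working PySpark installation; the application name is made up, and creating the SparkContext is only needed outside the PySpark shell, where `sc` already exists. `toDebugString()` and `getNumPartitions()` are standard RDD methods.

{code}
# Sketch only: assumes PySpark is installed; app name is illustrative.
from pyspark import SparkContext

sc = SparkContext(appName="narrow-join-check")  # hypothetical app name

# Co-partition both RDDs exactly as in the report above: same number of
# partitions and the same (default hash) partition function.
rddA = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
rddB = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)

joined = rddA.join(rddB)

# On releases with this fix (1.3.0+), the lineage should show only the two
# partitionBy shuffles; on earlier releases a third shuffle shows up for the
# join itself.
print(joined.toDebugString())
print("joined partitions:", joined.getNumPartitions())
{code}

Note that the join can only be narrow when both sides use the same number of partitions and the same partition function; if they differ, the extra shuffle is expected.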