[ https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Imran Rashid updated SPARK-5785:
Description:
Joins (cogroups, etc.) are always considered to have wide dependencies in
PySpark; they are never narrow. This can cause unnecessary shuffles. For
example, this simple job should shuffle rddA and rddB once each, but it also
does a third shuffle of the unioned data:
{code}
rddA = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
rddB = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
joined = rddA.join(rddB)
joined.count()
# both inputs use the same partitioner, so the join could be narrow:
rddA._partitionFunc == rddB._partitionFunc  # True
{code}
(Or the docs should explain somewhere that this feature is missing from
PySpark.)
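One way to see the extra stage (a sketch, assuming the PySpark shell with
joined defined as above; the exact lineage output varies by Spark version) is
to print the join's debug string, which marks shuffle boundaries:
{code}
# Print the lineage of the joined RDD. Each indentation step in the output
# is a shuffle boundary; with narrow-dependency support, a join of two
# co-partitioned RDDs would not introduce a third one.
print(joined.toDebugString())
{code}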
was:
Joins (cogroups, etc.) are always considered to have wide dependencies in
PySpark; they are never narrow. This can cause unnecessary shuffles. For
example, this simple job should shuffle rddA and rddB once each, but it also
does a third shuffle of the unioned data:
{code}
rddA = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
rddB = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
joined = rddA.join(rddB)
joined.count()
# both inputs use the same partitioner, so the join could be narrow:
rddA._partitionFunc == rddB._partitionFunc  # True
{code}
(Or the docs should explain somewhere that this feature is missing from Spark.)
Pyspark does not support narrow dependencies
Key: SPARK-5785
URL: https://issues.apache.org/jira/browse/SPARK-5785
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Imran Rashid