[jira] [Updated] (SPARK-5785) Pyspark does not support narrow dependencies

2015-02-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5785:
--
Assignee: Davies Liu

 Pyspark does not support narrow dependencies
 

 Key: SPARK-5785
 URL: https://issues.apache.org/jira/browse/SPARK-5785
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Imran Rashid
Assignee: Davies Liu
 Fix For: 1.3.0


 Joins (and cogroups, etc.) are always considered to have wide dependencies 
 in pyspark; they are never narrow.  This can cause unnecessary shuffles.  
 E.g., this simple job should shuffle rddA and rddB once each, but it will 
 also do a third shuffle of the unioned data:
 {code}
 rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
 rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
 joined = rddA.join(rddB)
 joined.count()
 >>> rddA._partitionFunc == rddB._partitionFunc
 True
 {code}
 (Alternatively, the docs should note somewhere that this feature is missing 
 from pyspark.)
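 A quick way to observe the extra stage (a minimal sketch, assuming the same 
 sc as above; shuffle boundaries show up as ShuffledRDD entries in the 
 lineage output, which newer PySpark versions may return as bytes):
 {code}
 # Build the same co-partitioned RDDs and join them, then print the lineage
 # of the result.  With narrow-dependency support, joining two RDDs that
 # share a partitioner would not introduce an additional shuffle stage here.
 rddA = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
 rddB = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
 joined = rddA.join(rddB)

 # toDebugString() describes the RDD's dependency graph; count the
 # ShuffledRDD entries to see how many shuffles the job performs.
 print(joined.toDebugString())
 {code}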



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5785) Pyspark does not support narrow dependencies

2015-02-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5785:

Description: 
Joins (and cogroups, etc.) are always considered to have wide dependencies in 
pyspark; they are never narrow.  This can cause unnecessary shuffles.  E.g., 
this simple job should shuffle rddA and rddB once each, but it will also do a 
third shuffle of the unioned data:

{code}
rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

joined = rddA.join(rddB)
joined.count()

>>> rddA._partitionFunc == rddB._partitionFunc
True
{code}


(Alternatively, the docs should note somewhere that this feature is missing 
from pyspark.)

  was:
Joins (and cogroups, etc.) are always considered to have wide dependencies in 
pyspark; they are never narrow.  This can cause unnecessary shuffles.  E.g., 
this simple job should shuffle rddA and rddB once each, but it will also do a 
third shuffle of the unioned data:

{code}
rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)

joined = rddA.join(rddB)
joined.count()

>>> rddA._partitionFunc == rddB._partitionFunc
True
{code}


(Alternatively, the docs should note somewhere that this feature is missing from spark.)


 Pyspark does not support narrow dependencies
 

 Key: SPARK-5785
 URL: https://issues.apache.org/jira/browse/SPARK-5785
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Imran Rashid

 Joins (and cogroups, etc.) are always considered to have wide dependencies 
 in pyspark; they are never narrow.  This can cause unnecessary shuffles.  
 E.g., this simple job should shuffle rddA and rddB once each, but it will 
 also do a third shuffle of the unioned data:
 {code}
 rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
 rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64)
 joined = rddA.join(rddB)
 joined.count()
 >>> rddA._partitionFunc == rddB._partitionFunc
 True
 {code}
 (Alternatively, the docs should note somewhere that this feature is missing 
 from pyspark.)
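 For reference, a rough sketch of the check a narrow-dependency-aware join 
 would need on the Python side (illustration only, not the actual change; 
 the copartitioned() helper is hypothetical):
 {code}
 # Hypothetical helper: two pair RDDs could be joined with narrow
 # dependencies when they share the same partition function and the same
 # number of partitions.
 def copartitioned(rdd_a, rdd_b):
     func_a = getattr(rdd_a, "_partitionFunc", None)
     func_b = getattr(rdd_b, "_partitionFunc", None)
     return (func_a is not None
             and func_a == func_b
             and rdd_a.getNumPartitions() == rdd_b.getNumPartitions())

 rddA = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)
 rddB = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(64)

 # Today join() shuffles the unioned data regardless; a True result here is
 # the condition that would let it skip that third shuffle.
 print(copartitioned(rddA, rddB))   # True
 {code}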


