[ https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093133#comment-14093133 ]
Apache Spark commented on SPARK-2790:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1894

> PySpark zip() doesn't work properly if RDDs have different serializers
> ----------------------------------------------------------------------
>
>                 Key: SPARK-2790
>                 URL: https://issues.apache.org/jira/browse/SPARK-2790
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Josh Rosen
>            Assignee: Davies Liu
>            Priority: Critical
>
> In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have
> different serializers (e.g. batched vs. unbatched), even if those RDDs have
> the same number of partitions and same numbers of elements. This problem
> occurs in the MLlib Python APIs, where we might want to zip a JavaRDD of
> LabelledPoints with a JavaRDD of batch-serialized Python objects.
> This is problematic because whether zip() succeeds or errors depends on the
> partitioning / batching strategy, and we don't want to surface the
> serialization details to users.
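
The mismatch described in the issue can be illustrated without Spark. The sketch below is plain Python and is not PySpark's implementation or the change in the linked pull request; the helper names (`serialize_unbatched`, `serialize_batched`, `load_frames`, `load_batched`) and the sample data are hypothetical. It assumes that "unbatched" means one pickled frame per element and "batched" means one pickled frame per list of elements, and shows why pairing the two streams frame by frame goes wrong even though both hold the same number of elements.

```python
import pickle
from itertools import chain


def serialize_unbatched(items):
    """Hypothetical helper: one pickled frame per element."""
    return [pickle.dumps(x) for x in items]


def serialize_batched(items, batch_size):
    """Hypothetical helper: one pickled frame per list of up to batch_size elements."""
    return [pickle.dumps(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


def load_frames(frames):
    """Deserialize each frame as-is, without knowing whether it is a batch."""
    return [pickle.loads(f) for f in frames]


def load_batched(frames):
    """Deserialize frames that hold lists and flatten them back into elements."""
    return list(chain.from_iterable(pickle.loads(f) for f in frames))


labels = [0, 1, 0, 1]
features = ["a", "b", "c", "d"]

unbatched = serialize_unbatched(labels)    # 4 frames, one per element
batched = serialize_batched(features, 2)   # 2 frames, two elements each

# Pairing frame by frame: the frame counts differ, so elements are paired
# with whole batches and the trailing labels are silently dropped.
naive = list(zip(load_frames(unbatched), load_frames(batched)))

# Unbatching the batched side first restores element-by-element pairing.
fixed = list(zip(load_frames(unbatched), load_batched(batched)))
```

Here `naive` pairs each of the first two labels with a two-element list, while `fixed` yields the four expected (label, feature) pairs. Whether the frame-level pairing fails loudly or quietly depends on the batch size, which is the kind of serialization detail the issue says should not be surfaced to users.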