[ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017812#comment-14017812
 ] 

Kan Zhang commented on SPARK-1817:
----------------------------------

There are two issues related to this bug. The first is that we partition 
numeric ranges (e.g., Long and Double ranges) differently from other types of 
sequences (i.e., at different indexes). Because zip works partition by 
partition, and the partitions of a numeric range may differ in size from those 
of another sequence (even when the total length and the number of partitions 
are the same), elements are dropped when zipping with a numeric range. This is 
fixed in SPARK-1837. One caveat: partitioning Double ranges still doesn't work 
properly, due to a Scala bug that breaks {{take}} and {{drop}} on Double ranges 
(https://issues.scala-lang.org/browse/SI-8518).
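
To illustrate the first issue without Spark, here is a minimal sketch in plain 
Scala. The two slicing schemes are assumptions made for illustration 
(fixed-size chunks rounded up for the numeric range, proportional index 
boundaries for the other sequence), and the names {{sliceByChunk}}, 
{{sliceByPosition}} and {{zipByPartition}} are hypothetical, not Spark APIs.

```scala
object ZipSliceSketch {
  // Assumed scheme for numeric ranges: fixed-size chunks, size rounded up,
  // so trailing partitions may be empty.
  def sliceByChunk[T](xs: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    val size = (xs.length + numSlices - 1) / numSlices
    (0 until numSlices).map(i => xs.slice(i * size, (i + 1) * size))
  }

  // Assumed scheme for other sequences: boundaries at proportional indexes.
  def sliceByPosition[T](xs: Seq[T], numSlices: Int): Seq[Seq[T]] =
    (0 until numSlices).map { i =>
      val start = ((i * xs.length.toLong) / numSlices).toInt
      val end = (((i + 1) * xs.length.toLong) / numSlices).toInt
      xs.slice(start, end)
    }

  // Zip partition by partition; Seq.zip silently truncates to the shorter
  // side, which is where elements get lost.
  def zipByPartition[A, B](left: Seq[Seq[A]], right: Seq[Seq[B]]): Seq[(A, B)] =
    left.zip(right).flatMap { case (l, r) => l.zip(r) }
}
```

With 4 slices, {{sliceByChunk(Seq(1L, 2L), 4)}} gives partitions of sizes 
1, 1, 0, 0, while {{sliceByPosition(Seq(11, 12), 4)}} gives sizes 0, 1, 0, 1. 
Zipping them by partition leaves only the pair (2,11), which matches the 
result in the report.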

The second issue is that, instead of dropping elements silently, we should 
throw an error during zipping when we find that partition sizes differ between 
the two sequences. This is fixed by https://github.com/apache/spark/pull/944
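
As a rough sketch of that behavior (not the code in the pull request), a 
partition-wise zip can compare sizes before pairing elements and fail loudly 
on a mismatch. The name {{zipStrict}}, the exception type and the messages 
below are hypothetical, chosen only for illustration.

```scala
object StrictZipSketch {
  // Zip partition by partition, but refuse to truncate: any size mismatch
  // raises an error instead of silently dropping elements.
  def zipStrict[A, B](left: Seq[Seq[A]], right: Seq[Seq[B]]): Seq[(A, B)] = {
    require(left.length == right.length,
      s"number of partitions differs: ${left.length} vs ${right.length}")
    left.zip(right).zipWithIndex.flatMap { case ((l, r), i) =>
      if (l.length != r.length)
        throw new IllegalArgumentException(
          s"partition $i has ${l.length} elements on the left, ${r.length} on the right")
      l.zip(r)
    }
  }
}
```

Given partitions of sizes 1, 1 on one side and 0, 1 on the other, this throws 
at partition 0 rather than returning a shortened result.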

> RDD zip erroneous when partitions do not divide RDD count
> ---------------------------------------------------------
>
>                 Key: SPARK-1817
>                 URL: https://issues.apache.org/jira/browse/SPARK-1817
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0, 1.0.0
>            Reporter: Michael Malak
>            Assignee: Kan Zhang
>             Fix For: 1.1.0
>
>
> Example:
> scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
> res1: Array[(Long, Int)] = Array((2,11))
> But more generally, it's whenever the number of partitions does not evenly 
> divide the total number of elements in the RDD.
> See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



