I've discovered that this was already noticed a year ago. RDD zip() does not work when the number of partitions does not evenly divide the total number of elements in the RDD:
https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ

I will enter a JIRA ticket just as soon as the ASF Jira system will let me reset my password.

On Sunday, May 11, 2014 4:40 AM, Michael Malak <michaelma...@yahoo.com> wrote:

Is this a bug?

scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect
res0: Array[(Int, Int)] = Array((1,11), (2,12))

scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
res1: Array[(Long, Int)] = Array((2,11))
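One way to see how a result like res1 could arise is a toy model in plain Scala (no Spark needed). It assumes the Int range and the Long range get split across the 4 partitions by two different rules. Both rules below (`sliceByPosition`, `sliceByChunk`) and the `zipPartitions` helper are hypothetical names made up for this sketch. Nothing in this thread confirms that Spark slices its input this way, so treat it as an illustration of the failure mode rather than a diagnosis.

```scala
// Toy model of a partition-wise zip. All names here are hypothetical;
// this is not Spark code and does not call any Spark API.
object ZipSketch {
  // Assumed rule A: partition i holds elements [i*n/k, (i+1)*n/k).
  def sliceByPosition[T](xs: Seq[T], numSlices: Int): Seq[Seq[T]] =
    (0 until numSlices).map { i =>
      val start = ((i * xs.length.toLong) / numSlices).toInt
      val end = (((i + 1) * xs.length.toLong) / numSlices).toInt
      xs.slice(start, end)
    }

  // Assumed rule B: fixed-size chunks of ceil(n / k) elements,
  // so the leading partitions fill up first.
  def sliceByChunk[T](xs: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    val size = math.max(1, (xs.length + numSlices - 1) / numSlices)
    (0 until numSlices).map(i => xs.slice(i * size, (i + 1) * size))
  }

  // Pair partition i on the left with partition i on the right,
  // then zip element-wise inside each pair of partitions.
  def zipPartitions[A, B](l: Seq[Seq[A]], r: Seq[Seq[B]]): Seq[(A, B)] =
    l.zip(r).flatMap { case (a, b) => a.zip(b) }

  def main(args: Array[String]): Unit = {
    val longs = Seq(1L, 2L)
    val ints = Seq(11, 12)
    // Same rule on both sides: the partitions line up.
    println(zipPartitions(sliceByPosition(longs, 4), sliceByPosition(ints, 4)))
    // Different rules: the partitions no longer line up.
    println(zipPartitions(sliceByChunk(longs, 4), sliceByPosition(ints, 4)))
  }
}
```

With the same rule on both sides the toy zip yields the pairs (1,11) and (2,12), as in res0. With mismatched rules it yields only (2,11), as in res1, because the non-empty partitions on the two sides only overlap at one index.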