? The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3]) y = sc.parallelize([1, 2, 3]) z = x.distinct() print x.zip(y).collect() The error that is produced depends on whether multiple partitions have been specified or not. I understand that the two RDDs [must] have the same number of partitions and the same number of elements in each partition. What is the best way to work around this restriction? I have been performing the operation with the following code, but I am hoping to find something more efficient. def safe_zip(left, right): ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0])) ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0])) return ix_left.join(ix_right).sortByKey().values()