Consider the following simple zip:

    n = 6
    a = sc.parallelize(range(n))
    b = sc.parallelize(range(n)).map(lambda j: j)
    c = a.zip(b)
    print a.count(), b.count(), c.count()
    >> 6 6 4

By varying n, I find that c.count() is always min(n, 4), where 4 happens to be the number of threads on my machine. Calling c.collect() shows the RDD has simply been truncated to its first 4 entries. Oddly, this doesn't happen without calling map on b. Any ideas?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-in-pyspark-truncates-RDD-to-number-of-processors-tp8069.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.