I couldn't reproduce your issue locally, but I suspect it has something to
do with partitioning. zip() works partition by partition: it assumes the two
RDDs have the same number of partitions and the same number of elements in
each partition. By default, map() doesn't preserve partitioning. Try setting
preservesPartitioning to True and see if the problem persists.


On Sat, Jun 21, 2014 at 9:37 AM, madeleine <madeleine.ud...@gmail.com>
wrote:

> Consider the following simple zip:
>
> n = 6
> a = sc.parallelize(range(n))
> b = sc.parallelize(range(n)).map(lambda j: j)
> c = a.zip(b)
> print a.count(), b.count(), c.count()
>
> >> 6 6 4
>
> By varying n, I find that c.count() is always min(n,4), where 4 happens to
> be the number of threads on my computer. By calling c.collect(), I see the
> RDD has simply been truncated to the first 4 entries. Weirdly, this doesn't
> happen without calling map on b.
>
> Any ideas?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/zip-in-pyspark-truncates-RDD-to-number-of-processors-tp8069.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>