Hi,

I am on Spark 0.9.0

I have a 2-node cluster (2 worker nodes) with 16 cores on each node (so 32
cores in the cluster).
I have an input RDD with 64 partitions.

I am running "rdd.mapPartitions(...).reduce(...)"
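
For concreteness, here is a minimal self-contained sketch of the kind of
job I mean (the combine function and the parallelized input are just
placeholders for my real logic):

import org.apache.spark.SparkContext

object ParallelReduceSketch {
  // Placeholder associative, commutative combine op.
  def combine(a: Long, b: Long): Long = a + b

  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "parallel-reduce-sketch")
    // 64 partitions, as in my real input.
    val rdd = sc.parallelize(1L to 1000000L, 64)

    // mapPartitions runs one task per partition, so all cores stay busy.
    // reduce() also folds within each partition in parallel tasks, but the
    // 64 per-partition results are then merged one at a time.
    val result = rdd
      .mapPartitions(iter => Iterator(iter.reduce(combine)))
      .reduce(combine)

    println(result)
    sc.stop()
  }
}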

I can see that I get full parallelism on the map side (all 32 cores are
busy simultaneously). However, when it comes to reduce(), the outputs of the
mappers are all reduced SERIALLY, and all of that reduce processing happens
on only one of the workers.

I was expecting the outputs of the 16 mappers on node 1 to be reduced in
parallel on node 1, the outputs of the 16 mappers on node 2 to be reduced in
parallel on node 2, and then one final inter-node reduce (of node 1's
reduced result and node 2's reduced result).
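
As a workaround, would something like the following hand-rolled two-level
reduce make sense? This is only a sketch, reusing the placeholder rdd and
combine from above (I gather much later Spark releases added RDD.treeReduce
for exactly this, but it is not in 0.9.0):

// Reduce each partition locally first, then shrink to roughly one
// partition per node and reduce those in parallel, so the final merge
// only has two values left to combine.
val partial  = rdd.mapPartitions(iter => Iterator(iter.reduce(combine)))
val twoLevel = partial
  .coalesce(2)  // roughly one partition per worker node
  .mapPartitions(iter => Iterator(iter.reduce(combine)))
  .reduce(combine)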

Isn't parallel reduce supported WITHIN a key (in this case I have no key)?
(I know that there is parallelism in reduce across keys.)
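
For comparison, the keyed path I am referring to (where combining does
happen in parallel on the map side before the shuffle) looks like this; the
keys are hypothetical and combine is the same placeholder op as above:

import org.apache.spark.SparkContext._  // PairRDDFunctions (reduceByKey) in 0.9

// reduceByKey combines values for each key locally on every node first
// (map-side combining), then merges the per-node partials after the
// shuffle, so no single machine reduces everything serially.
val keyed  = rdd.map(x => (x % 4, x))  // 4 hypothetical keys
val perKey = keyed.reduceByKey(combine)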

Best Regards
Anand


