Hi, I am on Spark 0.9.0.

I have a 2-node cluster (2 worker nodes) with 16 cores on each node, so 32 cores in the cluster. I have an input RDD with 64 partitions, and I am running "rdd.mapPartitions(...).reduce(...)" (mapPartitions is called on the RDD, not on the SparkContext).

I can see that I get full parallelism in the map stage: all 32 of my cores are busy simultaneously. However, when it comes to reduce(), the outputs of the mappers are all reduced SERIALLY, and all of the reduce processing happens on only one of the workers.

I was expecting that the outputs of the 16 mappers on node 1 would be reduced in parallel on node 1, the outputs of the 16 mappers on node 2 would be reduced in parallel on node 2, and there would then be one final inter-node reduce combining the two per-node results.

Isn't parallel reduce supported WITHIN a key (in this case I have no key)? (I know that there is parallelism in reduce across keys.)

Best Regards,
Anand

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallel-Reduce-within-a-key-tp7998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
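P.S. To make the reduction shape I expected concrete, here is a small local Python sketch (the names and the combine function are hypothetical placeholders, not my actual job): each node first reduces its own mapper outputs, and only the two per-node results are combined in a final step.

```python
from functools import reduce

def combine(a, b):
    # placeholder for an associative, commutative reduce function
    return a + b

# hypothetical mapper outputs: 16 partition results per node, 2 nodes
node_outputs = [list(range(16)), list(range(16, 32))]

# level 1: each node reduces its own mapper outputs
# (these two reductions are independent and could run in parallel, one per node)
per_node = [reduce(combine, outputs) for outputs in node_outputs]

# level 2: one final inter-node reduce of the two per-node results
final = reduce(combine, per_node)

print(final)  # 496, same result as reducing all 32 values serially
```

The two level-1 reductions touch disjoint data, which is why I expected Spark to run them concurrently on their respective nodes before the single inter-node merge.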