Hi,

I've noticed in a few cases that when I call a reduce operation, it is
executed sequentially rather than in parallel.

For example, I have the following code that performs simple word counting
on very large data using per-partition hashmaps (instead of (word, 1)
pairs, which would overflow memory at shuffle time):

import scala.collection.mutable.HashMap

rdd.mapPartitions { iter =>
     // build one (word -> count) map per partition
     val counts = new HashMap[String, Int]
     // fill counts with (word, count) values from iter
     Iterator(counts)
   }
   .reduce { (h1, h2) =>
     // merge h2 into h1, summing counts for keys present in both
     for ((key, value) <- h2) {
       h1(key) = h1.getOrElse(key, 0) + value
     }
     h2.clear() // release the merged-in map
     h1
   }


After all the mappers are done, the process becomes single-threaded: the
per-partition results are merged one after another. This is very
time-inefficient, and I don't understand why the reduce operation is not
executed in parallel as I expected.
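For reference, the merge step in isolation (plain Scala, no Spark; the sample words and counts below are just made up for illustration) does what I intend, so I believe the problem is in how the reduce is scheduled, not in the merge logic itself:

```scala
import scala.collection.mutable.HashMap

// Merge h2 into h1, summing counts for words present in both maps.
// This is the same combine step the reduce above performs.
def merge(h1: HashMap[String, Int], h2: HashMap[String, Int]): HashMap[String, Int] = {
  for ((word, count) <- h2) {
    h1(word) = h1.getOrElse(word, 0) + count
  }
  h1
}

val a = HashMap("spark" -> 2, "scala" -> 1)
val b = HashMap("spark" -> 3, "rdd" -> 4)
val merged = merge(a, b)
// merged now holds spark -> 5, scala -> 1, rdd -> 4
```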

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reducer-doesn-t-operate-in-parallel-tp26389.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

