Optimize the shuffle phase (increase the parallelism)
-----------------------------------------------------

                 Key: HADOOP-1043
                 URL: https://issues.apache.org/jira/browse/HADOOP-1043
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: Devaraj Das
         Assigned To: Devaraj Das


In the current shuffle code, a Reduce fetches at most one map output from any 
given map output location (node) at a time. For example, if a particular node, 
say machine1.foo.com, ran 300 maps, the reducer would fetch just one output 
from there at a time. machine1.foo.com is inserted into a Set data structure 
(uniqueHosts), and until it is removed from there, no other map output is 
fetched from that machine. Fetching only one map output at a time from any 
particular host seems fine, but the logic for removing a node from uniqueHosts 
is such that there can be a long delay before the node is deleted from the Set 
(even after the map output has been fetched from that node). This probably 
leads to suboptimal performance, since it reduces the parallelism in fetching.
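
To illustrate the idea, here is a minimal sketch (not the actual shuffle code) 
of a per-host gate that keeps the one-fetch-per-host rule but removes the host 
from the set as soon as its fetch finishes. The class HostGate and its methods 
are hypothetical names made up for this example; only the role of the set 
(similar to uniqueHosts) comes from the description above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: not taken from the Hadoop source.
class HostGate {
    // Hosts with a fetch currently in flight (plays the role of uniqueHosts).
    private final Set<String> busyHosts = new HashSet<>();

    // Claim a host; returns false if a fetch from it is already in flight.
    synchronized boolean tryAcquire(String host) {
        return busyHosts.add(host);
    }

    // Free the host so that its next map output can be scheduled.
    synchronized void release(String host) {
        busyHosts.remove(host);
    }

    synchronized boolean isBusy(String host) {
        return busyHosts.contains(host);
    }

    // Run one fetch under the gate. The host is released immediately after
    // the fetch returns or throws, so it never lingers in the set.
    boolean fetchGuarded(String host, Runnable fetch) {
        if (!tryAcquire(host)) {
            return false; // caller should retry this host later
        }
        try {
            fetch.run();
            return true;
        } finally {
            release(host);
        }
    }
}
```

With something like this, a copier thread would call 
gate.fetchGuarded("machine1.foo.com", fetchOneOutput) for each pending map 
output, where fetchOneOutput is a placeholder for the real copy; the next 
output from the same host becomes eligible as soon as the previous call 
returns.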

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
