Re: Shuffle phase: fine-grained control of data flow

Harsh J Wed, 07 Nov 2012 06:06:57 -0800

Hi Jiwei,

In trunk (i.e. MR2), the logic that selects and schedules map
completion events lives in the EventFetcher class's
getMapCompletionEvents() method, viewable at:
http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java?view=markup

The Shuffle class (in the reduce package) runs this EventFetcher
thread to keep picking up map completion events while the shuffle is
in progress. The Shuffle class is in turn used by the ReduceTask class
(look in the mapred package of the same Maven module).

I guess you can start there, to see if a better selection+scheduling
logic would yield better results.
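
To give a rough idea of what an alternative selection logic might look
like, here is a minimal standalone sketch that orders completed map
outputs by how close they are to the reducer (same host, then same
rack, then off-rack). MapOutputLocation, LocalityAwareOrdering and the
three-level distance are hypothetical names made up for illustration;
they are not Hadoop classes and are not what EventFetcher does today.
Note that this only changes the order in which outputs are fetched,
not where the reducer itself runs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one completed map output; not a Hadoop class.
final class MapOutputLocation {
    final String mapId;
    final String host;
    final String rack;

    MapOutputLocation(String mapId, String host, String rack) {
        this.mapId = mapId;
        this.host = host;
        this.rack = rack;
    }
}

// Hypothetical helper: sorts map outputs so the closest ones come first.
final class LocalityAwareOrdering {

    // Assumed distance model: 0 = same host, 1 = same rack, 2 = off-rack.
    static int distance(String reducerHost, String reducerRack,
                        MapOutputLocation loc) {
        if (loc.host.equals(reducerHost)) {
            return 0;
        }
        if (loc.rack.equals(reducerRack)) {
            return 1;
        }
        return 2;
    }

    // Returns a new list ordered by distance. List.sort is stable, so
    // outputs at the same distance keep their original completion order.
    static List<MapOutputLocation> order(String reducerHost,
                                         String reducerRack,
                                         List<MapOutputLocation> completed) {
        List<MapOutputLocation> sorted = new ArrayList<>(completed);
        sorted.sort(Comparator.comparingInt(
            (MapOutputLocation loc) ->
                distance(reducerHost, reducerRack, loc)));
        return sorted;
    }
}
```

A real change would have to plug something like this into the existing
event handling and take the actual network topology into account,
which this sketch does not attempt.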

On Wed, Nov 7, 2012 at 12:26 PM, Jiwei Li <cxm...@gmail.com> wrote:
> Dear all,
>
> For jobs like Sort, massive amounts of network traffic happen during
> shuffle phase. The simple mechanism in Hadoop 1.0.4 to choose reduce nodes
> does not help reduce network traffic. If JobTracker is fully aware of
> locations of every map output, why not take advantage of this topology
> knowledge?
>
> So, does anyone know where in the code such logic could be developed? Many thanks.
>
> Regards.
> --
> Jiwei

-- 
Harsh J
