To understand how the load is distributed, you can start by reading about
the different types of Hadoop schedulers. I have not yet studied other
Hadoop-like implementations; however, a very simplified version of the
distribution concept is the following:
a) The TaskTracker asks for work (its heartbeat carries the status of the
worker node, including the number of free slots)
b) The JobTracker picks a job from a list that is sorted according to the
configured policy (fair scheduling, FIFO, LIFO, or another SLA-based policy)
c) The TaskTracker executes the map/reduce tasks it was assigned
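
Steps a) to c) could be sketched roughly like this in Python. It is only a
toy model under my own assumptions, not Hadoop's code: ToyJobTracker, Job and
the heartbeat() signature are made-up names, and the only policy shown is FIFO.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    pending_tasks: deque = field(default_factory=deque)


class ToyJobTracker:
    """Toy FIFO scheduler; hypothetical, not Hadoop's JobTracker API."""

    def __init__(self):
        self.jobs = []  # kept in submission order, i.e. a FIFO policy

    def submit(self, job):
        self.jobs.append(job)

    def heartbeat(self, tracker_name, free_slots):
        """Return up to free_slots (job name, task id) pairs for one worker."""
        assigned = []
        for job in self.jobs:  # b) walk the jobs in policy order
            while job.pending_tasks and len(assigned) < free_slots:
                assigned.append((job.name, job.pending_tasks.popleft()))
            if len(assigned) == free_slots:
                break
        # forget jobs that have no tasks left to hand out
        self.jobs = [j for j in self.jobs if j.pending_tasks]
        return assigned


jt = ToyJobTracker()
jt.submit(Job("wordcount", deque(["map-0", "map-1", "map-2"])))
jt.submit(Job("sort", deque(["map-0"])))

# a) a worker reports 2 free slots, then another one reports 3
print(jt.heartbeat("tracker-1", 2))
print(jt.heartbeat("tracker-2", 3))
```

The worker would then run whatever it got back (step c). A different policy
would only change the order in which self.jobs is walked.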
As mentioned before, there are a lot more details. In b) there is an
implementation of delay scheduling, which improves throughput by taking the
location of the picked job's input data into account. There is also a
preemption mechanism that regulates fairness between pools, etc.
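
To give an idea of delay scheduling, here is a minimal Python sketch as I
understand the concept: a job that has no node-local task for the free slot
is skipped a limited number of times before it is allowed to run a non-local
task. The names (Job, pick_task, max_skips) and the block-location data are
hypothetical, and the real fair scheduler is more involved than this.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    # task id -> set of nodes holding that task's input block (made-up data)
    tasks: dict = field(default_factory=dict)
    skipped: int = 0  # scheduling opportunities this job has passed up


def pick_task(jobs, node, max_skips):
    """Pick one (job name, task id) for a free slot on `node`, or None."""
    for job in jobs:  # assumed to be already sorted by the scheduling policy
        if not job.tasks:
            continue
        local = [t for t, nodes in job.tasks.items() if node in nodes]
        if local:
            job.skipped = 0
            task = local[0]
        elif job.skipped >= max_skips:
            # the job has waited long enough, accept a non-local task
            task = next(iter(job.tasks))
        else:
            job.skipped += 1  # let a later job use this slot instead
            continue
        del job.tasks[task]
        return job.name, task
    return None


jobs = [Job("A", {"a0": {"node1"}}), Job("B", {"b0": {"node2"}})]
print(pick_task(jobs, "node2", max_skips=1))  # A is skipped once, B runs locally
print(pick_task(jobs, "node2", max_skips=1))  # A has waited, runs non-locally
```

The trade-off is between max_skips (how long a job may wait for a local slot)
and how strictly the ordering from the policy is followed.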
A good start is the book that Prashant mentioned.
On 23 April 2012 23:49, Prashant Kommireddi wrote:
> Shailesh, there's a lot that goes into distributing work across
> tasks/nodes. It's not just distributing work but also fault tolerance,
> data locality, etc. that come into play. It might be good to refer to the
> Apache Hadoop docs or Tom White's "Hadoop: The Definitive Guide".
>
> On Apr 23, 2012, at 11:03 AM, Shailesh Samudrala
> wrote:
>
> > Hello,
> >
> > I am trying to design my own MapReduce Implementation and I want to know
> > how hadoop is able to distribute its workload across multiple computers.
> > Can anyone shed more light on this? thanks!
>