Well, the problem is pretty basic.

Take your typical 1 GbE switch with 42 ports.
Each port can do 1 Gb/s in each direction across the switch's fabric.
Depending on your hardware, that's a fabric of roughly 40 Gb/s, shared.

Depending on your hardware, you are usually using 1 or maybe 2 of those ports to 
'trunk' to your network's backplane. (To keep this simple, let's just say it's a 
1-2 Gb/s 'trunk' to your next rack.)
So you end up with 1 Gb/s of traffic from each node trying to reach a node on 
the next rack. If that's 20 nodes per rack and they all want to communicate... 
you end up with 20 Gb/s (each direction) trying to fit through a 1-2 Gb/s pipe.
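
To put rough numbers on it, here's a back-of-the-envelope sketch (Python; the 
function name and figures are mine and purely illustrative, not from any real 
deployment):

    # Worst-case inter-rack demand divided by available uplink capacity.
    def oversubscription_ratio(nodes_per_rack, node_link_gbps, uplink_gbps):
        # Assume every node tries to send off-rack at line rate at once.
        demand_gbps = nodes_per_rack * node_link_gbps
        return demand_gbps / uplink_gbps

    # 20 nodes with 1 GbE NICs behind a 2 x 1 GbE trunk:
    print(oversubscription_ratio(20, 1, 2))  # 10.0 -> each node sees ~1/10 of its NIC speed off-rack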

Think of rush hour in Chicago, or worse, rush hour in Atlanta, where people 
don't know how to drive. :-P

The quick fix... spend the $8-10K per switch to get a ToR switch that has 10+ GbE 
uplink capability (usually 4 ports). Then you have at least 10 Gb/s per rack.
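
Plugging the upgraded uplinks into the same hypothetical sketch:

    print(oversubscription_ratio(20, 1, 40))  # 4 x 10 GbE uplinks -> 0.5, the trunk is no longer the choke point
    print(oversubscription_ratio(20, 1, 10))  # even a single 10 GbE uplink -> 2.0, far better than 10:1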

JMHO

-Mike



> To: common-user@hadoop.apache.org
> Subject: Re: Why inter-rack communication in mapreduce slow?
> Date: Mon, 6 Jun 2011 11:00:05 -0400
> From: dar...@ontrenet.com
> 
> 
> IMO, that's right, because map/reduce/hadoop was originally designed for
> that kind of text-processing purpose (i.e. few stages, low dependency,
> highly parallel).
> 
> It's when one tries to solve general-purpose algorithms of modest
> complexity that map/reduce gets into I/O churning problems.
> 
> On Mon, 6 Jun 2011 23:58:53 +1000, elton sky <eltonsky9...@gmail.com> wrote:
> > Hi John,
> > 
> > Because for map tasks, the job tracker tries to assign them to local
> > data nodes, so there's not much n/w traffic.
> > Then the only potential issue will be, as you said, the reducers, which
> > copy data from all maps.
> > So in other words, if the application only creates small intermediate
> > output, e.g. grep, wordcount, this jam between racks is not likely to
> > happen, is it?
> > 
> > 
> > On Mon, Jun 6, 2011 at 11:40 PM, John Armstrong
> > <john.armstr...@ccri.com> wrote:
> > 
> >> On Mon, 06 Jun 2011 09:34:56 -0400, <dar...@ontrenet.com> wrote:
> >> > Yeah, that's a good point.
> >> >
> >> > I wonder, though, what the load on the tracker nodes (ports et al.)
> >> > would be if an inter-rack fiber switch at tens of Gb/s is getting
> >> > maxed.
> >> >
> >> > Seems to me that if there is that much traffic being moved across
> >> > racks, the tracker node (or whatever node it is) would overload
> >> > first?
> >>
> >> It could happen, but I don't think it always would.  For example, the
> >> tracker is on rack A; sees that the best place to put reducer R is on
> >> rack B; sees the reducer still needs a few hellabytes from mapper M on
> >> rack C; tells M to send data to R; switches on B and C get throttled,
> >> leaving A free to handle other things.
> >>
> >> In fact, it almost makes me wonder if an ideal setup is not only to
> >> have each of the main control daemons on their own nodes, but to put
> >> THOSE nodes on their own rack and keep all the data elsewhere.
> >>