Stephen,
I highly recommend you get switches that have a few 10GigE ports on
them for the inter-rack links.
Say you have 40 servers/rack: get top-of-rack switches with forty
1GigE ports and two 10GigE ports. That way half the servers in a rack
can talk to the other rack at full line rate without a bottleneck
(40 x 1Gbps is 40Gbps of potential traffic, and if half of that
crosses racks, two 10GigE uplinks cover the 20Gbps you need). The
prices for such switches have been dropping significantly in recent
months (you should be able to get them for sub-$10K). Take a look at
the switches from Arista Networks (www.aristanetworks.com) for
competitive pricing.
Cheers,
-- amr
Todd Lipcon wrote:
Hi Stephen,
The true answer depends on the types of jobs you're running. As a
back-of-the-envelope calculation, I might figure something like this:
- 60 nodes total across 2 racks = 30 nodes per rack.
- Each node might process about 100MB/sec of data.
- For a sort job, where the intermediate data is the same size as the
input data, each node needs to shuffle 100MB/sec of data.
- In aggregate, each rack is then producing about 3GB/sec of data.
- With reducers spread evenly across the racks, each rack will need to
send 1.5GB/sec to reducers running on the other rack.
- Since the links are full duplex, each rack can send its 1.5GB/sec
while receiving the other rack's 1.5GB/sec on the same links, so you
need 1.5GB/sec of bisection bandwidth for this theoretical job. That's
12Gbps.
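As a sanity check, here's that same arithmetic as a quick Java snippet
(the node count and the 100MB/sec figure are just the assumptions
above, not measurements from a real cluster):

  // Back-of-envelope cross-rack shuffle estimate for a sort-style job.
  public class BisectionEstimate {
    public static void main(String[] args) {
      int nodesPerRack = 30;           // 60 nodes across 2 racks
      double mbPerSec = 100.0;         // assumed shuffle output per node
      double perRackMBps = nodesPerRack * mbPerSec;  // 3000 MB/sec per rack
      double crossRackMBps = perRackMBps / 2;        // half the reducers are remote
      double gbps = crossRackMBps * 8 / 1000.0;      // MB/sec -> Gbps
      System.out.printf("Cross-rack: %.0f MB/sec = %.0f Gbps%n",
                        crossRackMBps, gbps);
    }
  }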
However, the above calculation is probably an upper bound. A large
number of jobs see significant data reduction during the map phase,
either from filtering/selection in the Mapper itself or from good use
of Combiners. Additionally, intermediate data compression can cut the
intermediate transfer by a significant factor. Lastly, although your
disks can probably provide 100MB/sec of sustained throughput, it's rare
to see an MR job that can sustain disk-speed I/O through the entire
pipeline. So I'd say my estimate is at least a factor of 2 too high.
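If you want to chase those reductions, the knobs are easy to flip.
Here's a minimal sketch against the 0.20-era JobConf API; the class
name is made up, LongSumReducer is just one example of a combiner, and
whether a combiner is valid at all depends on your reduce function:

  import org.apache.hadoop.io.compress.DefaultCodec;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.LongSumReducer;

  public class ShuffleTuning {
    public static void tune(JobConf conf) {
      // Pre-aggregate map output before the shuffle. Only correct when
      // the reduce function is associative and commutative (e.g. counting).
      conf.setCombinerClass(LongSumReducer.class);
      // Compress intermediate (map output) data before it crosses the wire.
      conf.setCompressMapOutput(true);
      conf.setMapOutputCompressorClass(DefaultCodec.class);
    }
  }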
So, the simple answer is that 4-6Gbps is most likely just fine for most
practical jobs. If you want to be extra safe, many inexpensive switches can
operate in a "stacked" configuration where the bandwidth between them is
essentially backplane speed. That should scale you to 96 nodes (two full
48-port switches) with plenty of headroom.
-Todd
On Tue, May 26, 2009 at 3:10 AM, stephen mulcahy
<stephen.mulc...@deri.org> wrote:
Hi,
Has anyone here investigated what level of bisection bandwidth is needed
for a Hadoop cluster which spans more than one rack?
I'm currently sizing and planning a new Hadoop cluster and I'm wondering
what the performance implications will be if we end up with a cluster spread
across two racks. I'd expect we'll have one 48-port gigabit switch in each
42u rack. If we end up with 60 systems spread across these two switches -
how much bandwidth should I have between the racks?
I'll have 6 gigabit ports available for links between racks - i.e. up to 6
Gbps. Would this be sufficient bisection bandwidth for Hadoop or should I be
considering increased bandwidth between racks (maybe using fibre links
between the switches or introducing another switch)?
Thanks for any thoughts on this.
-stephen
--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com