Re: Why inter-rack communication in mapreduce slow?

Joey Echeverria Mon, 06 Jun 2011 06:09:16 -0700

Larger Hadoop installations are densely packed, with 20-40 nodes per rack.
At that density, with multiple racks, it becomes expensive to buy a
single switch with enough capacity for all of the nodes in all of the
racks. The typical solution is to install one switch per rack, with
uplinks to a core switch that routes between the racks. In this
arrangement, inter-rack communication is limited by the uplink
bandwidth to the core switch. Typically these uplinks are 10-20
Gbps (bidirectional).


Assuming you have 32 nodes in a rack with 1 Gbps links, the nodes can
generate up to 32 Gbps between them, so 20 Gbps isn't enough bandwidth
to push all of those ports at full tilt between racks. That's why
Hadoop has the ability to take advantage of rack locality. It will try
to schedule I/O local to a rack, where it's less likely to block.
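
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. It assumes the worst case, where every node in the rack tries to
send off-rack at its full link rate at the same time; the function name
is just for illustration and the figures are the ones from the example
above.

```python
# Rack uplink oversubscription, worst case: all nodes send off-rack at once.
# The node count and link speeds are the example figures from this message;
# oversubscription() is an illustrative helper, not part of Hadoop.

def oversubscription(nodes: int, node_gbps: float, uplink_gbps: float) -> float:
    """Ratio of aggregate node bandwidth to the rack's uplink bandwidth."""
    return (nodes * node_gbps) / uplink_gbps


nodes, node_gbps = 32, 1.0
for uplink_gbps in (10.0, 20.0):
    ratio = oversubscription(nodes, node_gbps, uplink_gbps)
    # Fair share of the uplink if every node is sending off-rack simultaneously.
    per_node_mbps = uplink_gbps / nodes * 1000
    print(f"{uplink_gbps:.0f} Gbps uplink: {ratio:.1f}:1 oversubscribed, "
          f"about {per_node_mbps:.0f} Mbps per node off-rack")
```

Any ratio above 1:1 means the uplink, not the node's own NIC, is the
limit for inter-rack traffic when the rack is busy.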

-Joey

On Mon, Jun 6, 2011 at 7:04 AM, elton sky <eltonsky9...@gmail.com> wrote:
> Thanks for the reply, Steve.
>
> I totally agree that benchmarking is a good idea. The problem is that I
> don't have a switch to play with, only a small cluster.
> I was curious about this, so I posted the question.
> Can some experienced people share their knowledge with us?
>
> Cheers
>
> On Mon, Jun 6, 2011 at 7:28 PM, Steve Loughran <ste...@apache.org> wrote:
>
>> On 06/06/11 08:22, elton sky wrote:
>>
>>> hello everyone,
>>>
>>> As I don't have experience with large-scale clusters, I cannot figure out
>>> why the inter-rack communication in a mapreduce job is "significantly"
>>> slower than intra-rack.
>>> I saw that the Cisco Catalyst 4900 series switches can reach up to 320 Gbps
>>> forwarding capacity. With 48 nodes connected at 1 Gbps Ethernet each, there
>>> should not be much contention at the switch, should there?
>>>
>>
>> I don't know enough about these switches; I do hear stories about buffering
>> and the like, and I also hear that a lot of switches don't always expect all
>> the ports to light up simultaneously.
>>
>> Outside hadoop, try setting up some simple bandwidth tests to measure
>> inter-rack bandwidth: have every node on one rack try and talk to one on
>> another at full rate.
>>
>> Set up every node talking to every other node at least once, to make sure
>> there aren't odd problems between two nodes, which can happen if one of the
>> NICs is playing up.
>>
>> Once you are happy that the basic bandwidth between servers is OK, then
>> it's time to start worrying about adding Hadoop to the mix
>>
>> -steve
>>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
