Larger Hadoop installations are space dense, with 20-40 nodes per rack. At that density across multiple racks, it becomes expensive to buy a single switch with enough capacity for all of the nodes in all of the racks. The typical solution is to install a switch per rack, with uplinks to a core switch that routes between the racks. In this arrangement, inter-rack communication is limited by the uplink bandwidth to the core switch. Typically these uplinks are 10-20 Gbps (bidirectional).
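To put rough numbers on that uplink limit, here is a minimal sketch in Python. It assumes 1 Gbps node links, that every node in the rack sends off-rack at the same time, and that the uplink is shared evenly; real switches and traffic patterns will behave less neatly. The function names are illustrative only and are not part of Hadoop.

```python
def oversubscription(nodes_per_rack, node_link_gbps, uplink_gbps):
    """Ratio of the rack's aggregate node bandwidth to its uplink bandwidth."""
    return (nodes_per_rack * node_link_gbps) / uplink_gbps


def per_node_interrack_gbps(nodes_per_rack, node_link_gbps, uplink_gbps):
    """Even share of the uplink per node when all nodes send off-rack at once.

    A node can never exceed its own link speed, hence the min().
    """
    return min(node_link_gbps, uplink_gbps / nodes_per_rack)


if __name__ == "__main__":
    # Rack sizes and uplink speeds taken from the ranges discussed above;
    # the 1 Gbps node link is an assumption.
    for nodes in (20, 32, 40):
        for uplink in (10, 20):
            ratio = oversubscription(nodes, 1.0, uplink)
            share = per_node_interrack_gbps(nodes, 1.0, uplink)
            print(f"{nodes} nodes, {uplink} Gbps uplink: "
                  f"{ratio:.2f}:1 oversubscribed, "
                  f"{share:.3f} Gbps per node between racks")
```

A ratio above 1:1 means the rack cannot drive all of its ports at full rate across the uplink, which is the case rack-local scheduling tries to avoid.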
Assuming you have 32 nodes in a rack with 1 Gbps links, then 20 Gbps isn't enough bandwidth to push all of those ports at full tilt between racks. That's why Hadoop has the ability to take advantage of rack locality. It will try to schedule I/O local to a rack where it's less likely to block.

-Joey

On Mon, Jun 6, 2011 at 7:04 AM, elton sky <eltonsky9...@gmail.com> wrote:
> Thanks for reply, Steve,
>
> I totally agree benchmark is a good idea. But the problem is I don't have
> switch to play with rather than a small cluster.
> I am curious of this and post the question.
> Can some experienced ppl can share their knowledge with us?
>
> Cheers
>
> On Mon, Jun 6, 2011 at 7:28 PM, Steve Loughran <ste...@apache.org> wrote:
>
>> On 06/06/11 08:22, elton sky wrote:
>>
>>> hello everyone,
>>>
>>> As I don't have experience with big scale cluster, I cannot figure out why
>>> the inter-rack communication in a mapreduce job is "significantly" slower
>>> than intra-rack.
>>> I saw cisco catalyst 4900 series switch can reach upto 320Gbps forwarding
>>> capacity. Connected with 48 nodes with 1Gbps ethernet each, it should not
>>> be much contention at the switch, is it?
>>>
>>
>> I don't know enough about these switches; I do hear stories about buffering
>> and the like, and I also hear that a lot of switches don't always expect all
>> the ports to light up simultaneously.
>>
>> Outside hadoop, try setting up some simple bandwidth tests to measure
>> inter-rack bandwidth: have every node on one rack try and talk to one on
>> another at full rate.
>>
>> Set up every node talking to every other node at least once, to make sure
>> there aren't odd problems between two nodes, which can happen if one of the
>> NICs is playing up.
>>
>> Once you are happy that the basic bandwidth between servers is OK, then
>> it's time to start worrying adding hadoop to the mix
>>
>> -steve
>>
>

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434