Brian Bockelman wrote:
Just to toss out some numbers.... (and because our users are making interesting numbers right now)

Here's our external network router: http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets

Here's the application-level transfer graph: http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska

In a squeeze, we can move 20-50 TB/day to/from other heterogeneous sites. Usually, we run out of free space before we can find the upper limit for a 24-hour period.

We use a protocol called GridFTP to move data back and forth between external (non-HDFS) clusters. The other sites we transfer with use niche software you probably haven't heard of (Castor, DPM, and dCache) because, well, it's niche software. I have no available data on HDFS<->S3 systems, but I'd again claim it's mostly a function of the amount of hardware you throw at it and the size of your network pipes.
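For a concrete sense of what that looks like in practice, here's a minimal sketch of driving a GridFTP transfer from Python by shelling out to the standard globus-url-copy client. The hostnames and paths are made up, and it assumes you already have a valid grid proxy and the client on your PATH:

# Minimal sketch: third-party-style GridFTP copy via globus-url-copy.
# Hostnames/paths below are invented examples, not real endpoints.
import subprocess

src = "gsiftp://gridftp01.site-a.example.edu/store/user/file.root"
dst = "gsiftp://gridftp01.site-b.example.edu/store/user/file.root"

subprocess.check_call([
    "globus-url-copy",
    "-vb",        # print transfer performance while it runs
    "-p", "8",    # use 8 parallel TCP streams
    src, dst,
])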

There are currently 182 datanodes; 180 are "traditional" ones with <3 TB each and 2 are big honking RAID arrays of 40 TB. Transfers are load-balanced amongst ~7 GridFTP servers, each of which has a 1 Gbps connection.
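Some back-of-the-envelope arithmetic on those numbers, just to show the observed 20-50 TB/day is consistent with 7 x 1 Gbps of outward-facing NICs:

# Pure arithmetic, no external dependencies.
servers = 7
gbps_per_server = 1.0
aggregate_gbps = servers * gbps_per_server             # 7 Gbps of WAN-facing capacity
tb_per_day = aggregate_gbps * 1e9 / 8 * 86400 / 1e12   # bits/s -> TB/day
print("%.1f TB/day at line rate" % tb_per_day)          # ~75.6 TB/day ceiling
# So 20-50 TB/day observed is roughly 25-65% of the theoretical ceiling.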


GridFTP is optimised for high-bandwidth network connections: it negotiates TCP buffer sizes and opens multiple parallel TCP connections, so when congestion control backs off after a dropped packet, only a fraction of the aggregate throughput takes the hit. It is probably best-in-class for long-haul transfers over the big university backbones where someone else pays for your traffic. You would be very hard pressed to get even close to that on any other protocol.
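A toy model of why the parallel streams matter. This ignores TCP recovery dynamics entirely; it just shows that a single loss event costs a smaller fraction of the aggregate as the stream count grows:

# Simplified: one loss roughly halves the congestion window of the stream it
# hits, so only that stream's share of the aggregate takes the cut.
def aggregate_after_one_loss(n_streams):
    unaffected = (n_streams - 1) / n_streams
    affected = 0.5 / n_streams
    return unaffected + affected

for n in (1, 2, 4, 8):
    print(n, aggregate_after_one_loss(n))  # 1->0.50, 2->0.75, 4->0.875, 8->0.9375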

I have no data on S3 transfers other than hearsay:
* write time to S3 can be slow, as it doesn't return until the data is persisted "somewhere". That's a better guarantee than a POSIX write operation (there's a sketch of this after the list).
* you have to rely on other people on your rack not wanting all the traffic for themselves. That's an EC2 API issue: you don't get to request/buy bandwidth to/from S3.
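To make the first point concrete, here's a minimal boto3 sketch of that write behaviour: put_object doesn't return until S3 has durably stored the object, so the call itself is the write barrier. Bucket and key names are invented, and it assumes AWS credentials are already configured:

# Sketch only; "my-example-bucket" and the key are hypothetical names.
import boto3

s3 = boto3.client("s3")
with open("part-00000", "rb") as f:
    s3.put_object(Bucket="my-example-bucket", Key="dataset/part-00000", Body=f)
# Returning without an exception means the object is persisted server-side,
# which is stronger than a local POSIX write() landing in the page cache.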

One thing to remember is that if you bring up a Hadoop cluster on any virtual server farm, disk IO is going to be way below physical IO rates. Even when the data is in HDFS, it will be slower to get at than dedicated high-RPM SCSI or SATA storage.
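If you'd rather check that on a given VM than take it on faith, a crude sequential-write probe like the one below will usually show the gap versus dedicated spindles. The path is hypothetical, and this is a sanity check, not a benchmark:

import os, time

def seq_write_mb_per_s(path, size_mb=1024, block=1 << 20):
    buf = os.urandom(block)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())   # force it to disk, not just the page cache
    elapsed = time.time() - start
    os.remove(path)
    return size_mb / elapsed

print("%.0f MB/s sequential write" % seq_write_mb_per_s("/tmp/io_probe.bin"))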
