Brian Bockelman wrote:
Just to toss out some numbers... (and because our users are putting up
some interesting numbers right now)
Here's our external network router:
http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets
Here's the application-level transfer graph:
http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska
In a squeeze, we can move 20-50TB/day to/from other heterogeneous
sites. Usually, we run out of free space before we can find the upper
limit for a 24-hour period.
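For scale, here's a quick back-of-envelope conversion of those daily
volumes into sustained line rates (assuming decimal units, 1 TB = 10^12
bytes):

    # Back-of-envelope: convert TB/day into a sustained rate in Gbps.
    # Assumes decimal units (1 TB = 1e12 bytes).
    SECONDS_PER_DAY = 86400

    def tb_per_day_to_gbps(tb):
        return tb * 1e12 * 8 / SECONDS_PER_DAY / 1e9

    for tb in (20, 50):
        print("%d TB/day ~= %.1f Gbps sustained" % (tb, tb_per_day_to_gbps(tb)))
    # 20 TB/day ~= 1.9 Gbps sustained
    # 50 TB/day ~= 4.6 Gbps sustained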
We use a protocol called GridFTP to move data back and forth between
our cluster and external (non-HDFS) sites. The other sites we transfer
with run niche software you probably haven't heard of (Castor, DPM, and
dCache) because, well, it's niche software. I have no available data on
because, well, it's niche software. I have no available data on
HDFS<->S3 systems, but I'd again claim it's mostly a function of the
amount of hardware you throw at it and the size of your network pipes.
There are currently 182 datanodes; 180 are "traditional" ones of <3TB
each, and 2 are big honking 40TB RAID arrays. Transfers are
load-balanced across ~7 GridFTP servers, each of which has a 1Gbps
connection.
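As a sanity check, 7 x 1Gbps is ~7 Gbps aggregate, comfortably above
the ~4.6 Gbps that 50TB/day works out to. The balancing itself can be
as dumb as round-robin; here's a minimal sketch, assuming a
hypothetical pool of hostnames (real deployments often use DNS
round-robin or a load-aware redirector instead):

    # Hypothetical round-robin balancer over a pool of GridFTP doors.
    # Hostnames are made up; real setups often use DNS round-robin or
    # a load-aware redirector instead.
    from itertools import cycle

    GRIDFTP_SERVERS = ["gridftp%02d.example.edu" % i for i in range(1, 8)]
    _pool = cycle(GRIDFTP_SERVERS)

    def pick_server():
        """Return the next GridFTP door in round-robin order."""
        return next(_pool)

    if __name__ == "__main__":
        for _ in range(3):
            print(pick_server())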
GridFTP is optimised for high-bandwidth networks: it negotiates TCP
buffer sizes and opens multiple parallel TCP streams, so when TCP
congestion control backs off after a dropped packet, only one stream
slows down and the transfer as a whole loses just a fraction of its
throughput. It is probably best-in-class for long-haul transfers over
the big university backbones where someone else pays for your traffic.
You would be very hard pressed to get even close to that with any other
protocol.
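In practice a transfer looks roughly like the sketch below: a thin
Python wrapper around globus-url-copy using its -p (parallel streams)
and -tcp-bs (TCP buffer size) options. The endpoints and parameter
values are made up for illustration:

    # Sketch: drive globus-url-copy with parallel TCP streams and a
    # large TCP buffer. -p and -tcp-bs are standard globus-url-copy
    # options; the endpoints and values are illustrative only.
    import subprocess

    def gridftp_copy(src, dst, streams=8, tcp_buffer=8 * 1024 * 1024):
        subprocess.run(
            [
                "globus-url-copy",
                "-p", str(streams),          # parallel TCP streams
                "-tcp-bs", str(tcp_buffer),  # TCP buffer size in bytes
                "-vb",                       # print throughput as it runs
                src,
                dst,
            ],
            check=True,
        )

    # Hypothetical endpoints:
    # gridftp_copy("gsiftp://door.example.edu/store/file.root",
    #              "file:///hadoop/store/file.root")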
I have no data on S3 xfers other than hearsay:
* write time to S3 can be slow, as the call doesn't return until the
data is persisted "somewhere". That's a stronger guarantee than a POSIX
write operation gives you (see the sketch after this list).
* you have to rely on the other people on your rack not wanting all the
traffic for themselves. That's an EC2 API issue: you don't get to
request/buy bandwidth to/from S3.
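To make the first point concrete, here's a minimal sketch using the
boto3 client (the bucket and key names are hypothetical): put_object
doesn't return until S3 has durably stored the object, whereas a POSIX
write() can return while the data is still sitting in the page cache.

    # Sketch: an S3 PUT blocks until the object is durably stored,
    # unlike a POSIX write(), which may return with the data still in
    # the page cache. Bucket and key names are hypothetical.
    import time
    import boto3

    s3 = boto3.client("s3")
    payload = b"x" * (64 * 1024 * 1024)  # 64MB of dummy data

    start = time.monotonic()
    s3.put_object(Bucket="my-example-bucket", Key="bench/blob", Body=payload)
    elapsed = time.monotonic() - start
    # When put_object returns, the data is persisted on S3's side.
    print("64MB persisted in %.2fs" % elapsed)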
One thing to remember is that if you bring up a Hadoop cluster on any
virtual server farm, disk I/O is going to be way below physical I/O
rates. Even once the data is in HDFS, it will be slower to get at than
on dedicated high-RPM SCSI or SATA storage.
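If you want to sanity-check that on your own nodes, a crude
sequential-write timing like the one below makes the virtualization
penalty visible (a rough sketch; dd or fio will give better numbers,
and the target path is up to you):

    # Crude sequential-write benchmark: compare the MB/s a VM's
    # virtual disk gives you against bare metal. Use dd or fio for
    # serious measurements; this is just a sanity check.
    import os
    import time

    def write_throughput(path, total_mb=512, chunk_mb=4):
        chunk = b"\0" * (chunk_mb * 1024 * 1024)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        start = time.monotonic()
        try:
            for _ in range(total_mb // chunk_mb):
                os.write(fd, chunk)
            os.fsync(fd)  # flush to the device so we measure the disk, not the cache
        finally:
            os.close(fd)
        return total_mb / (time.monotonic() - start)

    if __name__ == "__main__":
        print("%.1f MB/s sequential write" % write_throughput("/tmp/iobench.dat"))
        os.remove("/tmp/iobench.dat")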