Brian Bockelman wrote:
Just to toss out some numbers... (and because our users are putting up
some interesting numbers right now)
Here's our external network router:
http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets
Here's the application-level transfer graph:
http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska
In a squeeze, we can move 20-50TB/day to/from other heterogeneous
sites. Usually, we run out of free space before we can find the upper
limit for a 24-hour period.
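For scale, here's a quick back-of-envelope conversion of those daily
volumes into sustained line rates (assuming decimal units, 1 TB = 10^12
bytes):

    # Back-of-envelope: convert TB/day into a sustained rate in Gbps.
    # Assumes decimal units (1 TB = 1e12 bytes).
    SECONDS_PER_DAY = 86400

    def tb_per_day_to_gbps(tb):
        return tb * 1e12 * 8 / SECONDS_PER_DAY / 1e9

    for tb in (20, 50):
        print("%d TB/day ~= %.1f Gbps sustained" % (tb, tb_per_day_to_gbps(tb)))
    # 20 TB/day ~= 1.9 Gbps sustained
    # 50 TB/day ~= 4.6 Gbps sustained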
We use a protocol called GridFTP to move data back and forth between
our cluster and external (non-HDFS) sites. The other sites we transfer
with run niche software you probably haven't heard of (Castor, DPM, and
dCache) because, well, it's niche software. I have no available data on
because, well, it's niche software. I have no available data on
HDFS<->S3 systems, but I'd again claim it's mostly a function of the
amount of hardware you throw at it and the size of your network pipes.
There are currently 182 datanodes; 180 are "traditional" ones of <3TB
each, and 2 are big honking 40TB RAID arrays. Transfers are
load-balanced across ~7 GridFTP servers, each of which has a 1Gbps
connection.
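As a sanity check, 7 x 1Gbps is ~7 Gbps aggregate, comfortably above
the ~4.6 Gbps that 50TB/day works out to. The balancing itself can be
as dumb as round-robin; here's a minimal sketch, assuming a
hypothetical pool of hostnames (real deployments often use DNS
round-robin or a load-aware redirector instead):

    # Hypothetical round-robin balancer over a pool of GridFTP doors.
    # Hostnames are made up; real setups often use DNS round-robin or
    # a load-aware redirector instead.
    from itertools import cycle

    GRIDFTP_SERVERS = ["gridftp%02d.example.edu" % i for i in range(1, 8)]
    _pool = cycle(GRIDFTP_SERVERS)

    def pick_server():
        """Return the next GridFTP door in round-robin order."""
        return next(_pool)

    if __name__ == "__main__":
        for _ in range(3):
            print(pick_server())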
GridFTP is optimised for high-bandwidth networks: it negotiates TCP
buffer sizes and opens multiple parallel TCP streams, so when TCP
congestion control backs off after a dropped packet, only one stream
slows down and the transfer as a whole loses just a fraction of its
throughput. It is probably best-in-class for long-haul transfers over
the big university backbones where someone else pays for your traffic.
You would be very hard pressed to get even close to that with any other
protocol.
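In practice a transfer looks roughly like the sketch below: a thin
Python wrapper around globus-url-copy using its -p (parallel streams)
and -tcp-bs (TCP buffer size) options. The endpoints and parameter
values are made up for illustration:

    # Sketch: drive globus-url-copy with parallel TCP streams and a
    # large TCP buffer. -p and -tcp-bs are standard globus-url-copy
    # options; the endpoints and values are illustrative only.
    import subprocess

    def gridftp_copy(src, dst, streams=8, tcp_buffer=8 * 1024 * 1024):
        subprocess.run(
            [
                "globus-url-copy",
                "-p", str(streams),          # parallel TCP streams
                "-tcp-bs", str(tcp_buffer),  # TCP buffer size in bytes
                "-vb",                       # print throughput as it runs
                src,
                dst,
            ],
            check=True,
        )

    # Hypothetical endpoints:
    # gridftp_copy("gsiftp://door.example.edu/store/file.root",
    #              "file:///hadoop/store/file.root")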
I have no data on S3 xfers other than hearsay:
* write time to S3 can be slow, as the call doesn't return until the
data is persisted "somewhere". That's a stronger guarantee than a POSIX
write operation gives you (see the sketch after this list).
* you have to rely on the other people on your rack not wanting all the
traffic for themselves. That's an EC2 API issue: you don't get to
request/buy bandwidth to/from S3.
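To make the first point concrete, here's a minimal sketch using the
boto3 client (the bucket and key names are hypothetical): put_object
doesn't return until S3 has durably stored the object, whereas a POSIX
write() can return while the data is still sitting in the page cache.

    # Sketch: an S3 PUT blocks until the object is durably stored,
    # unlike a POSIX write(), which may return with the data still in
    # the page cache. Bucket and key names are hypothetical.
    import time
    import boto3

    s3 = boto3.client("s3")
    payload = b"x" * (64 * 1024 * 1024)  # 64MB of dummy data

    start = time.monotonic()
    s3.put_object(Bucket="my-example-bucket", Key="bench/blob", Body=payload)
    elapsed = time.monotonic() - start
    # When put_object returns, the data is persisted on S3's side.
    print("64MB persisted in %.2fs" % elapsed)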
One thing to remember is that if you bring up a Hadoop cluster on any
virtual server farm, disk I/O is going to be way below physical I/O
rates. Even once the data is in HDFS, it will be slower to get at than
on dedicated high-RPM SCSI or SATA storage.
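If you want to sanity-check that on your own nodes, a crude
sequential-write timing like the one below makes the virtualization
penalty visible (a rough sketch; dd or fio will give better numbers,
and the target path is up to you):

    # Crude sequential-write benchmark: compare the MB/s a VM's
    # virtual disk gives you against bare metal. Use dd or fio for
    # serious measurements; this is just a sanity check.
    import os
    import time

    def write_throughput(path, total_mb=512, chunk_mb=4):
        chunk = b"\0" * (chunk_mb * 1024 * 1024)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        start = time.monotonic()
        try:
            for _ in range(total_mb // chunk_mb):
                os.write(fd, chunk)
            os.fsync(fd)  # flush to the device so we measure the disk, not the cache
        finally:
            os.close(fd)
        return total_mb / (time.monotonic() - start)

    if __name__ == "__main__":
        print("%.1f MB/s sequential write" % write_throughput("/tmp/iobench.dat"))
        os.remove("/tmp/iobench.dat")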