Will's right, meta-data transactions go through the Namenode, but all the
content data read/write activity is directly between Clients and Datanodes,
and replication activity is Datanode-to-Datanode.  No bottlenecks, as long
as your Namenode has enough RAM to hold the namespace in memory, and enough
cores to handle a modestly high transaction rate.
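
If it helps to see that from the client side, here's a rough, untested sketch
against the stock Java FileSystem API (the path is made up): the open() call is
the only Namenode round-trip, and the read() calls stream bytes straight from
whichever Datanodes hold the blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadSample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // default FS points at the Namenode

        // open(): a metadata call to the Namenode to look up block locations
        FSDataInputStream in = fs.open(new Path("/data/example.dat"));

        // read(): bytes come directly from the Datanodes holding each block;
        // the Namenode never sees this traffic
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) > 0) {
          // process n bytes...
        }
        in.close();
      }
    }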

And if the individual data files are large (Hadoop-scale "large", that is :-) ),
you can even decrease the meta-data/data ratio by increasing the block size from
the default 64MB to 128MB or even 256MB.
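
Roughly what that looks like in code (untested sketch; the property name, path,
and sizes are just for illustration): you can bump the client-side default, or
use the five-argument create() to override the block size for a single file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BigBlockWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default lives in hdfs-site.xml (dfs.block.size);
        // setting it here only affects files created by this client.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Or per file: create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/data/big.dat"),
            true, 64 * 1024, (short) 3, 256L * 1024 * 1024);
        out.write(new byte[1024]);  // data goes to a Datanode pipeline, not the Namenode
        out.close();
      }
    }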

--Matt


On May 10, 2011, at 6:03 AM, Will Maier wrote:

Hi Jonathan-

On Tue, May 10, 2011 at 05:50:03AM -0700, Jonathan Disher wrote:
> I will preface this with a couple statements: a) it's almost 6am, and I've
> been up all night b) I'm drugged up from an allergic reaction, so I may not be
> firing on all 64 bits.
> 
> Do I correctly understand the HDFS architecture in that the namenode is a
> network bottleneck into the system? I.e., it doesn't really matter how many
> ethernet interfaces I roll into my data nodes, I will always be limited in
> how much traffic I can drive to the HDFS pool by the network capacity of the
> namenode?

No. This diagram should help:

   http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#NameNode+and+DataNodes

The Namenode is a single point of failure, not (under most imaginable
conditions) a bottleneck.

> I am trying to move a -lot- of data, and i'd like to not throttle the namenode
> (especially in the old cluster, where I cannot just bond up more interfaces).
> If there's a way to spread the inbound network (for block writes) traffic I'd
> love to hear it.

During our (highly distributed) migration, we were writing into HDFS at up to
5 GB/s.
The more datanodes and writers you have, the faster your aggregate throughput.
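(Back of the envelope, assuming 1GbE on each datanode, i.e. roughly 100 MB/s per
link: sustaining 5 GB/s of ingest takes on the order of 50 writers' worth of
inbound bandwidth spread across the datanodes, plus the Datanode-to-Datanode
replication pipeline traffic on top of that; none of it crosses the Namenode's
interfaces.)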

-- 

Will Maier - UW High Energy Physics
cel: 608.438.6162
tel: 608.263.9692
web: http://www.hep.wisc.edu/~wcmaier/
