I really enjoyed reading this post:
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
Great job.

Some related questions:

1. The article says that HDFS always keeps 2 copies in the same
rack and the 3rd in a different rack. As far as I can see, this only
speeds up the HDFS "put" (file creation) time. Wouldn't it be better
to spread the copies across 3 racks? What other advantages does the
2+1 approach have?
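
To make the trade-off I am asking about concrete, here is a rough toy model, not HDFS code. It assumes the write goes through a pipeline (writer -> replica 1 -> replica 2 -> replica 3), with the first replica on the writer's own rack. The rack labels, layouts and function names are all made up for illustration.

```python
# Toy model only: racks are labels, and a write pipeline is an ordered list of racks.

def cross_rack_hops(writer_rack, replica_racks):
    """Count pipeline hops that cross a rack boundary (writer -> r1 -> r2 -> r3)."""
    path = [writer_rack] + list(replica_racks)
    return sum(1 for a, b in zip(path, path[1:]) if a != b)

def rack_failures_tolerated(replica_racks):
    """Number of whole racks that can be lost while one replica still survives."""
    return len(set(replica_racks)) - 1

# Hypothetical layouts: one replica on the writer's rack "A", the rest elsewhere.
two_plus_one = ["A", "B", "B"]   # two copies share rack B
three_racks = ["A", "B", "C"]    # every copy on its own rack

for name, layout in [("2+1", two_plus_one), ("3 racks", three_racks)]:
    print(name, cross_rack_hops("A", layout), rack_failures_tolerated(layout))
```

In this model the 2+1 layout crosses racks once per block written and survives the loss of one rack, while the 3-rack layout crosses racks twice and survives the loss of two. So is write traffic the only thing 2+1 is buying?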

2. In HDFS the client reads blocks sequentially. Why can't the client
read the blocks in parallel? Wouldn't that speed up lookups from the
client's perspective?
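
What I have in mind is roughly the following. This is only a sketch that uses a local file as a stand-in for an HDFS file; it does not use any Hadoop API. BLOCK_SIZE, read_block and read_parallel are hypothetical names chosen for the example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # tiny hypothetical block size so the example is easy to follow

def read_block(path, index):
    """Read one block-sized byte range using its own file handle."""
    with open(path, "rb") as f:
        f.seek(index * BLOCK_SIZE)
        return f.read(BLOCK_SIZE)

def read_parallel(path, workers=4):
    """Fetch all blocks concurrently and reassemble them in block order."""
    size = os.path.getsize(path)
    n_blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE  # round up
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the blocks line up.
        return b"".join(pool.map(lambda i: read_block(path, i), range(n_blocks)))

def demo():
    """Write a small temporary file and read it back block by block."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"abcdefghij")
    try:
        return read_parallel(tmp.name)
    finally:
        os.remove(tmp.name)
```

Each worker would fetch a different block (ideally from a different DataNode) and the client would stitch the blocks together in order.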

3. There are some cases in which a DataNode daemon itself will need
to read a block of data from HDFS. When would a DataNode need to read
from other DataNodes? Is it when the split size is larger than the
block size? Even in that case it is the TaskTracker that should ask
for the data, not the DataNode.

-Thanks
Prasenjit .
