I really enjoyed reading this post: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/ Great job.
Some related questions:

1. The article says that HDFS always keeps two copies of a block in the same rack and the third in a different rack. As far as I can tell, this only speeds up the HDFS "put" (file creation) time. Wouldn't it be better to spread the copies across three racks? What other advantage does the 2+1 approach have?

2. In HDFS the client reads blocks sequentially. Why can't the client read the blocks in parallel? Wouldn't that speed up reads from the client's perspective?

3. The article mentions that there are some cases in which a Data Node daemon itself will need to read a block of data from HDFS. When would a DataNode need to read from other DataNodes? Is it when the split size is larger than the block size? Even in that case, it is the TaskTracker that should ask for the data, not the DataNode.

Thanks,
Prasenjit
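
P.S. To make question 1 concrete, here is a minimal sketch of the trade-off as I understand it. The rack names and both helper functions are made up for illustration and are not part of any Hadoop API. It only counts how often a three-replica write pipeline crosses a rack boundary and how many whole-rack failures the block would survive.

```python
def cross_rack_hops(pipeline_racks):
    """Count rack-boundary crossings along a write pipeline.

    pipeline_racks lists the rack of each replica in pipeline order.
    The hop from the client to the first replica is not counted.
    """
    return sum(1 for a, b in zip(pipeline_racks, pipeline_racks[1:]) if a != b)


def rack_failures_survived(pipeline_racks):
    """Whole-rack failures that still leave at least one replica alive."""
    return len(set(pipeline_racks)) - 1


# Hypothetical layouts: one replica alone plus two sharing a rack,
# versus one replica in each of three racks.
two_plus_one = ["rack1", "rack5", "rack5"]
three_racks = ["rack1", "rack5", "rack9"]

for name, layout in [("2+1", two_plus_one), ("3 racks", three_racks)]:
    print(name, cross_rack_hops(layout), rack_failures_survived(layout))
```

Under these assumptions the 2+1 layout crosses racks once during the write and survives one rack failure, while the three-rack layout crosses twice and survives two. So my question is whether the saved cross-rack transfer is the only reason for the 2+1 choice.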