On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:

> Thanks, I'm independently doing some digging into Hadoop networking
> requirements and 
> had a couple of quick follow-ups. Could I have some specific info on why
> different data centers 
> cannot be supported for master node and data node comms?
> Also, what 
> may be the benefits/use cases for such a scenario?

        Most people who try to put the NN and DNs in different data centers are 
trying to achieve disaster recovery:  one file system in multiple locations.  
That isn't the way HDFS is designed and it will end in tears. There are 
multiple problems:

1) no guarantee that one block replica will be in each data center (thereby 
defeating the whole purpose!)
2) assuming one can work out problem 1, during a network break, the NN will 
lose contact with one half of the DNs, causing a massive re-replication 
storm
3) if one is using MR on top of this HDFS, the shuffle will likely kill the 
network in between, making MR performance pretty dreadful and causing 
delays for the DN heartbeats
4) I don't even want to think about rebalancing.
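
To put a rough number on problem 1, here is a minimal sketch. It assumes a 
simplified placement model, not the actual HDFS placement code: each block's 
replicas land on two distinct racks, and every ordered pair of racks is 
equally likely, with no awareness of which data center a rack sits in. The 
function name and the rack counts are made up for illustration.

```python
from itertools import permutations


def single_dc_fraction(racks_per_dc: int, num_dcs: int = 2) -> float:
    """Fraction of two-rack placements that stay inside one data center.

    Assumption (simplified model): replicas go to two distinct racks and
    every ordered pair of racks is equally likely.
    """
    # Label each rack with the data center it lives in.
    racks = [(dc, rack) for dc in range(num_dcs) for rack in range(racks_per_dc)]
    # Enumerate every ordered pair of distinct racks.
    pairs = list(permutations(racks, 2))
    # Count the pairs where both racks share a data center.
    same_dc = sum(1 for first, second in pairs if first[0] == second[0])
    return same_dc / len(pairs)


if __name__ == "__main__":
    for n in (1, 2, 10):
        print(n, "racks per DC:", round(single_dc_fraction(n), 3))
```

Under those assumptions, with ten racks in each of two data centers, a little 
under half of all blocks (9 out of 19) would have no replica in the other 
data center.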

        ... and I'm sure a lot of other problems I'm forgetting at the moment.  
So don't do it.

        If you want disaster recovery, set up two completely separate HDFSes 
and run everything in parallel.
