Anthony Ikeda wrote:
Thanks Hemanth,

In regards to different locations of the HADOOP home this is low
priority more for testing not production. I was trying to install HADOOP
for testing over 2 machines with only a Windows XP machine running
Cygwin and a Mac running Darwin. Not a priority.

Things are much easier if
 -all your machines have the same OS and disk structure
 -you are running on Linux
 -you use some CM tool to automate setup/deploy and the pushing out of config files

Start with VMware or VirtualBox images now, so you learn about management sooner rather than later.

In regards to my last question about operating in a detached fashion, we
are trying to factor in what happens when the link between both sites is
cut. Will both sites operate independently until the connection is
re-established? Is there any particular setup required to ensure we can
cover this scenario or is it an out-of-the-box feature?

HDFS and the MapReduce engine are designed to run in a single datacentre with high-bandwidth, high-reliability links; current releases also assume the facility is secure and all users are trusted. The key SPOF, the Namenode, doesn't do failover, so when it goes down or the network partitions, all machines that cannot see the NN poll and spin until it comes back. That can take a while, unless you have a secondary namenode to keep the persistent files up to date. The workers all assume that the hostname and IP address of the namenode don't change, and they never reread their config. You could use DNS to do failover, but you would have to tune the JVMs to not cache IP addresses for very long.
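
As a rough illustration of the JVM tuning mentioned above, here is a minimal sketch that shortens the JVM's DNS cache lifetime through the standard security properties networkaddress.cache.ttl and networkaddress.cache.negative.ttl. The class name, the helper method and the 30s/5s values are my own assumptions for illustration, not anything Hadoop ships; the properties generally need to be set before the JVM does its first hostname lookup, and the same values can instead go in the JRE's java.security file.

```java
import java.security.Security;

public class DnsCacheTuning {

    // Hypothetical helper (not part of Hadoop): caps how long this JVM
    // caches successful and failed DNS lookups, in seconds.
    static void limitDnsCaching(int positiveTtlSeconds, int negativeTtlSeconds) {
        // Standard Java security properties controlling the InetAddress cache.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        // Assumed values: re-resolve good answers after 30s, failures after 5s.
        // Call this early, before any hostname has been resolved.
        limitDnsCaching(30, 5);
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

Even with a short TTL, a daemon only picks up a new address when it next resolves the hostname, so this is only one piece of any DNS-based failover.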

To do cross-site stuff you'd need a separate HDFS filesystem per site; synchronisation of data then becomes a task for the higher-level apps. I don't know what HBase, Cassandra or other column DB tools do here.


-steve

