Along these lines, I'm curious what "management tools" folks are using to ensure cluster availability (i.e., auto-restarting failed datanodes/namenodes).
Are you using a custom cron script, or maybe something more complex (Ganglia, Nagios, Puppet, etc.)?

Thanks,
Norbert

On 10/28/08, Steve Loughran <[EMAIL PROTECTED]> wrote:
>
> wmitchell wrote:
>
>> Hi All,
>>
>> I've been working through Michael Noll's multi-node cluster setup
>> example (Running_Hadoop_On_Ubuntu_Linux) for Hadoop, and I have a
>> working setup. On my slave machine -- which is currently running a
>> datanode -- I killed the process in an effort to simulate some sort of
>> failure on the slave machine's datanode. I had assumed that the
>> namenode would have been polling its datanodes and would thus attempt
>> to bring up any node that goes down. On looking at my slave machine,
>> it seems that the datanode process is still down (I've checked jps).
>>
>
> That's up to you or your management tools. The namenode knows that the
> datanode is unreachable, but doesn't know how to go about reconnecting
> it to the network. Which, given that there are many causes of "down",
> sort of makes sense. A failed switch, a dead hard disk, or a crashed
> process all look the same from the namenode's side: no datanode
> heartbeats.
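For the simple cron-script option, here is a minimal watchdog sketch. It uses `jps` to detect a dead DataNode the same way the original poster did, and restarts it with `hadoop-daemon.sh`. The paths and the default `HADOOP_HOME` are assumptions based on a standard tarball install; adjust them for your layout.

```shell
#!/bin/sh
# Hypothetical DataNode watchdog, meant to be run from cron on each slave.
# Assumes a tarball install with bin/hadoop-daemon.sh under HADOOP_HOME
# and `jps` on the PATH; both are illustrative, not gospel.

HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop}

# True if a Java process with the given class name appears in jps output.
process_running() {
    jps | grep -q "$1"
}

# Restart the DataNode if it has died; print what was done so cron can
# mail or log the output.
check_datanode() {
    if process_running DataNode; then
        echo "DataNode ok"
    else
        echo "DataNode down, restarting"
        "$HADOOP_HOME/bin/hadoop-daemon.sh" start datanode
    fi
}

# Example crontab entry, checking every 5 minutes:
#   */5 * * * * /usr/local/bin/check-datanode.sh
```

This only papers over process crashes, of course; as Steve points out, a dead disk or switch looks identical to the namenode, and a script like this can't fix those. For anything beyond a handful of nodes, the Nagios/Ganglia route (alerting a human, plus per-host restart hooks) scales better than silent auto-restarts.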