I think a cron job would be a good solution: a simple watchdog script that checks that the processes are alive and restarts any that are down.
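A minimal sketch of such a watchdog, assuming a standard Hadoop install (the `HADOOP_HOME` path, log location, and script name here are my own assumptions, not part of the Hadoop distribution):

```shell
#!/bin/sh
# Hypothetical DataNode watchdog for cron (paths are assumptions).
# Install on each slave and run every few minutes, e.g. in crontab:
#   */5 * * * * /usr/local/bin/check-datanode.sh

HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
LOG="${WATCHDOG_LOG:-/tmp/datanode-watchdog.log}"

# Returns 0 if a DataNode JVM shows up in jps output.
datanode_running() {
    jps 2>/dev/null | grep -q DataNode
}

if ! datanode_running; then
    echo "$(date): DataNode not found, restarting" >> "$LOG"
    "$HADOOP_HOME/bin/hadoop-daemon.sh" start datanode >> "$LOG" 2>&1
fi
```

The same pattern works for the TaskTracker (grep for `TaskTracker`, start with `hadoop-daemon.sh start tasktracker`); the NameNode usually deserves more care than a blind restart, since it may be down for a reason you want to investigate first.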


Norbert Burger wrote:
Along these lines, I'm curious what "management tools" folks are using to
ensure cluster availability (i.e., auto-restart failed datanodes/namenodes).

Are you using a custom cron script, or maybe something more complex
(Ganglia, Nagios, puppet, etc.)?

Thanks,
Norbert

On 10/28/08, Steve Loughran <[EMAIL PROTECTED]> wrote:
wmitchell wrote:

Hi All,

I've been working through Michael Noll's multi-node cluster setup example
(Running_Hadoop_On_Ubuntu_Linux) for Hadoop and I have a working setup. I
then killed the datanode process on my slave machine in an effort to
simulate some sort of failure on that node. I had assumed that the
namenode would poll its datanodes and attempt to bring up any node that
goes down. Looking at my slave machine, it seems that the datanode
process is still down (I've checked jps).


That's up to you or your management tools. The namenode knows that the
datanode is unreachable, but doesn't know how to go about reconnecting it to
the network. Which, given there are many causes of "down", makes sense:
the switch failing, the HDDs dying, or the process crashing all look
the same, namely no datanode heartbeats.



