Along these lines, I'm curious what "management tools" folks are using to
ensure cluster availability (i.e., auto-restarting failed datanodes/namenodes).

Are you using a custom cron script, or maybe something more complex
(Ganglia, Nagios, Puppet, etc.)?
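
By "custom cron script" I mean something along the lines of the sketch
below -- the install path, the hadoop-daemon.sh invocation and the jps
check are just assumptions from a stock tarball setup, not anything
anyone here has posted:

#!/usr/bin/env python
# Minimal "restart the datanode if it died" watchdog, meant to be run
# from cron on each slave.  HADOOP_HOME and the daemon script location
# are assumptions; adjust for your own layout.
import subprocess

HADOOP_HOME = "/usr/local/hadoop"   # assumed install path

def datanode_running():
    # jps prints one "<pid> <MainClass>" line per JVM; a running
    # datanode shows up with the main class "DataNode".
    out = subprocess.check_output(["jps"]).decode()
    return any(line.strip().endswith(" DataNode") for line in out.splitlines())

if __name__ == "__main__":
    if not datanode_running():
        # Start the datanode the same way start-dfs.sh would on a slave.
        subprocess.call(
            ["%s/bin/hadoop-daemon.sh" % HADOOP_HOME, "start", "datanode"])

Dropped into a one-minute crontab entry, that covers the "JVM died"
case; it obviously does nothing for a dead box or a dead switch, which
is where I imagine the heavier tools come in.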

Thanks,
Norbert

On 10/28/08, Steve Loughran <[EMAIL PROTECTED]> wrote:
>
> wmitchell wrote:
>
>> Hi All,
>>
>> I've been working through Michael Noll's multi-node cluster setup
>> example (Running_Hadoop_On_Ubuntu_Linux) for Hadoop and I have a
>> working setup. On my slave machine, which is currently running a
>> datanode, I then killed the process in an effort to simulate some
>> sort of failure of that datanode. I had assumed that the namenode
>> would have been polling its datanodes and would thus try to bring
>> back up any node that goes down. Looking at my slave machine, it
>> seems that the datanode process is still down (I've checked jps).
>>
>>
> That's up to you or your management tools. The namenode knows that the
> datanode is unreachable, but doesn't know how to go about reconnecting it to
> the network. Which, given there are many causes of "down", sort of makes
> sense. The switch failing, the HDDs dying, or the process crashing all
> look the same: no datanode heartbeats.
>
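
To illustrate Steve's point: from outside the box, an external monitor
can really only ask "does the node answer?". A Nagios-style probe of the
datanode's HTTP status port is one way to do that -- the port, timeout
and exit codes below are assumptions based on the stock
dfs.datanode.http.address default and the usual Nagios conventions, so
treat it as a sketch:

#!/usr/bin/env python
# Nagios-style reachability probe for a slave's datanode HTTP port.
# A dead switch, dead disks and a crashed JVM all look identical from
# here: "no response".  50075 is the era-default datanode HTTP port;
# change it if you have reconfigured dfs.datanode.http.address.
import socket
import sys

host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
port = 50075

try:
    socket.create_connection((host, port), timeout=5).close()
    print("OK - datanode answering on %s:%d" % (host, port))
    sys.exit(0)      # Nagios OK
except (socket.error, socket.timeout):
    print("CRITICAL - no response from %s:%d" % (host, port))
    sys.exit(2)      # Nagios CRITICAL

Whatever it reports, it can only say that the node is unreachable, not
why -- actually restarting the daemon still needs something running on
(or reaching into) the slave itself.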
