Thank you very much, Allen,

"common-user@ would likely have been better, but I'm too lazy to forward you 
there today. :)"
Thank you :-)

"Do you want monitoring information or metrics information? "
I need monitoring information. 
I am working on deploying Hadoop on a small cluster. For now, I am interested in 
restarting the nodes Hadoop detects as crashed (restarting the node or even 
rebooting the OS).

"Instead, one should monitor the namenode and jobtracker and alert based on a 
percentage of availability.  ... "
Indeed.
I use Hadoop 0.20.203.
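
Just to make sure I understood the thresholds you mention further down (warn on 
10%, alert on 20%, panic on 30%), this is roughly how I would translate them. 
It is only a sketch, and the dead/total counts would come from whatever scrape 
I end up doing:

# Sketch of the warn/alert/panic idea: classify cluster health from the
# fraction of nodes that look unavailable.  The 10/20/30 percentages are
# the ones suggested in this thread.
def availability_level(total_nodes, dead_nodes):
    if total_nodes == 0:
        return "unknown"
    pct_dead = 100.0 * dead_nodes / total_nodes
    if pct_dead >= 30:
        return "panic"
    elif pct_dead >= 20:
        return "alert"
    elif pct_dead >= 10:
        return "warn"
    return "ok"

# e.g. availability_level(50, 11) -> "alert" (22% of the nodes look down)

Does that match what you had in mind?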

"This can be done in a variety of ways, ..."
Could you please provide some pointers?
Do you know how I can access the monitoring information of the namenode or the 
jobtracker so that I can extract a list of failed nodes?
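
To make the question more concrete, this is roughly what I was picturing for 
the HDFS side. It is only a sketch, I have not tested it, and the exact wording 
of the dfsadmin report summary is a guess on my part:

# Sketch only: count dead datanodes by parsing "hadoop dfsadmin -report".
# I am assuming the summary contains a line like
# "Datanodes available: 9 (10 total, 1 dead)" -- please correct me if the
# 0.20.203 output looks different.
import re
import subprocess

def dead_datanode_count():
    report = subprocess.check_output(["hadoop", "dfsadmin", "-report"])
    report = report.decode("utf-8", errors="replace")
    match = re.search(r"\((\d+)\s+total,\s+(\d+)\s+dead\)", report)
    if match is None:
        raise RuntimeError("could not find the total/dead summary line")
    total, dead = int(match.group(1)), int(match.group(2))
    return total, dead

if __name__ == "__main__":
    total, dead = dead_datanode_count()
    print("%d of %d datanodes look dead" % (dead, total))

This only gives me counts, though, and nothing for the jobtracker side, which 
is why I am asking where the namenode/jobtracker expose a proper list of 
failed nodes.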

Thank you very much for your help

P.S.:
The reason I thought of using metrics information is that the metrics are 
periodic and seemed easy to access. I thought of using them as heartbeats only 
(i.e., if I do not receive a metric for 2-3 periods, I reset the node).
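
Roughly, this is what I had in mind (again only a sketch; reset_node() is a 
placeholder for whatever restart mechanism I end up using):

# Sketch of the "metrics as heartbeats" idea: remember when each node last
# reported a metric, and reset any node that has been silent for more than
# a few reporting periods.  The period and the reset command are placeholders.
import time

METRIC_PERIOD = 10          # seconds between metric reports (configurable)
MISSED_PERIODS_ALLOWED = 3  # reset after roughly 3 silent periods

last_seen = {}              # node name -> timestamp of last received metric

def record_metric(node):
    # Call this whenever a metric arrives from `node`.
    last_seen[node] = time.time()

def nodes_to_reset():
    # Nodes that have been silent longer than the allowed window.
    cutoff = time.time() - METRIC_PERIOD * MISSED_PERIODS_ALLOWED
    return [node for node, ts in last_seen.items() if ts < cutoff]

def reset_node(node):
    # Placeholder: this is where I would restart the daemons or reboot the OS.
    print("would reset %s" % node)

But if scraping the namenode/jobtracker is the better-supported route, I am 
happy to drop this idea.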

Thank you 

-sam



________________________________
From: Allen Wittenauer <a...@apache.org>
To: mapreduce-user@hadoop.apache.org
Sent: Tue, July 12, 2011 3:13:42 PM
Subject: Re: How to query a slave node for monitoring information

On Jul 12, 2011, at 3:02 PM, <samdispmail-tru...@yahoo.com>
<samdispmail-tru...@yahoo.com> wrote:
> I am new to Hadoop, and I apologize if this was answered before, or if this
> is not the right list for my question.

    common-user@ would likely have been better, but I'm too lazy to forward you 
there today. :)

> 
> I am trying to do the following:
> 1- Read monitoring information from the slave nodes in Hadoop
> 2- Process the data to detect node failures (node crashes, problems in
> requests, etc.) and decide if I need to restart the whole machine.
> 3- Restart the machine running the slave that is facing problems


    At scale, one doesn't monitor individual nodes for up/down.  Verifying the
up/down of a given node will drive you insane and is pretty much a waste of
time unless the grid itself is under-configured to the point that *every*
*node* *counts*.  (If that is the case, then there are bigger issues afoot...)

    Instead, one should monitor the namenode and jobtracker and alert based on
a percentage of availability.  This can be done in a variety of ways, depending
upon which version of Hadoop is in play.  For 0.20.2, a simple screen scrape is
good enough.  I recommend warn on 10%, alert on 20%, panic on 30%.

> My question is about step 1: collecting monitoring information.
> I have checked Hadoop's monitoring features, but currently you can forward
> the monitoring data to files or to Ganglia.

    
    Do you want monitoring information or metrics information?  Ganglia is
purely a metrics tool.  Metrics are a different animal.  While it is possible
to alert on them, in most cases they aren't particularly useful in a monitoring
context other than up/down.
