Thank you very much, Allen.
"common-user@ would likely have been better, but I'm too lazy to forward you there today. :)"

Thank you :-)

"Do you want monitoring information or metrics information?"

I need monitoring information. I am working on deploying Hadoop on a small cluster. For now, I am interested in restarting (restarting the node, or even rebooting the OS) the nodes Hadoop detects as crashed.

"Instead, one should monitor the namenode and jobtracker and alert based on a percentage of availability. ..."

Indeed. I use Hadoop 0.20.203.

"This can be done in a variety of ways, ..."

Can you please provide some pointers? Do you know how I can access the monitoring information of the namenode or the jobtracker so I can extract a list of failed nodes?

Thank you very much for your help.

P.S.: The reason I thought of using metrics information is that metrics are periodic and seemed easy to access. I thought of using them as heartbeats only (i.e., if I do not receive the metric within 2-3 periods, I reset the node).

Thank you
-sam

________________________________
From: Allen Wittenauer <a...@apache.org>
To: mapreduce-user@hadoop.apache.org
Sent: Tue, July 12, 2011 3:13:42 PM
Subject: Re: How to query a slave node for monitoring information

On Jul 12, 2011, at 3:02 PM, <samdispmail-tru...@yahoo.com> wrote:

> I am new to Hadoop, and I apologize if this was answered before, or if this
> is not the right list for my question.

common-user@ would likely have been better, but I'm too lazy to forward you there today. :)

> I am trying to do the following:
> 1- Read monitoring information from slave nodes in Hadoop
> 2- Process the data to detect node failures (node crash, problems in requests,
> etc.) and decide if I need to restart the whole machine.
> 3- Restart the machine running the slave facing problems

At scale, one doesn't monitor individual nodes for up/down.
Verifying the up/down of a given node will drive you insane and is pretty much a waste of time unless the grid itself is under-configured to the point that *every* *node* *counts*. (If that is the case, then there are bigger issues afoot...)

Instead, one should monitor the namenode and jobtracker and alert based on a percentage of availability. This can be done in a variety of ways, depending upon which version of Hadoop is in play. For 0.20.2, a simple screen scrape is good enough. I recommend warn on 10%, alert on 20%, panic on 30%.

> My question is for step 1- collecting monitoring information.
> I have checked Hadoop monitoring features. But currently you can forward the
> monitoring data to files, or to Ganglia.

Do you want monitoring information or metrics information? Ganglia is purely a metrics tool. Metrics are a different animal. While it is possible to alert on them, in most cases they aren't particularly useful in a monitoring context other than up/down.
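[Editor's note: the "percentage of availability" approach above can be sketched in a few lines of Python. This is only an illustration, not code from the thread: it assumes the 0.20.x-era `hadoop dfsadmin -report` summary line looks like `Datanodes available: 3 (4 total, 1 dead)` (the exact wording may differ by version), and the function names are hypothetical. The thresholds are the 10%/20%/30% levels Allen suggests.]

```python
import re

def dead_datanode_stats(report_text):
    """Parse the summary line of `hadoop dfsadmin -report` output.

    Assumes a 0.20.x-style summary line such as:
        Datanodes available: 3 (4 total, 1 dead)
    Returns (available, total, dead) as ints, or None if no such line.
    """
    m = re.search(
        r"Datanodes available:\s*(\d+)\s*\((\d+) total,\s*(\d+) dead\)",
        report_text,
    )
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())

def dead_node_severity(dead, total):
    """Map the dead-node fraction to the suggested alert levels:
    warn at 10%, alert at 20%, panic at 30% of total nodes."""
    if total == 0:
        return "unknown"
    pct = 100.0 * dead / total
    if pct >= 30:
        return "panic"
    if pct >= 20:
        return "alert"
    if pct >= 10:
        return "warn"
    return "ok"

# In a cron job you would feed it live output, e.g. (hypothetical):
#   report = subprocess.check_output(["hadoop", "dfsadmin", "-report"]).decode()
#   avail, total, dead = dead_datanode_stats(report)
#   print(dead_node_severity(dead, total))
```

The same percentage logic applies to the jobtracker side; only the scrape source changes.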