Edward Capriolo wrote:
The simple way would be to use nrpe and check_proc. I have never
tested it, but a command like 'ps -ef | grep java | grep NameNode' would
be a fairly decent check. It is not very robust, but it should let
you know if the process is alive.
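
For anyone who wants the same check as a script rather than a shell pipeline, here is a minimal sketch in Python. The function names and the messages are made up for illustration; only the 'ps -ef' filtering mirrors the command above, and the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import subprocess


def namenode_running(ps_output: str) -> bool:
    """Return True if any 'ps' line mentions both java and NameNode.

    Same substring matching as 'grep java | grep NameNode', so it is
    just as loose: a SecondaryNameNode line would also match.
    """
    return any(
        "java" in line and "NameNode" in line
        for line in ps_output.splitlines()
    )


def main() -> int:
    # Run 'ps -ef' and apply the filter to its output.
    ps_output = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
    if namenode_running(ps_output):
        print("OK - NameNode process found")
        return 0
    print("CRITICAL - no NameNode process found")
    return 2
```

To use it as a plugin, call main() and pass its return value to sys.exit(). It only shows that a process exists, not that the process is answering requests.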

You could also monitor the web interfaces associated with the
different servers remotely.

check_tcp!hadoop1:56070
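
Outside Nagios, roughly the same test is a plain TCP connect. A minimal sketch, assuming nothing beyond the Python standard library; the function name and the timeout default are illustrative:

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False
```

For the example above that would be tcp_port_open("hadoop1", 56070). Like check_tcp, this only shows that something is listening on the port.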

Both the methods I suggested are quick hacks. I am going to
investigate the JMX options as well and work them into cacti.

We're developing liveness checks and pings under a couple of JIRA issues; nothing will be released before 0.20.

https://issues.apache.org/jira/browse/HADOOP-3628
https://issues.apache.org/jira/browse/HADOOP-3969

I don't consider hitting the web page a quick hack. For HADOOP-3969 I'd quite like the public liveness test to be a page you can GET or HEAD, as that way it becomes trivial for your existing web page health checking code to pull in all the hadoop services. The best bit: when it fails, the ops team can point their browser at the same URL and see what is up. And if you are a standalone developer, you are the ops team!
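
To show how little a health checker needs for this, here is a minimal sketch of a HEAD-based probe in Python. It is not the HADOOP-3969 design, which is still being worked out; the function name is made up, and treating any 2xx response as "alive" is an assumption.

```python
import urllib.error
import urllib.request


def http_alive(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to url gets a 2xx response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP error statuses (4xx/5xx), refused connections and
        # timeouts all count as "not alive".
        return False
```

Against the host and port from the check_tcp example, the call would be http_alive("http://hadoop1:56070/"); the path is a guess, since the liveness page itself is not defined yet.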

-steve

--
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
