Edward Capriolo wrote:
The simple way would be to use nrpe and check_proc. I have never
tested, but a command like 'ps -ef | grep java | grep NameNode' would
be a fairly decent check. That is not very robust but it should let
you know if the process is alive.
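As a rough sketch of the same idea in Python (untested against a real cluster): the helper names below are made up for illustration, and the match is a plain substring test, so "NameNode" will also match a SecondaryNameNode process, just as the grep pipeline would.

```python
import subprocess


def find_java_process(ps_lines, class_name):
    """Return the ps lines that look like a java process running class_name.

    Hypothetical helper: does the same job as
    'ps -ef | grep java | grep <class_name>', but skips the grep
    processes themselves, which would otherwise match their own arguments.
    """
    matches = []
    for line in ps_lines:
        if "grep" in line:
            continue
        if "java" in line and class_name in line:
            matches.append(line)
    return matches


def namenode_is_running():
    """Run 'ps -ef' locally and report whether a NameNode JVM shows up."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return bool(find_java_process(result.stdout.splitlines(), "NameNode"))
```

Like the shell one-liner, this only tells you a process exists, not that it is answering requests.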
You could also monitor the web interfaces associated with the
different servers remotely.
check_tcp!hadoop1:56070
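If you are not running Nagios, roughly the same test can be sketched in Python. The function name is hypothetical, and the host and port in the usage comment are simply the ones from the check_tcp line above; all it does is try to open a TCP connection.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only proves that something is listening on the port; it says
    nothing about whether the service behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


# Example use, with the host and port from the check above:
#   tcp_port_open("hadoop1", 56070)
```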
Both the methods I suggested are quick hacks. I am going to
investigate the JMX options as well and work them into cacti.
We're developing liveness and pings under a couple of JIRA issues;
nothing will be released before 0.20.
https://issues.apache.org/jira/browse/HADOOP-3628
https://issues.apache.org/jira/browse/HADOOP-3969
I don't consider hitting the web page a quick hack; for HADOOP-3969 I'd
quite like to have the public liveness test a page you can GET or HEAD,
as that way it becomes trivial for your existing web page health
checking code to pull in all the hadoop services. The best bit: when it
fails, the ops team can point their browser at the same URL and see what
is up. And if you are a standalone developer, you are the ops team!
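To show what such a check could look like from the monitoring side, here is a minimal sketch in Python. Everything in it is an assumption: the function name is made up, the URL is whatever page the service ends up exposing (HADOOP-3969 has not fixed one), and "any 2xx answer to a HEAD means alive" is just one plausible convention.

```python
import http.client
from urllib.parse import urlsplit


def http_alive(url, timeout=5.0):
    """Send a HEAD request to url and return True on a 2xx response.

    Hypothetical liveness probe: the URL and the 2xx rule are
    assumptions, not anything defined by Hadoop.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("HEAD", path)
        status = conn.getresponse().status
        return 200 <= status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure or a malformed reply.
        return False
    finally:
        conn.close()
```

When this returns False, the same URL can be pasted into a browser to see what the service says about itself.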
-steve
--
Steve Loughran http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/