Re: nagios to monitor hadoop datanodes!

2008-10-29 Thread Edward Capriolo
All I have to say is wow! I had never tried jconsole before. I have
Hadoop trunk checked out, and JMX exposes all kinds of great
information. I am going to look at how I can get JMX, Cacti, and Hadoop
working together.

Just as an FYI, there are separate env variables for each daemon now. If you
override HADOOP_OPTS you get a port conflict. It should be like this:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=10001"
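Since this thread is about monitoring datanodes, a matching DataNode entry in hadoop-env.sh might look like the sketch below (the variable follows the same naming pattern; port 10002 is an assumption, the point being that each daemon needs its own port to avoid the conflict mentioned above):

```shell
# Sketch for hadoop-env.sh: give the DataNode its own JMX port.
# Port 10002 is illustrative; each daemon must use a distinct port
# or the JMX agents will collide on startup.
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=10002"
```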

Thanks Brian.


Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
The simple way would be to use NRPE and check_procs. I have never
tested it, but a command like 'ps -ef | grep java | grep NameNode' would
be a fairly decent check. That is not very robust, but it should let
you know if the process is alive.
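That ps-and-grep idea can be sketched as a small Nagios-style plugin (the class-name pattern, function name, and messages are all illustrative; pgrep -f avoids the classic problem of grep matching its own grep):

```shell
# Minimal sketch of a process-liveness check in the Nagios plugin style.
# pgrep -f searches full command lines, so it finds "java ... NameNode"
# without the "grep matching its own grep" problem of ps -ef | grep.
check_proc_alive() {
    if pgrep -f "java.*$1" > /dev/null; then
        echo "OK: $1 process is alive"
        return 0
    else
        echo "CRITICAL: no $1 process found"
        return 2
    fi
}

# Demo call; a real plugin would finish with: check_proc_alive NameNode; exit $?
check_proc_alive NameNode
status=$?
echo "plugin exit status: $status"
```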

You could also monitor the web interfaces associated with the
different servers remotely.

check_tcp!hadoop1:56070

Both of the methods I suggested are quick hacks. I am going to
investigate the JMX options as well and work them into Cacti.


Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Brian Bockelman

Hey Edward,

The JMX documentation for Hadoop is non-existent, but here's roughly
what you need to do:


1) Download and install the check_jmx Nagios plugin.
2) Open up the Hadoop JMX install to the outside world. I added the
following line to hadoop-env.sh:

export HADOOP_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=8004"


Note the potential security issue I'm opening up. You could also
switch things to SSL auth, but I have not explored that thoroughly in
combination with Nagios.

3) Restart Hadoop.
4) Use jconsole to connect to Hadoop's JVM. Look in the MBeans tab
and decide what metrics you want to monitor. If you look at the
Info tab (the last on the right), you'll see the MBean name; you'll
need to remember this later.

5) Add Nagios probes like so:

./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 1000 -c 1

This connects to node182 on port 8004. It then looks at the memory
statistics (java.lang:type=Memory), at the HeapMemoryUsage attribute, and
at the used field inside that attribute (in jconsole, if you see a value
in bold, you need to double-click to expand its contents). I then set
the critical level of the metric to 1 bytes of memory used
and the warning level to 1000 bytes.


The result is like this:

[EMAIL PROTECTED] plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 1000 -c 1

JMX OK HeapMemoryUsage.used=9780336

If I poke a dead JVM (or change to the wrong port), I get the
following:


[EMAIL PROTECTED] plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8005/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 1000 -c 1

JMX CRITICAL Connection refused

If I lower the critical level below the current usage, I get:

[EMAIL PROTECTED] plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 10 -c 100

JMX CRITICAL HeapMemoryUsage.used=4846000


THE BIG PROBLEM here is that Hadoop decides to hide a lot of
interesting DataNode statistics behind a random name; want the max
time it took to do the block reports? For me, the query looks like
this:


[EMAIL PROTECTED] plugin]# ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O hadoop.dfs:service=DataNode-DS-1394617310-172.16.1.182-50010-122278610129,name=DataNodeStatistics -A BlockReportsMaxTime -w 10 -c 100

JMX CRITICAL hadoop.dfs:service=DataNode-DS-1394617310-172.16.1.182-50010-122278610129,name=DataNodeStatistics


Here the service is called
DataNode-DS-1394617310-172.16.1.182-50010-122278610129, which really causes
Hadoop to shoot itself in the foot with regard to Nagios monitoring.
Locally, we patch things so the random string goes away:


[EMAIL PROTECTED] plugin]# ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O hadoop.dfs:service=DataNode,name=DataNodeStatistics -A BlockReportsMaxTime -w 10 -c 150

JMX WARNING BlockReportsMaxTime=141

Care to file a bug for that, anyone?

I assume you can set up Nagios from there.
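From there, wiring the probe into Nagios is the usual object-definition step; a sketch under assumptions (the command name, service name, plugin-path macro, and thresholds are all mine, not from the thread):

```text
# commands.cfg (sketch)
define command{
        command_name    check_hadoop_heap
        command_line    $USER1$/check_jmx -U service:jmx:rmi:///jndi/rmi://$HOSTADDRESS$:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w $ARG1$ -c $ARG2$
        }

# services.cfg (sketch)
define service{
        use                     generic-service
        host_name               node182
        service_description     Hadoop heap usage
        check_command           check_hadoop_heap!1000!1
        }
```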

Brian

On Oct 8, 2008, at 8:20 AM, Edward Capriolo wrote:






Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Steve Loughran

Edward Capriolo wrote:



We're developing liveness and pings under a couple of JIRA issues; 
nothing will be released before 0.20


https://issues.apache.org/jira/browse/HADOOP-3628
https://issues.apache.org/jira/browse/HADOOP-3969

I don't consider hitting the web page a quick hack; for HADOOP-3969 I'd
quite like the public liveness test to be a page you can GET or HEAD,
as that way it becomes trivial for your existing web-page health-checking
code to pull in all the Hadoop services. The best bit: when it
fails, the ops team can point their browser at the same URL and see what
is up. And if you are a standalone developer, you are the ops team!
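That GET/HEAD style of probe is trivial from any monitoring harness; a sketch (the host and port are assumptions, 50070 simply being the common NameNode HTTP port of that era):

```shell
# Probe a Hadoop daemon's web UI with a HEAD request and report the
# HTTP status code; 000 means the connection itself failed.
code=$(curl -s -o /dev/null -w "%{http_code}" --head --max-time 5 "http://hadoop1:50070/" || true)
echo "HTTP status: $code"
```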


-steve

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
That all sounds good. By 'quick hack' I meant that 'check_tcp' was not
good enough, because an open TCP socket does not prove much. However,
if the page returns useful attributes that show the cluster is alive,
that is great and easy.

Come to think of it, you can navigate the dfshealth page and get useful
information from it.
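Scraping that page for a liveness signal could look like the sketch below (the host, port, and the "Live Nodes" label are assumptions about the dfshealth.jsp of this Hadoop era; adjust to whatever your page actually prints):

```shell
# Fetch dfshealth.jsp and look for a live-nodes marker, returning
# Nagios-style status codes (0 = OK, 2 = CRITICAL).
check_dfshealth() {
    url="http://$1:$2/dfshealth.jsp"
    if curl -sf --max-time 10 "$url" | grep -q "Live Nodes"; then
        echo "OK: $url reports live nodes"
        return 0
    fi
    echo "CRITICAL: cannot fetch $url or no live-node info"
    return 2
}

check_dfshealth hadoop1 50070
status=$?
echo "plugin exit status: $status"
```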


Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread Stefan Groschupf

Try JMX. There should also be a JMX-to-SNMP bridge available somewhere.
http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote:






Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread Brian Bockelman

Hey Stefan,

Is there any documentation for making JMX work in Hadoop?

Brian

On Oct 7, 2008, at 7:03 PM, Stefan Groschupf wrote:






Re: nagios to monitor hadoop datanodes!

2008-10-07 Thread 何永强
Hadoop already has JMX integrated; you can extend it to implement
whatever you want to monitor, though that requires modifying some code
to add counters or something like that.
One thing you may need to be careful of is that Hadoop does not include a
JMXConnectorServer; you need to start one for every Hadoop process
you want to monitor.
This is what we have done on Hadoop to monitor it. We have not checked
out Nagios for Hadoop, so no word on Nagios.

Hope it helps.
On 2008-10-08, at 8:34 AM, Brian Bockelman wrote:










nagios to monitor hadoop datanodes!

2008-10-06 Thread Gerardo Velez
Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

Do any of you have some experience here? Do you have any approach or advice I
could use?

At this time I've only been playing with the JSP files that Hadoop has
integrated into it, so I'm not sure whether it would be a good idea for
Nagios to request monitoring info from these JSPs.


Thanks in advance!


-- Gerardo


Re: nagios to monitor hadoop datanodes!

2008-10-06 Thread Taeho Kang
The easiest approach I can think of is to write a simple Nagios plugin that
checks whether the DataNode JVM process is alive. Or you may
write a Nagios plugin that checks for error or warning messages in the
DataNode logs. (I am sure you can find quite a few log-checking Nagios
plugins on nagiosplugin.org.)

If you are unsure of how to write a Nagios plugin, I suggest you read
"Leverage Nagios with plug-ins you write"
(http://www.ibm.com/developerworks/aix/library/au-nagios/), as it has good
explanations and examples of how to write one.
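The log-checking idea mentioned above can be sketched roughly like this (the log path, patterns, and messages are all assumptions; Nagios only cares about the exit status and the first line of output):

```shell
# Minimal sketch of a log-checking plugin following the Nagios status
# convention: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
LOG="${1:-/var/log/hadoop/hadoop-datanode.log}"   # hypothetical default path
if [ ! -r "$LOG" ]; then
    echo "UNKNOWN: cannot read $LOG"
    status=3
elif grep -q "ERROR" "$LOG"; then
    echo "CRITICAL: ERROR entries found in $LOG"
    status=2
elif grep -q "WARN" "$LOG"; then
    echo "WARNING: WARN entries found in $LOG"
    status=1
else
    echo "OK: no errors or warnings in $LOG"
    status=0
fi
echo "plugin exit status: $status"   # a real plugin would do: exit $status
```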

Or if you've got time to burn, you might want to read the Nagios
documentation, too.

Let me know if you need help on this matter.

/Taeho



On Tue, Oct 7, 2008 at 2:05 AM, Gerardo Velez [EMAIL PROTECTED] wrote:
