Re: nagios to monitor hadoop datanodes!
All I have to say is wow! I had never tried jconsole before. I have hadoop-trunk checked out, and the JMX interface exposes all kinds of great information. I am going to look at how I can get JMX, Cacti, and Hadoop working together. Just as an FYI, there are now separate environment variables for each daemon; if you override HADOOP_OPTS you get a port conflict. It should look like this: export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=10001" Thanks, Brian.
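Spelled out, a hadoop-env.sh fragment along the lines Edward describes might look like this. This is a sketch, not a tested config: the port numbers are arbitrary examples, and the point is simply that each daemon gets its own JMX port so nothing collides when two daemons start on the same host.

```shell
# Hypothetical hadoop-env.sh fragment: one JMX port per daemon.
# Ports 10001/10002 are illustrative; pick any free ports.
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=10001"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=10002"
```

Note that authentication and SSL are disabled here, as in the example above, which is only sane on a trusted network.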
Re: nagios to monitor hadoop datanodes!
The simple way would be to use NRPE and check_procs. I have never tested it, but a command like 'ps -ef | grep java | grep NameNode' would be a fairly decent check. It is not very robust, but it should let you know whether the process is alive. You could also monitor the web interfaces associated with the different servers remotely: check_tcp!hadoop1:56070 Both of the methods I suggested are quick hacks. I am going to investigate the JMX options as well and work them into Cacti.
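The ps-based check above could be wrapped as a small Nagios-style plugin, roughly like this. The function name and output strings are illustrative, not an existing plugin; it just maps the grep result onto the standard Nagios exit codes (0 = OK, 2 = CRITICAL).

```shell
# Sketch of a process-liveness check: pass the daemon class name
# (e.g. NameNode, DataNode) and get Nagios-style output and exit status.
check_hadoop_proc() {
    daemon="$1"
    # grep -v grep drops our own pipeline from the process list
    if ps -ef | grep java | grep -v grep | grep -q "$daemon"; then
        echo "OK - $daemon java process is running"
        return 0
    else
        echo "CRITICAL - no $daemon java process found"
        return 2
    fi
}
```

An NRPE command entry could then invoke it as `check_hadoop_proc NameNode` on each node.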
Re: nagios to monitor hadoop datanodes!
Hey Edward, The JMX documentation for Hadoop is non-existent, but here's roughly what you need to do: 1) Download and install the check_jmx Nagios plugin. 2) Open up the Hadoop JMX interface to the outside world. I added the following line to hadoop-env.sh: export HADOOP_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=8004" Note the potential security issue I'm opening up. You could also switch things to SSL auth, but I have not explored that thoroughly in combination with Nagios. 3) Restart Hadoop. 4) Use jconsole to connect to Hadoop's JVM. Look in the MBeans tab and decide what metrics you want to monitor. If you look at the Info tab (the last one on the right), you'll see the MBean Name; you'll need to remember this later. 5) Add Nagios probes like so: ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 1000 -c 1 This connects to node182 on port 8004. It then looks at the memory statistics (java.lang:type=Memory), at the HeapMemoryUsage attribute, and the used field inside that attribute (in jconsole, if you see a value in bold, you need to double-click to expand its contents). Here I set the critical level of the metric to 1 byte of memory used and the warning level to 1000 bytes.
The result looks like this: [EMAIL PROTECTED] plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 1000 -c 1 JMX OK HeapMemoryUsage.used=9780336 If I poke a dead JVM (or change to the wrong port), I get the following: [EMAIL PROTECTED] plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8005/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 1000 -c 1 JMX CRITICAL Connection refused If I lower the critical level to below the current usage, I get: [EMAIL PROTECTED] plugin]$ ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w 10 -c 100 JMX CRITICAL HeapMemoryUsage.used=4846000 THE BIG PROBLEM here is that Hadoop hides a lot of interesting datanode statistics behind a random name. Want the max time it took to do the block reports? For me, the query looks like this: [EMAIL PROTECTED] plugin]# ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O hadoop.dfs:service=DataNode-DS-1394617310-172.16.1.182-50010-122278610129,name=DataNodeStatistics -A BlockReportsMaxTime -w 10 -c 100 JMX CRITICAL hadoop.dfs:service=DataNode-DS-1394617310-172.16.1.182-50010-122278610129,name=DataNodeStatistics Here the service is called DataNode-DS-1394617310-172.16.1.182-50010-122278610129, which really causes Hadoop to shoot itself in the foot with regards to Nagios monitoring. Locally, we patch things so the random string goes away: [EMAIL PROTECTED] plugin]# ./check_jmx -U service:jmx:rmi:///jndi/rmi://node182:8004/jmxrmi -O hadoop.dfs:service=DataNode,name=DataNodeStatistics -A BlockReportsMaxTime -w 10 -c 150 JMX WARNING BlockReportsMaxTime=141 Care to file a bug for that, anyone? I assume you can set up Nagios from there. Brian On Oct 8, 2008, at 8:20 AM, Edward Capriolo wrote: The simple way would be to use NRPE and check_procs. [...]
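To wire a check_jmx probe like the ones above into Nagios itself, the object definitions might look roughly like the following. This is a sketch under assumptions: the command name, hostgroup, plugin path, and thresholds are all placeholders, and port 8004 matches the hadoop-env.sh example earlier in the thread.

```
# Hypothetical Nagios object definitions; adjust paths and thresholds.
define command {
    command_name  check_hadoop_heap
    command_line  /usr/local/nagios/libexec/check_jmx -U service:jmx:rmi:///jndi/rmi://$HOSTADDRESS$:8004/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -w $ARG1$ -c $ARG2$
}

define service {
    use                  generic-service
    hostgroup_name       hadoop-datanodes
    service_description  DataNode heap usage
    check_command        check_hadoop_heap!500000000!900000000
}
```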
Re: nagios to monitor hadoop datanodes!
Edward Capriolo wrote: Both the methods I suggested are quick hacks. I am going to investigate the JMX options as well and work them into Cacti. [...] We're developing liveness and pings under a couple of JIRA issues; nothing will be released before 0.20: https://issues.apache.org/jira/browse/HADOOP-3628 https://issues.apache.org/jira/browse/HADOOP-3969 I don't consider hitting the web page a quick hack; for HADOOP-3969 I'd quite like the public liveness test to be a page you can GET or HEAD, as that way it becomes trivial for your existing web-page health-checking code to pull in all the Hadoop services. The best bit: when it fails, the ops team can point their browser at the same URL and see what is up. And if you are a standalone developer, you are the ops team! -steve -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: nagios to monitor hadoop datanodes!
That all sounds good. By 'quick hack' I meant that 'check_tcp' was not good enough, because an open TCP socket does not prove much. However, if the page returns useful attributes that show the cluster is alive, that is great and easy. Come to think of it, you can navigate the dfshealth page and get useful information from it.
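A web-UI check along these lines might be sketched as follows. The dfshealth.jsp path and the "Live Nodes" marker string are assumptions about the NameNode status page of this era, so adjust both for your Hadoop version; the point is that fetching and inspecting the page proves more than an open TCP socket does.

```shell
# Sketch: fetch the NameNode status page and look for a marker string.
# Returns Nagios-style codes: 0 OK, 1 WARNING, 2 CRITICAL.
check_dfshealth() {
    url="$1"
    page=$(curl -sf --max-time 10 "$url") || {
        echo "CRITICAL - cannot fetch $url"
        return 2
    }
    if echo "$page" | grep -q "Live Nodes"; then
        echo "OK - dfshealth page answered and mentions live nodes"
        return 0
    fi
    echo "WARNING - page answered but no 'Live Nodes' marker found"
    return 1
}
```

Called as `check_dfshealth http://namenode:50070/dfshealth.jsp`, this gives the ops team the same URL to open in a browser when the check goes red.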
Re: nagios to monitor hadoop datanodes!
Try JMX. There should also be a JMX-to-SNMP bridge available somewhere: http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote: Hi Everyone! I would like to implement Nagios health monitoring of a Hadoop grid. [...]
Re: nagios to monitor hadoop datanodes!
Hey Stefan, Is there any documentation for making JMX work in Hadoop? Brian On Oct 7, 2008, at 7:03 PM, Stefan Groschupf wrote: Try JMX. There should also be a JMX-to-SNMP bridge available somewhere. [...]
Re: nagios to monitor hadoop datanodes!
Hadoop already has JMX integrated; you can extend it to monitor what you want, though that requires modifying some code to add counters and the like. One thing you need to be careful of is that Hadoop does not include any JMXConnectorServer, so you need to start one JMXConnectorServer for every Hadoop process you want to monitor. This is what we have done to monitor Hadoop. We have not checked out Nagios for Hadoop, so no word on Nagios. Hope it helps. On 2008-10-08 at 8:34 AM, Brian Bockelman wrote: Hey Stefan, Is there any documentation for making JMX work in Hadoop? [...]
nagios to monitor hadoop datanodes!
Hi Everyone! I would like to implement Nagios health monitoring of a Hadoop grid. Some of you have experience here; do you have any approach or advice I could use? At this time I've only been playing with the JSP files that Hadoop has integrated into it, so I'm not sure whether it would be a good idea for Nagios to request monitoring info from these JSPs. Thanks in advance! -- Gerardo
Re: nagios to monitor hadoop datanodes!
The easiest approach I can think of is to write a simple Nagios plugin that checks whether the datanode JVM process is alive. Or you could write a Nagios plugin that checks for error or warning messages in the datanode logs. (I am sure you can find quite a few log-checking Nagios plugins at nagiosplugin.org.) If you are unsure how to write a Nagios plugin, I suggest you read "Leverage Nagios with plug-ins you write" http://www.ibm.com/developerworks/aix/library/au-nagios/ as it has good explanations and examples of how to write one. Or, if you've got time to burn, you might want to read the Nagios documentation, too. Let me know if you need help on this matter. /Taeho On Tue, Oct 7, 2008 at 2:05 AM, Gerardo Velez [EMAIL PROTECTED] wrote: Hi Everyone! I would like to implement Nagios health monitoring of a Hadoop grid. [...]
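The log-scanning idea above can be sketched as a minimal plugin like this. The thresholds, the ERROR/FATAL pattern, and the function name are placeholders; a real deployment would also want to track the file offset so old errors are not counted forever.

```shell
# Sketch: count ERROR/FATAL lines in a log and map the count onto
# Nagios states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
check_hadoop_log() {
    logfile="$1"; warn="${2:-1}"; crit="${3:-10}"
    [ -r "$logfile" ] || { echo "UNKNOWN - cannot read $logfile"; return 3; }
    # grep -c prints 0 on no match but exits nonzero, hence the || true
    errors=$(grep -c -E 'ERROR|FATAL' "$logfile" || true)
    if [ "$errors" -ge "$crit" ]; then
        echo "CRITICAL - $errors error lines in $logfile"
        return 2
    elif [ "$errors" -ge "$warn" ]; then
        echo "WARNING - $errors error lines in $logfile"
        return 1
    fi
    echo "OK - $errors error lines in $logfile"
    return 0
}
```

Invoked as `check_hadoop_log /var/log/hadoop/hadoop-datanode.log 1 10` (path hypothetical) from NRPE, it behaves like any other threshold-based Nagios check.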