I'm trying to implement Nagios health monitoring of a Hadoop grid. (For those who don't know, Nagios is monitoring software that schedules and manages service checks.) General tips are welcome, too.
As far as I know, the easiest and most decoupled way to monitor the grid is to have a script parse the jobtracker and tasktracker JSPs that a running Hadoop instance serves. My original implementation was a single script that pointed at the two JSPs on the primary namenode. However, this caused serious performance hangups, because Nagios bombarded the primary node with frequent checks.

To fix this, I'd like to distribute the script to each Hadoop datanode so that Nagios polls each node directly, instead of always going through the primary node and making it do the work for the whole grid.

The problem is job info. I can't think of a way to ask a datanode for it, since a datanode doesn't serve jobtracker.jsp; only the namenode does.

Is there:
1) a better way to get this info? I'm scripting in Perl, so writing a custom jar to find things out would be rather convoluted.
2) a straightforward way to get job status from a namenode directly?

Thanks!
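
For readers unfamiliar with this style of check, the page-scraping approach described above could be sketched roughly as follows. This is an illustration only, not the original script: it is written in Python rather than Perl, and the `Nodes: <number>` pattern, the function names, and the thresholds are all assumptions, since the actual layout of the JSP pages is not shown here. The only fixed part is the Nagios plugin convention of one status line plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).

```python
import re
import urllib.request

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical pattern: assumes the status page contains text like "Nodes: 12".
# The real JSP markup may differ and would need its own pattern.
NODE_COUNT_RE = re.compile(r"Nodes:\s*(\d+)")


def evaluate_page(html, warn_below, crit_below):
    """Map the scraped node count to an (exit_code, status_line) pair."""
    match = NODE_COUNT_RE.search(html)
    if match is None:
        return UNKNOWN, "HADOOP UNKNOWN - node count not found in page"
    nodes = int(match.group(1))
    if nodes < crit_below:
        return CRITICAL, f"HADOOP CRITICAL - {nodes} nodes reporting"
    if nodes < warn_below:
        return WARNING, f"HADOOP WARNING - {nodes} nodes reporting"
    return OK, f"HADOOP OK - {nodes} nodes reporting"


def check_url(url, warn_below, crit_below, timeout=10):
    """Fetch the status page and evaluate it; fetch failures become UNKNOWN."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return UNKNOWN, f"HADOOP UNKNOWN - cannot fetch {url}: {exc}"
    return evaluate_page(html, warn_below, crit_below)
```

A wrapper script would call `check_url(...)` with the page URL and thresholds, print the returned status line, and exit with the returned code so that Nagios can interpret the result. Keeping the parsing in `evaluate_page` separate from the fetch makes it possible to test the pattern against a saved copy of the page without touching the grid.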
