We're trying to set up cluster monitoring with Nagios, but I think we've hit a snag. Our biggest cluster is 890 machines, and Nagios is truncating the host list at around 316 entries, probably because of a character limit on the arguments passed to check_cluster.
We wrote another script that divides the cluster nodes into smaller batches, calls check_cluster once per batch, and tallies the results. The catch is that check_cluster only sums the inputs passed to it by Nagios; it doesn't check the status cache itself. It looks like an older version of the plugin could be pointed at status.dat and count states directly, but the newest version cannot. At the moment, the most promising path looks like writing a script that parses status.dat and counts states by hand. Not pretty, but it'll work. Is there something else I'm missing that might be easier?

For reference, we maintain research computing clusters, and several hundred nodes per cluster is common in our environment. I'm trying to monitor each cluster and start notifying our students at certain thresholds, escalating up to paging the admins.

Thanks
-Randy
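
A rough sketch of the status.dat approach described above, in Python. It assumes the Nagios 3 file layout, where each host appears in a `hoststatus { ... }` block containing `host_name=` and `current_state=` lines, and that host state codes are 0 (UP), 1 (DOWN), and 2 (UNREACHABLE). The function names and the threshold handling are made up for illustration and are not part of check_cluster or any other plugin.

```python
from collections import Counter

# Assumed host state codes in status.dat: 0=UP, 1=DOWN, 2=UNREACHABLE.
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def parse_host_states(text):
    """Return {host_name: current_state} for each hoststatus block in text."""
    states = {}
    block = None  # dict of key/value pairs while inside a hoststatus block
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hoststatus") and line.endswith("{"):
            block = {}
        elif line == "}":
            if block is not None and "host_name" in block and "current_state" in block:
                states[block["host_name"]] = int(block["current_state"])
            block = None
        elif block is not None and "=" in line:
            # Split on the first '=' only; values such as plugin output may contain '='.
            key, _, value = line.partition("=")
            block[key] = value
    return states


def count_states(states, members):
    """Count state names for the given cluster members.

    Hosts that are missing from status.dat are counted as UNKNOWN.
    """
    counts = Counter()
    for host in members:
        counts[HOST_STATES.get(states.get(host), "UNKNOWN")] += 1
    return counts


def cluster_exit_code(not_up, warn, crit):
    """Map the number of hosts that are not UP to a plugin exit code.

    Returns 0 (OK), 1 (WARNING), or 2 (CRITICAL).
    """
    if not_up >= crit:
        return 2
    if not_up >= warn:
        return 1
    return 0
```

To use it, read the contents of status.dat (its location is whatever `status_file` is set to in nagios.cfg), pass the text to `parse_host_states`, and pass the cluster's host names to `count_states`. Because the member list is read by the script instead of being passed on the command line, the argument length limit no longer applies.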
