On 20 June 2013 17:49, Skylar Thompson <[email protected]> wrote:
> We have our own custom Nagios plugins we use. It basically parses the
> "qstat -xml" output and looks for hosts that are
> disabled/alarming/unavailable. Currently each node check requires a call
> to the qmaster, which is a lot of overhead, so we only poll every four
> hours. We have load sensors running on our exec hosts that will
> immediately raise an alarm if the node reports hardware problems
> (monitored via local ipmitool calls, and OpenManage for some), disk
> space, or out-of-memory conditions (checked via parsing dmesg). This
> means that new jobs are immediately prevented from running on those
> nodes, so we really just clean stuff up once a day or even less often.

We have something similar here. Rather than active checks, we have a
script/daemon that regularly parses the qstat output looking for nodes in
an alarmed state, determines the cause of the alarm, and pushes the results
to our Nagios/Opsview server. The Nagios/Opsview server is configured to
flag a problem if it doesn't receive an update for a service for a while.
This means we're only running one qstat command to check the entire
cluster, so the load on Grid Engine isn't much. We also push the results to
Opsview with a single send_nsca command, which helps keep the load on the
Opsview server low.

One thing we've noticed is that Grid Engine sometimes retains old load
values for uncontactable hosts, so an explicit check for queues in an
uncontactable state is necessary.

Somewhere on our to-do list is to go the other way: writing a wrapper to
convert the output of Nagios plugins to Grid Engine load sensor format, so
we can use all the nice pre-written Nagios plugins to inform Grid Engine of
issues with a node.

> Eventually, we'll probably have the qmaster generate a cache of "qstat
> -xml" regularly and just parse that. Even longer-term, we'd like to dump
> that into a network-accessible message bus so that anything can make use
> of those data.
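For anyone wanting to try the single-qstat approach, here's a rough Python
sketch of the parsing step. Caveats: it assumes qstat is on PATH, it uses
"qstat -f -xml" (queue states don't appear without -f), the Queue-List/name/
state element names follow the common schema but vary between Grid Engine
versions, and the set of "bad" state letters is illustrative, not complete:

```python
# Minimal sketch, not a production plugin. Assumes "qstat -f -xml" output
# with Queue-List/name/state elements; adjust for your Grid Engine version.
import subprocess
import xml.etree.ElementTree as ET


def alarmed_queues(xml_text):
    """Return (queue_name, state) pairs for queue instances whose state
    string contains any of the letters we treat as a problem."""
    bad = set("adEu")  # alarm/disabled/Error/unreachable -- illustrative set
    results = []
    root = ET.fromstring(xml_text)
    for q in root.iter("Queue-List"):
        name = q.findtext("name", "")
        state = q.findtext("state", "")
        if set(state) & bad:
            results.append((name, state))
    return results


def poll_cluster():
    """Run qstat once for the whole cluster and return the alarmed queues.
    One qmaster call per poll, per the approach described above."""
    out = subprocess.run(
        ["qstat", "-f", "-xml"], capture_output=True, text=True, check=True
    ).stdout
    return alarmed_queues(out)
```

The results from poll_cluster() would then be batched into a single
send_nsca submission rather than one active check per host.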
> -- Skylar Thompson ([email protected])
> -- Genome Sciences Department, System Administrator
> -- Foege Building S046, (206)-685-7354
> -- University of Washington School of Medicine
>
> On 06/20/13 09:41, Dave Love wrote:
> > Tina Friedrich <[email protected]> writes:
> >
> >>> Which do you have in mind? I've a nasty feeling I made mods to one
> >>> which I've never distributed, but I think it should be changed to do
> >>> passive monitoring anyhow, just running qstat/qhost once.
> >>
> >> Oh. Called "check_sge" (written in python). Was that the one? I never
> >> noticed any performance problems as such.
> >
> > It doesn't cause particular problems on not-very-large clusters here,
> > but it's clearly making a lot more calls on qmaster than necessary,
> > albeit less demanding ones, plus invocations of the script.
> >
> >>> Of course, if you just want to check if an execd is alive, you only
> >>> need check_tcp on the right port.
> >>
> >> True; I think the plugin did/does more than that - checks for error
> >> states etc.
> >
> > Right. I'll try to check what, if anything, I did to it.
> >
> > I had intended to write notes on monitoring, but have never got round to
> > it. If anyone else would like to contribute...
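The check_tcp suggestion quoted above amounts to a plain TCP connect
against the execd port. A minimal Python equivalent is below; it assumes
the IANA-registered default sge_execd port 6445, which site installs may
override (e.g. via $SGE_EXECD_PORT or /etc/services):

```python
# Minimal stand-in for Nagios check_tcp against sge_execd. Only proves that
# something is accepting connections on the port, not that execd is healthy.
import socket


def execd_alive(host, port=6445, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure, ...
        return False
```

A Nagios wrapper around this would exit 0 (OK) when it returns True and
2 (CRITICAL) otherwise; as noted in the thread, check_sge-style plugins go
further and also inspect queue error states.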
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
