On Wed, 20 Apr 2016 at 14:14 -0000, Cam Karnes wrote:
> Been scouring the internet for a while looking for good examples of
> NHC [https://github.com/mej/nhc] implementations on HPC clusters
> using SoGE for resource management. I’ve found a few posts here and
> there via Google Groups, and Dave Love’s writeup detailing the usage
> of NHC as a load sensor for SoGE.
We still use nhc for health monitoring in our SoGE installations.
You may have found some of my notes in your searches. We are still
successfully using the process described in Message-IDs
<[email protected]> and
<[email protected]>.
My main change to our grid engine setup was the addition of two new
variables to the 'complex' configuration. Variable "HEALTHY" is a
logical variable, used in the exec node load_formula to disable
unhealthy nodes. Variable "DIAGNOSIS" is a string variable with
details about the first error found by nhc.
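As a rough illustration (a minimal Python sketch, not the script from
the emails referenced above), a per-node load sensor reporting these two
variables could look like the following. It follows the grid engine load
sensor protocol: wait for a line on stdin, answer with a begin/end block
of host:name:value lines, and exit on "quit". The nhc path, the timeout,
reporting HEALTHY as 1/0, the assumption that nhc exits non-zero and
prints a message when a check fails, and the character filtering applied
to DIAGNOSIS are all assumptions made for illustration.

```python
import re
import socket
import subprocess
import sys

NHC_CMD = ["/usr/sbin/nhc"]  # assumed install path; adjust for your site


def run_nhc(cmd=NHC_CMD, timeout=120):
    """Run nhc once and return (healthy, diagnosis)."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, "nhc did not run: %s" % exc
    if proc.returncode == 0:
        return True, "ok"
    # Assumption: the first non-empty output line describes the failure.
    lines = [ln.strip() for ln in proc.stdout.splitlines() if ln.strip()]
    detail = lines[0] if lines else "nhc exit status %d" % proc.returncode
    return False, detail


def format_report(host, healthy, diagnosis):
    """Build one load report block for the two complex variables."""
    # Conservative filtering (an assumption): keep only characters that
    # cannot be confused with the colon-separated report syntax.
    safe = re.sub(r"[^A-Za-z0-9_.,/=-]+", "_", diagnosis) or "none"
    return ("begin\n"
            "%s:HEALTHY:%d\n"
            "%s:DIAGNOSIS:%s\n"
            "end\n") % (host, 1 if healthy else 0, host, safe)


def serve(stdin=sys.stdin, stdout=sys.stdout, check=run_nhc, host=None):
    """Answer each request line from execd until 'quit' or end of input."""
    host = host or socket.gethostname()
    while True:
        line = stdin.readline()
        if not line or line.strip() == "quit":
            break
        healthy, diagnosis = check()
        stdout.write(format_report(host, healthy, diagnosis))
        stdout.flush()
```

To try this, the script would end with a call to serve() and be
registered through the load_sensor setting of the host configuration.
The HEALTHY and DIAGNOSIS entries have to exist in the complex
configuration first.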
I see that nhc has had some updates and incorporated some aspects of
my load sensor script. We should probably upgrade to a later nhc for
our use.
It also looks like the grid engine support in nhc doesn't have any
documentation on its use, and it could use some of the information
from the above emails.
> We’re currently trying to use NHC to get useful snapshots of the
> states of our HPC nodes. The checks and configurations of NHC aren’t
> a problem, but the framework itself definitely looks to be geared
> more towards SLURM and TORQUE.
Yes, it is definitely most oriented towards those. It can get tricky
to supply automatic glue into multiple systems, but I found nhc to be
a good set of tools for node health checks.
> We currently have two goals.
>
> 1) Have NHC act as the standard health checking mechanism for all
> clustered devices, wherein a check is initiated upstream by a
> centralized monitoring service (Zabbix, Nagios, Cacti, etc) and then
> return values and diagnostic messages are consumed by the same
> monitoring service, where the node’s state will be reflected (wrong
> mounts, over limit filesystems, memory free, etc).
This sounds good, and we should also consider integrating nhc checks
into our more recent icinga installation. In our current setup, icinga
and grid engine can have different opinions about the health of a
node. We also have two different systems doing the checking, and on
compute nodes I would really prefer to check things with only one
system.
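One hedged way to feed nhc results into a Nagios-style monitor such as
icinga is a small plugin wrapper. The sketch below relies only on the
standard plugin convention: exit code 0 for OK, 2 for CRITICAL, 3 for
UNKNOWN, with the first line of output used as the status text. The nhc
path, the timeout, and the assumption that nhc exits non-zero and prints
a message on a failed check are assumptions, not something verified
against a particular nhc release.

```python
import subprocess

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios/Icinga plugin exit codes


def nagios_result(returncode, output):
    """Map an nhc exit status and its output to (exit code, status line)."""
    if returncode == 0:
        return OK, "NHC OK - all checks passed"
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    detail = lines[0] if lines else "exit status %d" % returncode
    # '|' introduces performance data in plugin output, so keep it out
    # of the human-readable text.
    return CRITICAL, "NHC CRITICAL - " + detail.replace("|", "/")


def main(cmd=("/usr/sbin/nhc",), timeout=60):
    """Run nhc (assumed path) and print a one-line plugin status."""
    try:
        proc = subprocess.run(list(cmd), stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("NHC UNKNOWN - could not run nhc: %s" % exc)
        return UNKNOWN
    code, text = nagios_result(proc.returncode, proc.stdout)
    print(text)
    return code
```

As a plugin, the script would finish with sys.exit(main()) so that the
monitor sees the exit code. Treating every nhc failure as CRITICAL is a
simplification; a site could map selected checks to WARNING instead.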
> 2) Any failed check will result in disabling all queues for that
> node.
Without running the nhc checks twice on each node, what you need to
look at is a centralized load sensor which hooks into your monitoring
service.
> NHC is working, and the configuration files we have in place are
> running some basic checks that reflect unhealthy states, but the
> integration and automation of the service with SoGE is where we’re
> left scratching our heads a little. Sourcing NHC with certain
> environmental variables set and running individual NHC functions
> with a script upstream is one option we’ve explored.
> There are some other things we’ve noticed that strike us as a little
> strange. For example, running NHC seems to require some STDIN when
> SoGE is detected as the resource manager. There are certain
> environmental variables set by NHC, as well, like TIMEOUT. If SoGE
> is detected, then NHC will set this to 0 regardless of what has been
> specified without the timeout flag. This breaks NHC functions like
> check_cmd_output.
This looks like the more recent integration of grid engine support
into nhc. I have not used this more recent version and suspect it may
conflict with integrating nhc into other monitoring systems.
> Hopefully this wasn’t too NHC specific of a post and thanks in
> advance.
I think you may have two distinct problems you need to address (which
are different from our current usage):
- Get information from nhc into your monitoring system. You probably
want to ignore grid engine support for this and look at other ways of
running nhc. You might have Nagios run nhc every 5 minutes and
process any output. I'm not sure how well the nhc output would
integrate into a Nagios style monitor. Lots of hand waving here.
- Get information from your monitoring system into grid engine. You
want to look at grid engine load sensors and might get some ideas from
my other emails. My implementation ran the load sensor (and thus nhc)
on each exec node, but if your monitoring system has already
aggregated the information, you may want to run a central load sensor
(on your qmaster system) which updates load (health) information for
all nodes at one time. Some hand waving, but I think it is not hard
once understood.
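The central load sensor idea can be sketched in Python as follows,
assuming the monitoring system can export per-node state to a JSON file.
The file path and its layout ({"node01": {"healthy": true, "diagnosis":
"ok"}}) are hypothetical, as are the variable names HEALTHY and
DIAGNOSIS being reported as 1/0 and a filtered string. Only the reply
format (a begin/end block of host:name:value lines, one request per
stdin line, exit on "quit") comes from the load sensor protocol.

```python
import json
import re
import sys

STATUS_FILE = "/var/lib/monitoring/node_health.json"  # hypothetical path


def load_status(path=STATUS_FILE):
    """Read the per-node health export written by the monitoring system."""
    with open(path) as fh:
        return json.load(fh)


def format_cluster_report(status):
    """Build one load report block covering every node in `status`."""
    lines = ["begin"]
    for host in sorted(status):
        entry = status[host]
        healthy = 1 if entry.get("healthy") else 0
        # Conservative filtering (an assumption) to keep the value free
        # of characters used by the report syntax.
        diagnosis = re.sub(r"[^A-Za-z0-9_.,/=-]+", "_",
                           str(entry.get("diagnosis", ""))) or "none"
        lines.append("%s:HEALTHY:%d" % (host, healthy))
        lines.append("%s:DIAGNOSIS:%s" % (host, diagnosis))
    lines.append("end")
    return "\n".join(lines) + "\n"


def serve(stdin=sys.stdin, stdout=sys.stdout, fetch=load_status):
    """Answer each request line until 'quit' or end of input."""
    while True:
        line = stdin.readline()
        if not line or line.strip() == "quit":
            break
        try:
            status = fetch()
        except (OSError, ValueError):
            status = {}  # report nothing rather than guess at node state
        stdout.write(format_cluster_report(status))
        stdout.flush()
```

Reporting nothing when the export is unreadable is one possible choice;
a site that prefers to fail closed could mark every known node as
unhealthy in that case instead.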
I think what you want to do is doable. It mostly involves
understanding several different concepts.
Stuart
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone