On 02/27/2016 03:56 PM, Devin Reade wrote:
> Right now in a test cluster on CentOS 7 I'm occasionally seeing
> resource monitoring failures and, just today, a failure to start
> a fencing agent. While I need to track those down problems, the
> issue I want to discuss here is being notified when there is a
> problem with the cluster, where there is not a nagios-type monitoring
> system in place.
>
> On an older CentOS 5 cluster I have a cron job that periodically runs
> 'crm_verify -LV'. If the return code is non-zero, the output of
> that command (and some other info) is mailed to the operator. That
> mechanism has been working well for years.
>
> However on CentOS 7, when the cluster gets into this state 'crm_verify -LV'
> returns zero, and its output claims there is no problem. However in
> 'crm_mon -f' I can see that I've got resource failures and nonzero
> failcounts.
>
> I tried 'pcs cluster status', however when the cluster is properly
> working (no failures), that command still has a return code of '1',
> probably because I get the 'Error: no nodes found in corosync.conf'
> which is an ignorable condition per
> <https://access.redhat.com/solutions/663283>.
>
> Is there a command that I can run from cron in the current cluster
> tools to tell me the simple answer of whether there is *anything*
> failed in the cluster, preferably based on its return code?

I'm not sure about the CentOS 5 days, but at least now, crm_verify is
intended to verify the syntax of a cluster's configuration rather than
its status.

The simplest method is "crm_mon -s", which gives one line of
nagios-compatible output with return code 0=success and 1=problem.
Note, however, that it also returns 1 when the cluster is not running,
when there is no DC, or when nodes are offline.

Back in the day, I used check_crm with nagios/icinga. It's a Perl script
that parses the output of "crm_mon -1rf" and "crm configure show". It's
trivial to use such a check outside a monitoring system, and it could be
modified to work with pcs and current crm_mon output, so maybe it could
help:

https://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details

> The CentOS 7 cluster is running:
>     corosync 2.3.4
>     pacemaker 1.1.13
>
> The CentOS 5 cluster is running:
>     corosync 1.2.7
>     pacemaker 1.0.12
>
> The corosync.conf is included below:
>
> --------- cut here and be careful of pointy scissors ---------
> totem {
>     version: 2
>     #secauth: off
>     cluster_name: somecluster
>     #transport: udpu
>     rrp_mode: passive
>     crypto_hash: sha256
>     clear_node_high_bit: yes
>
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.1.0
>         mcastaddr: 239.192.0.5
>         mcastport: 5406
>     }
>     interface {
>         ringnumber: 1
>         bindnetaddr: 192.168.2.0
>         mcastaddr: 239.192.0.6
>         mcastport: 5408
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
>     expected_votes: 2
> }
>
> logging {
>     to_syslog: yes
> }
>
> --------- cut here and be careful of pointy scissors ---------
>
> Devin
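
To make the "crm_mon -s" suggestion from this reply concrete, here is a rough sketch of a cron-friendly wrapper. It is only an illustration under some assumptions: Python 3.5 or later is available on the node (CentOS 7 does not ship it by default), any non-zero exit status from "crm_mon -s" is treated as a problem, and the function names check_cluster and main are made up for this example. It prints nothing when the cluster looks healthy, so cron's usual behaviour of mailing job output only produces mail when something is wrong.

```python
import subprocess
import sys


def check_cluster(cmd=("crm_mon", "-s")):
    """Run cmd and return (ok, output).

    ok is True only when the command exits with status 0.
    The default command is the one-line status check from the reply;
    treating every non-zero status as a failure is an assumption.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # The command is missing or not executable: report that as a problem.
        return False, "could not run {}: {}".format(cmd[0], exc)
    return result.returncode == 0, result.stdout.strip()


def main():
    """Print the status line and return 1 only when the check fails."""
    ok, output = check_cluster()
    if not ok:
        # Silence on success means cron has nothing to mail.
        print(output)
        return 1
    return 0
```

Adding a final line "sys.exit(main())" turns this into a standalone script that can be called from a crontab entry. Keep in mind that it will also report a failure when the cluster is not running on the node, when there is no DC, or when nodes are offline, as noted above.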