> In this case, the agent is returning "master (failed)", which does not
> mean that it previously failed when it was master -- it means it is
> currently running as master, in a failed condition.

Well, it surely is NOT running. So the likely problem is the way it's doing this check? I see a lot of people here using PAF - I'd be surprised if such a bug weren't discovered already...

> Stopping an already stopped service does not return an error -- here,
> the agent is saying it was unable to demote or stop a running instance.

I still don't understand. There are *NO* postgres processes running on the node and no additions to the log file. Nothing whatsoever supports the notion that it's a running instance.

> Unfortunately clustering has some inherent complexity that gives it a
> steep learning curve. On top of that, logging/troubleshooting
> improvements are definitely an area of ongoing need in pacemaker. The
> good news is that once a cluster is running successfully, it's usually
> smooth sailing after that.

I hope so... I just don't see what I'm doing that's outside of the standard box. I've set up PAF following its instructions. I see that others here are using it. Hasn't anybody else gotten such a setup working already? I would think this is a pretty standard failure case that anybody would test if they've set up a cluster...

In any case, I'll keep persisting as long as I can...on to debugging...

> You can debug like this:
>
> 1. Unmanage the resource in pacemaker, so you can mess with it
> manually.
>
> 2. Cause the desired failure for testing. Pacemaker should detect the
> failure, but not do anything about it.

I executed `pcs resource unmanage postgresql-ha`, and then powered off the master node. The fencing kicked in and restarted the node. After the node rebooted, I issued a `pcs cluster start` on it, because the crm_resource command complained about the CIB without it.

I then ended up seeing this:

------
 vfencing (stonith:external/vcenter): Started d-gp2-dbpg0-1
 postgresql-master-vip (ocf::heartbeat:IPaddr2): Started d-gp2-dbpg0-2
 Master/Slave Set: postgresql-ha [postgresql-10-main] (unmanaged)
     postgresql-10-main (ocf::heartbeat:pgsqlms): Started d-gp2-dbpg0-3 (unmanaged)
     postgresql-10-main (ocf::heartbeat:pgsqlms): Slave d-gp2-dbpg0-1 (unmanaged)
     postgresql-10-main (ocf::heartbeat:pgsqlms): FAILED Master d-gp2-dbpg0-2 (unmanaged)

Failed Actions:
* postgresql-10-main_monitor_0 on d-gp2-dbpg0-2 'master (failed)' (9): call=14, status=complete,
    exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
    last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=190ms
* postgresql-10-main_monitor_15000 on d-gp2-dbpg0-2 'master (failed)' (9): call=16, status=complete,
    exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
    last-rc-change='Wed May 30 22:18:16 2018', queued=0ms, exec=138ms
------

> 3. Run crm_resource with the -VV option and --force-* with whatever
> action you want to attempt (in this case, demote or stop). The -VV (aka
> --verbose --verbose) will turn on OCF_TRACE_RA. The --force-* command
> will read the resource configuration and do the same thing pacemaker
> would do to execute the command.

I thought that I would want to see what the "check" is actually doing, since you're telling me that it thinks the service is running when it's definitely not.

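Since the exit reason in the status output above mentions controldata, one way to compare "what the control file says" with "what is actually running" by hand is a small sketch like the one below. The helper names `controldata_state` and `describe_state` are made up for illustration, and the binary and data directory paths in the comments are assumptions (Debian-style defaults for a 10/main cluster), not something confirmed for these nodes. After a power-off the control file is not rewritten, so it would normally still report "in production" even with no postgres process alive.

```shell
#!/bin/sh
# Hypothetical helpers: read pg_controldata output on stdin and
# report the recorded cluster state. Names are illustrative only.

# Print only the value of the "Database cluster state" line.
controldata_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Turn the recorded state into a short human-readable hint.
describe_state() {
    case "$1" in
        "in production")         echo "control file says primary was running (no clean shutdown recorded)" ;;
        "shut down")             echo "primary was cleanly shut down" ;;
        "shut down in recovery") echo "standby was cleanly shut down" ;;
        "in archive recovery")   echo "control file says standby was running" ;;
        *)                       echo "other state: $1" ;;
    esac
}

# Usage on a node (paths are assumptions, adjust to the local install):
#   state=$(/usr/lib/postgresql/10/bin/pg_controldata /var/lib/postgresql/10/main | controldata_state)
#   describe_state "$state"
#   pgrep -a postgres || echo "no postgres processes found"
```

If the recorded state is "in production" while `pgrep` finds nothing, that would at least be consistent with the "controldata indicates a running primary instance" message without any process actually running.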
I tried the following command which didn't work (am I doing something wrong?):

------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-check
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
Operation monitor for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
 > stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
 > stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------

Attempting to force-demote didn't work either:

------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
resource postgresql-ha is running on: d-gp2-dbpg0-3
resource postgresql-ha is running on: d-gp2-dbpg0-1
resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
It is not safe to demote postgresql-ha here: the cluster claims it is already active
Try setting target-role=stopped first or specifying --force

root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-demote --force
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
resource postgresql-ha is running on: d-gp2-dbpg0-3
resource postgresql-ha is running on: d-gp2-dbpg0-1
resource postgresql-ha is running on: d-gp2-dbpg0-2 Master
Operation demote for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
 > stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
 > stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------

Neither did force-stop:

------
root@d-gp2-dbpg0-2:/var# crm_resource -r postgresql-ha -VV --force-stop
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
 warning: unpack_rsc_op_failure: Processing failed op monitor for postgresql-10-main:2 on d-gp2-dbpg0-2: master (failed) (9)
Operation stop for postgresql-10-main:0 (ocf:heartbeat:pgsqlms) returned 5
 > stderr: Use of uninitialized value $OCF_Functions::ARG[0] in pattern match (m//) at /usr/lib/ocf/resource.d/heartbeat/../../lib/heartbeat/OCF_Functions.pm line 392.
 > stderr: ocf-exit-reason:PAF v2.2.0 is compatible with Pacemaker 1.1.13 and greater
Error performing operation: Input/output error
------

Thanks,
--
Casey