I am able to reproduce a similar issue with the following bundle:
https://paste.ubuntu.com/p/VJ3m7nMN79/

Resource created with:

sudo pcs resource create test2 ocf:pacemaker:Dummy op_sleep=10 op monitor interval=30s timeout=30s op start timeout=30s op stop timeout=30s
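To double-check that op_sleep and the operation timeouts were applied, something like the following should show the resource configuration (assuming the same pcs 0.9 syntax used above):

juju ssh nova-cloud-controller/2 "sudo pcs resource show test2"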

juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-10.cloud.sts"
juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-11.cloud.sts"
juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-12.cloud.sts"
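The resulting location constraints can then be listed to verify they are in place, for example:

juju ssh nova-cloud-controller/2 "sudo pcs constraint location show"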


Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-10.cloud.sts juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]

Full list of resources:

 Resource Group: grp_nova_vips
     res_nova_bf9661e_vip (ocf::heartbeat:IPaddr2): Started juju-acda3d-pacemaker-remote-7
 Clone Set: cl_nova_haproxy [res_nova_haproxy]
     Started: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
 juju-acda3d-pacemaker-remote-10.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
 juju-acda3d-pacemaker-remote-12.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
 juju-acda3d-pacemaker-remote-11.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-7
 test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-10.cloud.sts

## After running the following commands on juju-acda3d-pacemaker-remote-10.cloud.sts

1) sudo systemctl stop pacemaker_remote
2) forcefully shut down the instance (openstack server stop xxxx) less than 10 seconds after the pacemaker_remote stop is issued.
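A rough sketch of that sequence (the instance name is a placeholder; the ~10 second window exists because the Dummy stop runs with op_sleep=10):

# on juju-acda3d-pacemaker-remote-10.cloud.sts
sudo systemctl stop pacemaker_remote

# from a machine with OpenStack credentials, before the stop above finishes
# (i.e. within ~10 seconds, while the Dummy stop is still sleeping)
openstack server stop <remote-10-instance>   # placeholder for the actual instance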

Remote is shut down:

RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]

The resource status remains Stopped across the 3 machines and doesn't recover.

$ juju run --application nova-cloud-controller "sudo pcs resource show | grep -i test2"
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
UnitId: nova-cloud-controller/0
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
UnitId: nova-cloud-controller/1
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
UnitId: nova-cloud-controller/2
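To check whether the failure is still being held against the resource (which would keep it from being rescheduled), the fail counts can be queried from any cluster node, for example:

juju ssh nova-cloud-controller/0 "sudo crm_mon -1rf"
juju ssh nova-cloud-controller/0 "sudo pcs resource failcount show test2"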

However, if I do a clean shutdown (without interrupting the pacemaker_remote shutdown), the resource ends up migrating correctly to another node.

6 nodes configured
9 resources configured

Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]
RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]

Full list of resources:

[...]
test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-12.cloud.sts
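For reference, the clean-shutdown variant is presumably the same two steps without the race, i.e. letting pacemaker_remote finish stopping (and the resource migrate) before powering the instance off:

# on juju-acda3d-pacemaker-remote-10.cloud.sts
sudo systemctl stop pacemaker_remote

# wait until pcs status shows the remote node offline and test2 started elsewhere,
# then power off the instance
openstack server stop <remote-10-instance>   # placeholder for the actual instance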

I will keep investigating this behavior to determine whether it is linked to the reported bug.

https://bugs.launchpad.net/bugs/1890491

Title:
  A pacemaker node fails monitor (probe) and stop/start operations on a
  resource because it returns "rc=189"

Status in pacemaker package in Ubuntu:
  Fix Released
Status in pacemaker source package in Bionic:
  New
Status in pacemaker source package in Focal:
  Fix Released
Status in pacemaker source package in Groovy:
  Fix Released

Bug description:
  Cause: Pacemaker implicitly ordered all stops needed on a Pacemaker
  Remote node before the stop of the node's Pacemaker Remote connection,
  including stops that were implied by fencing of the node. Also,
  Pacemaker scheduled actions on Pacemaker Remote nodes with a failed
  connection so that the actions could be done once the connection is
  recovered, even if the connection wasn't being recovered (for example,
  if the node was shutting down when the failure occurred).

  Consequence: If a Pacemaker Remote node needed to be fenced while it
  was in the process of shutting down, once the fencing completed
  pacemaker scheduled probes on the node. The probes fail because the
  connection is not actually active. Due to the failed probe, a stop is
  scheduled which also fails, leading to fencing of the node again, and
  the situation repeats itself indefinitely.

  Fix: Pacemaker Remote connection stops are no longer ordered after
  implied stops, and actions are not scheduled on Pacemaker Remote nodes
  when the connection is failed and not being started again.

  Result: A Pacemaker Remote node that needs to be fenced while it is in
  the process of shutting down is fenced once, without repeating
  indefinitely.

  The fix seems to be included in pacemaker-1.1.21-1.el7.

  Related to https://bugzilla.redhat.com/show_bug.cgi?id=1704870
