On Mon, 2021-01-18 at 14:10 -0500, Digimer wrote: > On 2021-01-18 1:52 p.m., Ken Gaillot wrote: > > On Sun, 2021-01-17 at 21:11 -0500, Digimer wrote: > > > Hi all, > > > > > > Mind the slew of questions, well into testing now and finding > > > lots > > > of > > > issues. This one is two questions... :) > > > > > > I set a server to be unamaged in pacemaker while the server was > > > running. Then I tried to remove the resource, and it refused > > > saying > > > it > > > couldn't stop it, and to use '--force'. So I did, and the node > > > got > > > fenced. Now, the resource was setup with; > > > > > > pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" > > > \ > > > meta allow-migrate="true" target-role="started" \ > > > op monitor interval="60" start timeout="INFINITY" \ > > > on-fail="block" stop timeout="INFINITY" on-fail="block" \ > > > migrate_to timeout="INFINITY" > > > > > > I would have expected the 'stop timeout="INFINITY" on- > > > fail="block"' > > > to > > > prevent fencing if the server failed to stop (question 1) and > > > that if > > > a > > > resource was unmanaged, that the resource wouldn't even try to > > > stop > > > (question 2). > > > > It would have ... if you hadn't removed it :) > > Ahahaha! OK, I see now why it happened. I should have realized this. > :) > > > Removing a resource means removing its configuration entry. > > Pacemaker > > memorizes the resource agent parameters used to start a resource, > > so it > > can still execute a stop when the configuration is removed. However > > currently it does not memorize any resource meta-attributes or > > operations that were configured, so those are lost when the > > resource is > > removed. > > > > We maybe should remember more information about the resource than > > just > > its agent parameters, but it should probably be selective rather > > than a > > broad net. > > > > FYI, the stop-orphan-resources property controls whether a resource > > whose configuration is removed should be stopped (defaulting to > > true). > > Oh! That's excellent. So if I know I want to remove a resource from > pacemaker, but keep it running, I can update the resource's config to > set that, then delete the resource and the service will still keep > running?
Correct, though it's a cluster-wide property not a resource property > Do you know off hand what the pcs command would be to set that? No > worries if not, I can sort it out easily enough I am sure. pcs property set > > digimer > > > > Can someone help me understand what happened here? > > > > > > digimer > > > > > > More below; > > > > > > ==== > > > [root@el8-a01n01 ~]# pcs resource remove srv01-test > > > Attempting to stop: srv01-test... Warning: 'srv01-test' is > > > unmanaged > > > Error: Unable to stop: srv01-test before deleting (re-run with -- > > > force > > > to force deletion) > > > [root@el8-a01n01 ~]# pcs resource remove srv01-test --force > > > Deleting Resource - srv01-test > > > > Note that the above is all within pcs, which tries to explicitly > > stop > > the resource before removing it. That gives pcs more control over > > the > > process and would ensure that the meta-attributes and operations > > were > > still present for the stop. But as it warns, it can't do that if > > the > > resource is unmanaged, so then it's left to Pacemaker to stop the > > resource after the configuration is gone. > > > > BTW stop-orphan-resources takes precedence over whether the > > resource is > > managed, so Pacemaker does try to stop it even though the node is > > still > > unmanaged. > > > > > [root@el8-a01n01 ~]# client_loop: send disconnect: Broken pipe > > > ==== > > > > > > As you can see, the node was fenced. The logs on that node > > > were; > > > > > > ==== > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > execd[1872]: warning: > > > srv01-test_stop_0 process (PID 113779) timed out > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > execd[1872]: warning: > > > srv01-test_stop_0[113779] timed out after 20000ms > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > controld[1875]: error: > > > Result of stop operation for srv01-test on el8-a01n01: Timed Out > > > > A stop timeout causes fencing. I'm guessing what happened in this > > case > > is that the stop takes longer than 20s, which is the default > > operation > > timeout, which has to be used because the original operation > > configuration was removed. > > > > The main problem here is that removing a resource from the > > configuration means that its stop timeout is lost, so if it's > > active > > when it's removed, the stop might time out. pcs's approach of > > stopping > > it first (by setting target-role to Stopped) makes sense, but > > having > > the node unmanaged blocks that. It's a catch-22 for the cluster > > that > > you'd have to resolve by changing one of those conditions. > > > > For example you could unmanage the other resources on the node and > > re- > > manage the node, before stopping the desired resource then removing > > it. > > Or, you could temporarily set a long default timeout for all > > operations > > (i.e. in operation defaults) so that gets used instead of 20s. Or, > > you > > could temporarily set stop-orphan-resources to false, remove the > > resource, then manually stop the instance. > > > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > controld[1875]: notice: > > > el8-a01n01-srv01-test_stop_0:37 [ The server: [srv01-test] is > > > indeed > > > running. It will be shut down now.\n ] > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > attrd[1873]: notice: > > > Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> > > > INFINITY > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > attrd[1873]: notice: > > > Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> > > > 1610935435 > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > attrd[1873]: notice: > > > Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> > > > (unset) > > > Jan 18 02:03:55 el8-a01n01.alteeve.ca pacemaker- > > > attrd[1873]: notice: > > > Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> > > > (unset) > > > client_loop: send disconnect: Broken pipe > > > ==== > > > > > > On the peer node, the logs showed; > > > > > > ==== > > > Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: State transition S_IDLE -> S_POLICY_ENGINE > > > Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Calculated transition 58, saving inputs in > > > /var/lib/pacemaker/pengine/pe-input-100.bz2 > > > Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 58 (Complete=0, Pending=0, Fired=0, Skipped=0, > > > Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input- > > > 100.bz2): > > > Complete > > > Jan 18 02:03:13 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: State transition S_TRANSITION_ENGINE -> S_IDLE > > > Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: State transition S_IDLE -> S_POLICY_ENGINE > > > Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Calculated transition 59, saving inputs in > > > /var/lib/pacemaker/pengine/pe-input-101.bz2 > > > Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 59 (Complete=0, Pending=0, Fired=0, Skipped=0, > > > Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input- > > > 101.bz2): > > > Complete > > > Jan 18 02:03:18 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: State transition S_TRANSITION_ENGINE -> S_IDLE > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: State transition S_IDLE -> S_POLICY_ENGINE > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Detected active orphan srv01-test running on el8-a01n01 > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Clearing failure of srv01-test on el8-a01n02 because > > > resource > > > parameters have changed > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Removing srv01-test from el8-a01n01 > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Removing srv01-test from el8-a01n02 > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: * Stop srv01-test ( el8- > > > a01n01 > > > ) due to node availability > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Calculated transition 60, saving inputs in > > > /var/lib/pacemaker/pengine/pe-input-102.bz2 > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Initiating stop operation srv01-test_stop_0 on el8-a01n01 > > > Jan 18 02:03:35 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 60 aborted by deletion of > > > lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation > > > removal > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 60 action 11 (srv01-test_stop_0 on el8- > > > a01n01): > > > expected 'ok' but got 'error' > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 60 (Complete=2, Pending=0, Fired=0, Skipped=0, > > > Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input- > > > 102.bz2): > > > Complete > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > attrd[490048]: notice: > > > Setting fail-count-srv01-test#stop_0[el8-a01n01]: (unset) -> > > > INFINITY > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > attrd[490048]: notice: > > > Setting last-failure-srv01-test#stop_0[el8-a01n01]: (unset) -> > > > 1610935435 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Unexpected result (error) was recorded for stop of > > > srv01- > > > test > > > on el8-a01n01 at Jan 18 02:03:35 2021 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Unexpected result (error) was recorded for stop of > > > srv01- > > > test > > > on el8-a01n01 at Jan 18 02:03:35 2021 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Cluster node el8-a01n01 will be fenced: srv01-test > > > failed > > > there > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Detected active orphan srv01-test running on el8-a01n01 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Scheduling Node el8-a01n01 for STONITH > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Stop of failed resource srv01-test is implicit after el8- > > > a01n01 > > > is fenced > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: * Fence (reboot) el8-a01n01 'srv01-test failed there' > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: * Move virsh_node2_pulsar ( el8-a01n01 -> el8- > > > a01n02 ) > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: * Stop srv01-test ( el8- > > > a01n01 > > > ) due to node availability > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Calculated transition 61 (with warnings), saving inputs > > > in > > > /var/lib/pacemaker/pengine/pe-warn-1.bz2 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Unexpected result (error) was recorded for stop of > > > srv01- > > > test > > > on el8-a01n01 at Jan 18 02:03:35 2021 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Unexpected result (error) was recorded for stop of > > > srv01- > > > test > > > on el8-a01n01 at Jan 18 02:03:35 2021 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Cluster node el8-a01n01 will be fenced: srv01-test > > > failed > > > there > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Detected active orphan srv01-test running on el8-a01n01 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Forcing srv01-test away from el8-a01n01 after 1000000 > > > failures > > > (max=1000000) > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Clearing failure of srv01-test on el8-a01n01 because it > > > is > > > orphaned > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Scheduling Node el8-a01n01 for STONITH > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Stop of failed resource srv01-test is implicit after el8- > > > a01n01 > > > is fenced > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: * Fence (reboot) el8-a01n01 'srv01-test failed there' > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: * Move virsh_node2_pulsar ( el8-a01n01 -> el8- > > > a01n02 ) > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: * Stop srv01-test ( el8- > > > a01n01 > > > ) due to node availability > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > warning: Calculated transition 62 (with warnings), saving inputs > > > in > > > /var/lib/pacemaker/pengine/pe-warn-2.bz2 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Requesting fencing (reboot) of node el8-a01n01 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Initiating start operation virsh_node2_pulsar_start_0 > > > locally > > > on > > > el8-a01n02 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Client pacemaker-controld.490050.72911c98 wants to fence (reboot) > > > 'el8-a01n01' with device '(any)' > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Requesting peer fencing (reboot) targeting el8-a01n01 > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > attrd[490048]: notice: > > > Setting fail-count-srv01-test#stop_0[el8-a01n01]: INFINITY -> > > > (unset) > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > attrd[490048]: notice: > > > Setting last-failure-srv01-test#stop_0[el8-a01n01]: 1610935435 -> > > > (unset) > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > virsh_node2_pulsar is not eligible to fence (reboot) el8-a01n01: > > > static-list > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > virsh_node1_pulsar is eligible to fence (reboot) el8-a01n01: > > > static- > > > list > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 62 aborted by deletion of > > > lrm_rsc_op[@id='srv01-test_last_failure_0']: Resource operation > > > removal > > > Jan 18 02:03:55 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Requesting that el8-a01n02 perform 'reboot' action targeting el8- > > > a01n01 > > > using 'virsh_node1_pulsar' > > > Jan 18 02:03:56 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Result of start operation for virsh_node2_pulsar on el8- > > > a01n02: ok > > > Jan 18 02:03:57 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Operation 'reboot' [646769] (call 4 from pacemaker- > > > controld.490050) > > > for > > > host 'el8-a01n01' with device 'virsh_node1_pulsar' returned: 0 > > > (OK) > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > attrd[490048]: notice: > > > Node el8-a01n01 state is now lost > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > attrd[490048]: notice: > > > Removing all el8-a01n01 attributes for peer loss > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Node el8-a01n01 state is now lost > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > based[490045]: notice: > > > Node el8-a01n01 state is now lost > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > based[490045]: notice: > > > Purged 1 peer with id=1 and/or uname=el8-a01n01 from the > > > membership > > > cache > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Node el8-a01n01 state is now lost > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Purged 1 peer with id=1 and/or uname=el8-a01n01 from the > > > membership > > > cache > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > attrd[490048]: notice: > > > Purged 1 peer with id=1 and/or uname=el8-a01n01 from the > > > membership > > > cache > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Action 'reboot' targeting el8-a01n01 using virsh_node1_pulsar on > > > behalf > > > of pacemaker-controld.490050@el8-a01n02: OK > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker- > > > fenced[490046]: notice: > > > Operation 'reboot' targeting el8-a01n01 on el8-a01n02 for > > > pacemaker-controld.490050@el8-a01n02.8ff64dd6: OK > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Stonith operation 4/2:62:0:e827eea0-dedc-4200-a207- > > > c4095621b3c6: > > > OK (0) > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Peer el8-a01n01 was terminated (reboot) by el8-a01n02 on > > > behalf > > > of pacemaker-controld.490050: OK > > > Jan 18 02:03:58 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 62 (Complete=5, Pending=0, Fired=0, Skipped=1, > > > Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): > > > Stopped > > > Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Removing srv01-test from el8-a01n02 > > > Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Calculated transition 63, saving inputs in > > > /var/lib/pacemaker/pengine/pe-input-103.bz2 > > > Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Initiating monitor operation > > > virsh_node2_pulsar_monitor_60000 > > > locally on el8-a01n02 > > > Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Initiating delete operation srv01-test_delete_0 locally > > > on > > > el8-a01n02 > > > Jan 18 02:03:59 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 63 aborted by deletion of > > > lrm_resource[@id='srv01-test']: Resource state removal > > > Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Result of monitor operation for virsh_node2_pulsar on > > > el8- > > > a01n02: ok > > > Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 63 (Complete=2, Pending=0, Fired=0, Skipped=0, > > > Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input- > > > 103.bz2): > > > Complete > > > Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker- > > > schedulerd[490049]: > > > notice: Calculated transition 64, saving inputs in > > > /var/lib/pacemaker/pengine/pe-input-104.bz2 > > > Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, > > > Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input- > > > 104.bz2): > > > Complete > > > Jan 18 02:04:00 el8-a01n02.alteeve.ca pacemaker-controld[490050]: > > > notice: State transition S_TRANSITION_ENGINE -> S_IDLE > > > ==== > > > > > -- Ken Gaillot <kgail...@redhat.com> _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/