Re: [ClusterLabs] Antw: [EXT] Re: Noticed oddity when DC is going to be fenced
On Wed, 2022-03-02 at 08:41 +0100, Ulrich Windl wrote:
> >>> Ken Gaillot schrieb am 01.03.2022 um 16:04 in Nachricht
> <463458e414f7c411eb1107335be6ee9a6e2d13ee.ca...@redhat.com>:
> > On Tue, 2022-03-01 at 10:05 +0100, Ulrich Windl wrote:
> > > Hi!
> > >
> > > For current SLES15 SP3 I noticed an oddity when the node running
> > > the DC is going to be fenced:
> > > It seems that another node is performing recovery operations
> > > while the old DC is not confirmed to be fenced.
> > >
> > > Like this (116 is the DC):
> > > Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership
> > > (172.20.16.18:45612) was formed. Members left: 116
> > >
> > > Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service
> > > synchronization, ready to provide service.
> > > Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer
> > > on the DC (h16) is dead
> > > Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State
> > > transition S_NOT_DC -> S_ELECTION
> >
> > At this point, h16 loses its DC status, so there is no DC
>
> I agree from the perspective of h18, but does the DC (h16) know, too?
>
> > > Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116
> > > pid 16307 nodedown time 1646094833 fence_all dlm_stonith
> > > Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State
> > > transition S_ELECTION -> S_INTEGRATION
> >
> > At this point, a new DC election has completed, and h18 is now the
> > DC (as indicated by the scheduler messages later)
>
> As above: Does h16 know at this point?

It doesn't matter, because the first thing the new DC will do is fence
it. That's the beauty of fencing :)

> > > Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1
> > > entries for 116/(null): 0 in progress, 0 completed
> > > Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client
> > > stonith-api.16307.4961743f wants to fence (reboot) '116' with
> > > device '(any)'
> > > Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting
> > > peer fencing (reboot) targeting h16
> > >
> > > Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster
> > > node h16 will be fenced: peer is no longer part of the cluster
> > > Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node
> > > h16 is unclean
> > >
> > > (so far, so good)
> > > Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning:
> > > Scheduling Node h16 for STONITH
> > > Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence
> > > (reboot) h16 'peer is no longer part of the cluster'
> > >
> > > Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating
> > > monitor operation prm_stonith_sbd_monitor_60 locally on h18
> > > Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting
> > > local execution of monitor operation for prm_stonith_sbd on h18
> > > Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating
> > > stop operation prm_cron_snap_v17_stop_0 on h19
> > > (isn't h18 playing DC already while h16 isn't fenced yet?)
> > >
> > > Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18
> > > did not send monitor result (via controller) within 9ms (action
> > > timeout plus cluster-delay)
> > > Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]:
> > > In-flight resource op prm_stonith_sbd_monitor_60 on h18
> > > (priority: 9900, waiting: (null))
> > > Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition
> > > 0 aborted: Action lost
> > > Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26:
> > > prm_stonith_sbd_monitor_60 on h18 timed out
> > > (whatever that means)
> > >
> > > (now the fencing confirmation follows)
> > > Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
> > > 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16'
> > > with device 'prm_stonith_sbd' returned: 0 (OK)
> > > Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
> > > 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb:
> > > OK
> > > Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node
> > > 116/(null) kicked: reboot
> > > Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
> > > 'reboot' targeting h16 on rksaph18 for
> > > pacemaker-controld.6980@h18.8ce2f33f (merged): OK
> > > Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16
> > > was terminated (reboot) by h18 on behalf of stonith-api.16307: OK
> > > Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith
> > > operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
> > > Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16
> > > was terminated (reboot) by h18 on behalf of
> > > pacemaker-controld.6980: OK
> > >
> > > (actual recovery happens)
> > > Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node
> > > 116, slot 0) on device (9,10)
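The "action lost" error quoted above says the monitor result did not arrive within the action timeout plus cluster-delay (the digits in the quoted "9ms" appear truncated in the archive). A minimal sketch of that arithmetic, with hypothetical example values rather than the ones from this cluster:

```python
# Sketch of the deadline behind the "action lost" log message: the
# controller waits for an in-flight operation's result for the op's own
# timeout plus the cluster-delay property, then declares it lost.
# The values below are illustrative, not taken from the logs above.

def action_lost_deadline_ms(op_timeout_ms: int, cluster_delay_ms: int) -> int:
    """Time after which an in-flight op with no result is declared lost."""
    return op_timeout_ms + cluster_delay_ms

# e.g. a 20s monitor timeout with a 60s cluster-delay (60s is the
# common default in recent Pacemaker releases)
print(action_lost_deadline_ms(20_000, 60_000))  # 80000
```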
[ClusterLabs] Antw: [EXT] Re: Noticed oddity when DC is going to be fenced
>>> Ken Gaillot schrieb am 01.03.2022 um 16:04 in Nachricht
<463458e414f7c411eb1107335be6ee9a6e2d13ee.ca...@redhat.com>:
> On Tue, 2022-03-01 at 10:05 +0100, Ulrich Windl wrote:
>> Hi!
>>
>> For current SLES15 SP3 I noticed an oddity when the node running the
>> DC is going to be fenced:
>> It seems that another node is performing recovery operations while
>> the old DC is not confirmed to be fenced.
>>
>> Like this (116 is the DC):
>> Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership
>> (172.20.16.18:45612) was formed. Members left: 116
>>
>> Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on
>> the DC (h16) is dead
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State
>> transition S_NOT_DC -> S_ELECTION
>
> At this point, h16 loses its DC status, so there is no DC

I agree from the perspective of h18, but does the DC (h16) know, too?

>>
>> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid
>> 16307 nodedown time 1646094833 fence_all dlm_stonith
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State
>> transition S_ELECTION -> S_INTEGRATION
>
> At this point, a new DC election has completed, and h18 is now the DC
> (as indicated by the scheduler messages later)

As above: Does h16 know at this point?

>> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1
>> entries for 116/(null): 0 in progress, 0 completed
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client stonith-
>> api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer
>> fencing (reboot) targeting h16
>>
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster
>> node h16 will be fenced: peer is no longer part of the cluster
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is
>> unclean
>>
>> (so far, so good)
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling
>> Node h16 for STONITH
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence
>> (reboot) h16 'peer is no longer part of the cluster'
>>
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating
>> monitor operation prm_stonith_sbd_monitor_60 locally on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting
>> local execution of monitor operation for prm_stonith_sbd on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating
>> stop operation prm_cron_snap_v17_stop_0 on h19
>> (isn't h18 playing DC already while h16 isn't fenced yet?)
>>
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did
>> not send monitor result (via controller) within 9ms (action
>> timeout plus cluster-delay)
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]:
>> In-flight resource op prm_stonith_sbd_monitor_60 on h18
>> (priority: 9900, waiting: (null))
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0
>> aborted: Action lost
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26:
>> prm_stonith_sbd_monitor_60 on h18 timed out
>> (whatever that means)
>>
>> (now the fencing confirmation follows)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
>> 'reboot' [16309] (call 2 from stonith-api.16307) for host 'h16' with
>> device 'prm_stonith_sbd' returned: 0 (OK)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
>> 'reboot' targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK
>> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node
>> 116/(null) kicked: reboot
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation
>> 'reboot' targeting h16 on rksaph18 for
>> pacemaker-controld.6980@h18.8ce2f33f (merged): OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was
>> terminated (reboot) by h18 on behalf of stonith-api.16307: OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith
>> operation 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was
>> terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
>>
>> (actual recovery happens)
>> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116,
>> slot 0) on device (9,10)
>>
>> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
>>
>> (more actions follow)
>> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated
>> transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-
>> 87.bz2
>>
>> (actions completed)
>> Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State
>> transition S_TRANSITION_ENGINE -> S_IDLE
>>
>> (pacemaker-2.0.5+20201202.ba59be712-150300.4.16.1.x86_64)
>>
>> Did I misunderstand something, or does it look like a bug?
>>
>> Regards,
>> Ulrich
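The scheduler saved its inputs as pe-input-87.bz2 above. A hedged sketch of how such a file could be replayed offline with Pacemaker's `crm_simulate` tool to see exactly which actions the scheduler planned; the snippet is guarded so it is a harmless no-op on hosts where Pacemaker is not installed:

```python
# Sketch: replay a saved scheduler input file with crm_simulate
# ("crm_simulate --simulate --xml-file <file>" shows the cluster status
# and the actions the scheduler would perform for that transition).
# The path matches the log above; adjust for your own cluster.
import shutil
import subprocess

PE_INPUT = "/var/lib/pacemaker/pengine/pe-input-87.bz2"

def replay_transition(pe_input: str) -> str:
    """Run crm_simulate on a pe-input file if the tool is available."""
    if shutil.which("crm_simulate") is None:
        return "crm_simulate not installed on this host"
    result = subprocess.run(
        ["crm_simulate", "--simulate", "--xml-file", pe_input],
        capture_output=True, text=True, check=False)
    return result.stdout

print(replay_transition(PE_INPUT))
```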
[ClusterLabs] Antw: [EXT] Re: Noticed oddity when DC is going to be fenced
>>> Klaus Wenninger schrieb am 01.03.2022 um 10:22 in Nachricht :
> On Tue, Mar 1, 2022 at 10:05 AM Ulrich Windl <
> ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> Hi!
>>
>> For current SLES15 SP3 I noticed an oddity when the node running the DC is
>> going to be fenced:
>> It seems that another node is performing recovery operations while the old
>> DC is not confirmed to be fenced.
>>
>> Like this (116 is the DC):
>> Mar 01 01:33:53 h18 corosync[6754]: [TOTEM ] A new membership (
>> 172.20.16.18:45612) was formed. Members left: 116
>>
>> Mar 01 01:33:53 h18 corosync[6754]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Our peer on the DC
>> (h16) is dead
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition
>> S_NOT_DC -> S_ELECTION
>>
>> Mar 01 01:33:53 h18 dlm_controld[8544]: 394518 fence request 116 pid 16307
>> nodedown time 1646094833 fence_all dlm_stonith
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: State transition
>> S_ELECTION -> S_INTEGRATION
>> Mar 01 01:33:53 h18 dlm_stonith[16307]: stonith_api_time: Found 1 entries
>> for 116/(null): 0 in progress, 0 completed
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Client
>> stonith-api.16307.4961743f wants to fence (reboot) '116' with device '(any)'
>> Mar 01 01:33:53 h18 pacemaker-fenced[6973]: notice: Requesting peer
>> fencing (reboot) targeting h16
>>
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Cluster node h16
>> will be fenced: peer is no longer part of the cluster
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Node h16 is
>> unclean
>>
>> (so far, so good)
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: warning: Scheduling Node
>> h16 for STONITH
>> Mar 01 01:33:53 h18 pacemaker-schedulerd[6978]: notice: * Fence (reboot)
>> h16 'peer is no longer part of the cluster'
>>
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating monitor
>> operation prm_stonith_sbd_monitor_60 locally on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Requesting local
>> execution of monitor operation for prm_stonith_sbd on h18
>> Mar 01 01:33:53 h18 pacemaker-controld[6980]: notice: Initiating stop
>> operation prm_cron_snap_v17_stop_0 on h19
>> (isn't h18 playing DC already while h16 isn't fenced yet?)
>>
>
> periodic monitors should happen autonomously.
> as long as you don't see pacemaker-schedulerd on h18 calculate a new
> transition recovering the resources
> everything should be fine.

So what about "prm_cron_snap_v17_stop_0 on h19"? That's not a monitor.

> and yes to a certain extent h18 is playing DC (it is elected to be new DC)
> - somebody has to schedule fencing.

I see the exception for fencing. Is it the quorum that allows h18 to "play
DC" while the old DC could be still active on shared storage?

Regards,
Ulrich

>
> Klaus
>
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: Node h18 did not
>> send monitor result (via controller) within 9ms (action timeout plus
>> cluster-delay)
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: error: [Action 26]:
>> In-flight resource op prm_stonith_sbd_monitor_60 on h18 (priority:
>> 9900, waiting: (null))
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: notice: Transition 0
>> aborted: Action lost
>> Mar 01 01:35:23 h18 pacemaker-controld[6980]: warning: rsc_op 26:
>> prm_stonith_sbd_monitor_60 on h18 timed out
>> (whatever that means)
>>
>> (now the fencing confirmation follows)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot'
>> [16309] (call 2 from stonith-api.16307) for host 'h16' with device
>> 'prm_stonith_sbd' returned: 0 (OK)
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot'
>> targeting h16 on h18 for stonith-api.16307@h18.36b9a9bb: OK
>> Mar 01 01:35:55 h18 stonith-api[16307]: stonith_api_kick: Node 116/(null)
>> kicked: reboot
>> Mar 01 01:35:55 h18 pacemaker-fenced[6973]: notice: Operation 'reboot'
>> targeting h16 on rksaph18 for pacemaker-controld.6980@h18.8ce2f33f
>> (merged): OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was
>> terminated (reboot) by h18 on behalf of stonith-api.16307: OK
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Stonith operation
>> 2/1:0:0:a434124e-3e35-410d-8e17-ef9ae4e4e6eb: OK (0)
>> Mar 01 01:35:55 h18 pacemaker-controld[6980]: notice: Peer h16 was
>> terminated (reboot) by h18 on behalf of pacemaker-controld.6980: OK
>>
>> (actual recovery happens)
>> Mar 01 01:35:55 h18 kernel: ocfs2: Begin replay journal (node 116, slot 0)
>> on device (9,10)
>>
>> Mar 01 01:35:55 h18 kernel: md: md10: resync done.
>>
>> (more actions follow)
>> Mar 01 01:35:56 h18 pacemaker-schedulerd[6978]: notice: Calculated
>> transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-87.bz2
>>
>> (actions completed)
>> Mar 01 01:37:18 h18 pacemaker-controld[6980]: notice: State transition
>> S_TRANSITION_ENGINE -> S_IDLE
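Klaus's point about what may run before fencing is confirmed can be illustrated with a toy model: a transition is a graph of actions, and only actions explicitly ordered *after* the fencing operation (recovery of the fenced node's resources) must wait for confirmation, while independent actions such as recurring monitors or stops on surviving nodes can proceed immediately. This is not Pacemaker's actual code, and the action names are invented for illustration:

```python
# Toy model of a Pacemaker transition graph: actions map to their
# prerequisite actions, and a topological sort gives an execution order
# that respects those orderings.  Only actions depending on the fence
# must wait for it; the rest may run concurrently with the fencing op.
from collections import deque

def runnable_order(actions: dict) -> list:
    """Return actions in an order respecting prerequisites (topo sort)."""
    indeg = {a: len(deps) for a, deps in actions.items()}
    dependents = {a: [] for a in actions}
    for a, deps in actions.items():
        for d in deps:
            dependents[d].append(a)
    queue = deque(a for a, n in indeg.items() if n == 0)
    order = []
    while queue:
        a = queue.popleft()
        order.append(a)
        for b in dependents[a]:
            indeg[b] -= 1
            if indeg[b] == 0:
                queue.append(b)
    return order

transition = {
    "fence_h16": [],
    "stop_rsc_on_h19": [],             # not ordered after the fence
    "monitor_sbd_on_h18": [],          # recurring monitor, independent
    "start_rsc_on_h18": ["fence_h16"], # recovery waits for confirmation
}

order = runnable_order(transition)
print(order)
```

In this sketch `stop_rsc_on_h19` and the monitor have no dependency on `fence_h16`, mirroring why the stop on h19 could be initiated before the fencing confirmation arrived, while any recovery (re)start of the fenced node's resources is always ordered after the fence.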