Re: [ClusterLabs] Antwort: Antw: corosync/dlm fencing?

Jan Pokorný Tue, 17 Jul 2018 12:28:37 -0700

On 16/07/18 11:44 +0200, Philipp Achmüller wrote:
> Unfortunatly it is not obvious for me - the "grep fence" is attached
> in my original message.


Sifting your logs a bit:

> -------------------
> Node: siteb-2 (DC):
> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]:   notice: Move 
> stonith-sbd#011(Started sitea-1 -> siteb-1)
> [...]
> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]:   notice: Initiating 
> stop operation stonith-sbd_stop_0 on sitea-1
> [...]
> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]:   notice: Initiating 
> start operation stonith-sbd_start_0 on siteb-1
> [...]
> 2018-06-28T09:02:38.414440+02:00 siteb-2 corosync[189245]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2018-06-28T09:02:52.080141+02:00 siteb-2 corosync[189245]:   [TOTEM ] A new 
> membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.080537+02:00 siteb-2 corosync[189245]:   [TOTEM ] Failed 
> to receive the leave message. failed: 2
> 2018-06-28T09:02:52.083415+02:00 siteb-2 attrd[189258]:   notice: Node 
> siteb-1 state is now lost
> [...]
> 2018-06-28T09:02:52.084054+02:00 siteb-2 crmd[189260]:  warning: No reason to 
> expect node 2 to be down
> [...]
> 2018-06-28T09:02:52.084409+02:00 siteb-2 corosync[189245]:   [QUORUM] 
> Members[3]: 1 3 4
> 2018-06-28T09:02:52.084492+02:00 siteb-2 corosync[189245]:   [MAIN  ] 
> Completed service synchronization, ready to provide service.
> [...]
> 2018-06-28T09:02:52.085210+02:00 siteb-2 kernel: [80872.012486] dlm: closing 
> connection to node 2
> [...]
> 2018-06-28T09:02:53.098683+02:00 siteb-2 pengine[189259]:  warning: 
> Scheduling Node siteb-1 for STONITH

> -------------------
> Node sitea-1:
> 2018-06-28T09:02:38.413748+02:00 sitea-1 corosync[6661]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2018-06-28T09:02:52.079905+02:00 sitea-1 corosync[6661]:   [TOTEM ] A new 
> membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.080306+02:00 sitea-1 corosync[6661]:   [TOTEM ] Failed to 
> receive the leave message. failed: 2
> 2018-06-28T09:02:52.082619+02:00 sitea-1 cib[9021]:   notice: Node siteb-1 
> state is now lost
> [...]
> 2018-06-28T09:02:52.083429+02:00 sitea-1 corosync[6661]:   [QUORUM] 
> Members[3]: 1 3 4
> 2018-06-28T09:02:52.083521+02:00 sitea-1 corosync[6661]:   [MAIN  ] Completed 
> service synchronization, ready to provide service.
> 2018-06-28T09:02:52.083606+02:00 sitea-1 crmd[9031]:   notice: Node siteb-1 
> state is now lost
> 2018-06-28T09:02:52.084290+02:00 sitea-1 dlm_controld[73416]: 59514 fence 
> request 2 pid 171087 nodedown time 1530169372 fence_all dlm_stonith
> 2018-06-28T09:02:52.085446+02:00 sitea-1 kernel: [59508.568940] dlm: closing 
> connection to node 2
> 2018-06-28T09:02:52.109393+02:00 sitea-1 dlm_stonith: stonith_api_time: Found 
> 0 entries for 2/(null): 0 in progress, 0 completed
> 2018-06-28T09:02:52.110167+02:00 sitea-1 stonith-ng[9022]:   notice: Client 
> stonith-api.171087.d3c59fc2 wants to fence (reboot) '2' with device '(any)'
> 2018-06-28T09:02:52.113257+02:00 sitea-1 stonith-ng[9022]:   notice: 
> Requesting peer fencing (reboot) of siteb-1
> 2018-06-28T09:03:29.096714+02:00 sitea-1 stonith-ng[9022]:   notice: 
> Operation reboot of siteb-1 by sitea-2 for 
> [email protected]: OK
> 2018-06-28T09:03:29.097152+02:00 sitea-1 stonith-api[171087]: 
> stonith_api_kick: Node 2/(null) kicked: reboot
> 2018-06-28T09:03:29.097426+02:00 sitea-1 crmd[9031]:   notice: Peer lnx0361b 
> was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK
> 2018-06-28T09:03:30.098657+02:00 sitea-1 dlm_controld[73416]: 59552 fence 
> result 2 pid 171087 result 0 exit status
> 2018-06-28T09:03:30.099730+02:00 sitea-1 dlm_controld[73416]: 59552 fence 
> status 2 receive 0 from 1 walltime 1530169410 local 59552

> -------------------
> Node sitea-2:
> 2018-06-28T09:02:38.412808+02:00 sitea-2 corosync[6570]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2018-06-28T09:02:52.078249+02:00 sitea-2 corosync[6570]:   [TOTEM ] A new 
> membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.078359+02:00 sitea-2 corosync[6570]:   [TOTEM ] Failed to 
> receive the leave message. failed: 2
> 2018-06-28T09:02:52.081949+02:00 sitea-2 cib[9655]:   notice: Node siteb-1 
> state is now lost
> [...]
> 2018-06-28T09:02:52.082653+02:00 sitea-2 corosync[6570]:   [QUORUM] 
> Members[3]: 1 3 4
> 2018-06-28T09:02:52.082739+02:00 sitea-2 corosync[6570]:   [MAIN  ] Completed 
> service synchronization, ready to provide service.
> [...]
> 2018-06-28T09:02:52.495697+02:00 sitea-2 stonith-ng[9656]:   notice: 
> stonith-sbd can fence (reboot) siteb-1: dynamic-list
> 2018-06-28T09:02:52.495902+02:00 sitea-2 stonith-ng[9656]:   notice: Delaying 
> reboot on stonith-sbd for 25358ms (timeout=300s)
> 2018-06-28T09:03:29.093957+02:00 sitea-2 stonith-ng[9656]:   notice: 
> Operation 'reboot' [231293] (call 2 from stonith-api.171087) for host 
> 'siteb-1' with device 'stonith-sbd' returned: 0 (OK)
> 2018-06-28T09:03:29.096254+02:00 sitea-2 stonith-ng[9656]:   notice: 
> Operation reboot of siteb-1 by sitea-2 for 
> [email protected]: OK
> 2018-06-28T09:03:29.096769+02:00 sitea-2 crmd[9660]:   notice: Peer siteb-1 
> was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK

> -------------------
> Node sideb-1 has no entries during this timeframe
> 
> during standby corosync should be up/running - so may the "Failed to 
> receive the leave message" will be a Problem?

True, it should normally stay running, and that's actually the problem
here and you just seem to be confusing consequence with consequence:

- first: the corosync membership got broken on siteb-1 node
  (hard to tell why, the only thing we can observe is that
   sbd monitoring was moved there from sitea-1)

- then: dlm_controld on sitea-1, which already observed that membership
  issue of siteb-1, proceed to ask pacemaker to fence siteb-1, which
  was cheerfully executed by sitea-2 node

I cannot really tell what the root cause was here, but observing PID
as high as 171087, I wonder if you have recent enough libqb (0.17.2):

https://bugzilla.redhat.com/show_bug.cgi?id=1114852

(and would perhaps fit: new sbd process/es with high enough PIDs,
exercising libqb-backed IPC -> something bad happening like
corosync getting stuck or crashing -> node kicked out...)

-- 
Nazdar,
Jan (Poki)

pgpcn76HEcers.pgp
Description: PGP signature

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] Antwort: Antw: corosync/dlm fencing?

Reply via email to