Just minor clarifications (without changing the validity) below:
On 17/07/18 21:28 +0200, Jan Pokorný wrote:
>
> On 16/07/18 11:44 +0200, Philipp Achmüller wrote:
>> Unfortunately it is not obvious to me - the "grep fence" is attached
>> in my original message.
>
> Sifting your logs a bit:
>
>> ---
>> Node: siteb-2 (DC):
>> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]: notice: Move
>> stonith-sbd#011(Started sitea-1 -> siteb-1)
>> [...]
>> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]: notice: Initiating
>> stop operation stonith-sbd_stop_0 on sitea-1
>> [...]
>> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]: notice: Initiating
>> start operation stonith-sbd_start_0 on siteb-1
>> [...]
>> 2018-06-28T09:02:38.414440+02:00 siteb-2 corosync[189245]: [TOTEM ] A
>> processor failed, forming new configuration.
>> 2018-06-28T09:02:52.080141+02:00 siteb-2 corosync[189245]: [TOTEM ] A new
>> membership (192.168.121.55:2012) was formed. Members left: 2
>> 2018-06-28T09:02:52.080537+02:00 siteb-2 corosync[189245]: [TOTEM ] Failed
>> to receive the leave message. failed: 2
>> 2018-06-28T09:02:52.083415+02:00 siteb-2 attrd[189258]: notice: Node
>> siteb-1 state is now lost
>> [...]
>> 2018-06-28T09:02:52.084054+02:00 siteb-2 crmd[189260]: warning: No reason
>> to expect node 2 to be down
>> [...]
>> 2018-06-28T09:02:52.084409+02:00 siteb-2 corosync[189245]: [QUORUM]
>> Members[3]: 1 3 4
>> 2018-06-28T09:02:52.084492+02:00 siteb-2 corosync[189245]: [MAIN ]
>> Completed service synchronization, ready to provide service.
>> [...]
>> 2018-06-28T09:02:52.085210+02:00 siteb-2 kernel: [80872.012486] dlm: closing
>> connection to node 2
>> [...]
>> 2018-06-28T09:02:53.098683+02:00 siteb-2 pengine[189259]: warning:
>> Scheduling Node siteb-1 for STONITH
>
>> ---
>> Node sitea-1:
>> 2018-06-28T09:02:38.413748+02:00 sitea-1 corosync[6661]: [TOTEM ] A
>> processor failed, forming new configuration.
>> 2018-06-28T09:02:52.079905+02:00 sitea-1 corosync[6661]: [TOTEM ] A new
>> membership (192.168.121.55:2012) was formed. Members left: 2
>> 2018-06-28T09:02:52.080306+02:00 sitea-1 corosync[6661]: [TOTEM ] Failed
>> to receive the leave message. failed: 2
>> 2018-06-28T09:02:52.082619+02:00 sitea-1 cib[9021]: notice: Node siteb-1
>> state is now lost
>> [...]
>> 2018-06-28T09:02:52.083429+02:00 sitea-1 corosync[6661]: [QUORUM]
>> Members[3]: 1 3 4
>> 2018-06-28T09:02:52.083521+02:00 sitea-1 corosync[6661]: [MAIN ]
>> Completed service synchronization, ready to provide service.
>> 2018-06-28T09:02:52.083606+02:00 sitea-1 crmd[9031]: notice: Node siteb-1
>> state is now lost
>> 2018-06-28T09:02:52.084290+02:00 sitea-1 dlm_controld[73416]: 59514 fence
>> request 2 pid 171087 nodedown time 1530169372 fence_all dlm_stonith
>> 2018-06-28T09:02:52.085446+02:00 sitea-1 kernel: [59508.568940] dlm: closing
>> connection to node 2
>> 2018-06-28T09:02:52.109393+02:00 sitea-1 dlm_stonith: stonith_api_time:
>> Found 0 entries for 2/(null): 0 in progress, 0 completed
>> 2018-06-28T09:02:52.110167+02:00 sitea-1 stonith-ng[9022]: notice: Client
>> stonith-api.171087.d3c59fc2 wants to fence (reboot) '2' with device '(any)'
>> 2018-06-28T09:02:52.113257+02:00 sitea-1 stonith-ng[9022]: notice:
>> Requesting peer fencing (reboot) of siteb-1
>> 2018-06-28T09:03:29.096714+02:00 sitea-1 stonith-ng[9022]: notice:
>> Operation reboot of siteb-1 by sitea-2 for
>> stonith-api.171087@sitea-1.9fe08723: OK
>> 2018-06-28T09:03:29.097152+02:00 sitea-1 stonith-api[171087]:
>> stonith_api_kick: Node 2/(null) kicked: reboot
>> 2018-06-28T09:03:29.097426+02:00 sitea-1 crmd[9031]: notice: Peer lnx0361b
>> was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK
>> 2018-06-28T09:03:30.098657+02:00 sitea-1 dlm_controld[73416]: 59552 fence
>> result 2 pid 171087 result 0 exit status
>> 2018-06-28T09:03:30.099730+02:00 sitea-1 dlm_controld[73416]: 59552 fence
>> status 2 receive 0 from 1 walltime 1530169410 local 59552
>
>> ---
>> Node sitea-2:
>> 2018-06-28T09:02:38.412808+02:00 sitea-2 corosync[6570]: [TOTEM ] A
>> processor failed, forming new configuration.
>> 2018-06-28T09:02:52.078249+02:00 sitea-2 corosync[6570]: [TOTEM ] A new
>> membership (192.168.121.55:2012) was formed. Members left: 2
>> 2018-06-28T09:02:52.078359+02:00 sitea-2 corosync[6570]: [TOTEM ] Failed
>> to receive the leave message. failed: 2
>> 2018-06-28T09:02:52.081949+02:00 sitea-2 cib[9655]: notice: Node siteb-1
>> state is now lost
>> [...]
>> 2018-06-28T09:02:52.082653+02:00 sitea-2 corosync[6570]: [QUORUM]
>> Members[3]: 1 3 4
>> 2018-06-28T09:02:52.082739+02:00 sitea-2 corosync[6570]: [MAIN ]
>> Completed service synchronization, ready to provide service.
>> [...]
>> 2018-06-28T09:02:52.495697+02:00 sitea-2 stonith-ng[9656]: notice:
>> stonith-sbd can fence (reboot) siteb-1: dynamic-list
>> 2018-06-28T09:02