Re: [ClusterLabs] corosync/dlm fencing?

2018-07-19 Thread Jan Pokorný
On 19/07/18 17:25 +0200, Philipp Achmüller wrote:
> "Users"  schrieb am 18.07.2018 15:46:09:
>> if it's unclear: 0.17.2 is the lowest version that's fixed
> 
> following version is currently installed with SP3:
> 
> libqb0-1.0.1-2.15.x86_64

Then the only blind bet is that this patch to libqb might have helped:
https://github.com/ClusterLabs/libqb/commit/75ab31bdd05a15947dc56edf4d6b7f377355435e

The issue that patch fixes might still fit the overall picture.
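
Whether that commit was backported into libqb0-1.0.1-2.15 could be
checked against the package changelog, e.g. (a rough sketch, assuming
the changelog references upstream commit hashes):

  # rpm -q --changelog libqb0 | grep -i 75ab31b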

Other than that, it might also have been some kind of system overload
that prevented corosync from responding to its peers in a timely
fashion, for instance.
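
If overload is a suspect, one quick sanity check is the effective totem
timing corosync is actually running with (a sketch, assuming corosync
2.x, which exposes the runtime values via cmap):

  # corosync-cmapctl -g runtime.config.totem.token
  # corosync-cmapctl -g runtime.config.totem.consensus

A token timeout that is too tight for the given load or network can
result in exactly this kind of spurious membership loss.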

-- 
Nazdar,
Jan (Poki)




Re: [ClusterLabs] corosync/dlm fencing?

2018-07-18 Thread Jan Pokorný
Just minor clarifications (without changing the validity) below:

On 17/07/18 21:28 +0200, Jan Pokorný wrote:
> 
> On 16/07/18 11:44 +0200, Philipp Achmüller wrote:
>> Unfortunately it is not obvious to me - the "grep fence" output is
>> attached in my original message.
> 
> Sifting your logs a bit:
> 
>> ---
>> Node: siteb-2 (DC):
>> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]:   notice: Move stonith-sbd#011(Started sitea-1 -> siteb-1)
>> [...]
>> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]:   notice: Initiating stop operation stonith-sbd_stop_0 on sitea-1
>> [...]
>> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]:   notice: Initiating start operation stonith-sbd_start_0 on siteb-1
>> [...]
>> 2018-06-28T09:02:38.414440+02:00 siteb-2 corosync[189245]:   [TOTEM ] A processor failed, forming new configuration.
>> 2018-06-28T09:02:52.080141+02:00 siteb-2 corosync[189245]:   [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
>> 2018-06-28T09:02:52.080537+02:00 siteb-2 corosync[189245]:   [TOTEM ] Failed to receive the leave message. failed: 2
>> 2018-06-28T09:02:52.083415+02:00 siteb-2 attrd[189258]:   notice: Node siteb-1 state is now lost
>> [...]
>> 2018-06-28T09:02:52.084054+02:00 siteb-2 crmd[189260]:  warning: No reason to expect node 2 to be down
>> [...]
>> 2018-06-28T09:02:52.084409+02:00 siteb-2 corosync[189245]:   [QUORUM] Members[3]: 1 3 4
>> 2018-06-28T09:02:52.084492+02:00 siteb-2 corosync[189245]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> [...]
>> 2018-06-28T09:02:52.085210+02:00 siteb-2 kernel: [80872.012486] dlm: closing connection to node 2
>> [...]
>> 2018-06-28T09:02:53.098683+02:00 siteb-2 pengine[189259]:  warning: Scheduling Node siteb-1 for STONITH
> 
>> ---
>> Node sitea-1:
>> 2018-06-28T09:02:38.413748+02:00 sitea-1 corosync[6661]:   [TOTEM ] A processor failed, forming new configuration.
>> 2018-06-28T09:02:52.079905+02:00 sitea-1 corosync[6661]:   [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
>> 2018-06-28T09:02:52.080306+02:00 sitea-1 corosync[6661]:   [TOTEM ] Failed to receive the leave message. failed: 2
>> 2018-06-28T09:02:52.082619+02:00 sitea-1 cib[9021]:   notice: Node siteb-1 state is now lost
>> [...]
>> 2018-06-28T09:02:52.083429+02:00 sitea-1 corosync[6661]:   [QUORUM] Members[3]: 1 3 4
>> 2018-06-28T09:02:52.083521+02:00 sitea-1 corosync[6661]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> 2018-06-28T09:02:52.083606+02:00 sitea-1 crmd[9031]:   notice: Node siteb-1 state is now lost
>> 2018-06-28T09:02:52.084290+02:00 sitea-1 dlm_controld[73416]: 59514 fence request 2 pid 171087 nodedown time 1530169372 fence_all dlm_stonith
>> 2018-06-28T09:02:52.085446+02:00 sitea-1 kernel: [59508.568940] dlm: closing connection to node 2
>> 2018-06-28T09:02:52.109393+02:00 sitea-1 dlm_stonith: stonith_api_time: Found 0 entries for 2/(null): 0 in progress, 0 completed
>> 2018-06-28T09:02:52.110167+02:00 sitea-1 stonith-ng[9022]:   notice: Client stonith-api.171087.d3c59fc2 wants to fence (reboot) '2' with device '(any)'
>> 2018-06-28T09:02:52.113257+02:00 sitea-1 stonith-ng[9022]:   notice: Requesting peer fencing (reboot) of siteb-1
>> 2018-06-28T09:03:29.096714+02:00 sitea-1 stonith-ng[9022]:   notice: Operation reboot of siteb-1 by sitea-2 for stonith-api.171087@sitea-1.9fe08723: OK
>> 2018-06-28T09:03:29.097152+02:00 sitea-1 stonith-api[171087]: stonith_api_kick: Node 2/(null) kicked: reboot
>> 2018-06-28T09:03:29.097426+02:00 sitea-1 crmd[9031]:   notice: Peer lnx0361b was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK
>> 2018-06-28T09:03:30.098657+02:00 sitea-1 dlm_controld[73416]: 59552 fence result 2 pid 171087 result 0 exit status
>> 2018-06-28T09:03:30.099730+02:00 sitea-1 dlm_controld[73416]: 59552 fence status 2 receive 0 from 1 walltime 1530169410 local 59552
> 
>> ---
>> Node sitea-2:
>> 2018-06-28T09:02:38.412808+02:00 sitea-2 corosync[6570]:   [TOTEM ] A processor failed, forming new configuration.
>> 2018-06-28T09:02:52.078249+02:00 sitea-2 corosync[6570]:   [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
>> 2018-06-28T09:02:52.078359+02:00 sitea-2 corosync[6570]:   [TOTEM ] Failed to receive the leave message. failed: 2
>> 2018-06-28T09:02:52.081949+02:00 sitea-2 cib[9655]:   notice: Node siteb-1 state is now lost
>> [...]
>> 2018-06-28T09:02:52.082653+02:00 sitea-2 corosync[6570]:   [QUORUM] Members[3]: 1 3 4
>> 2018-06-28T09:02:52.082739+02:00 sitea-2 corosync[6570]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> [...]
>> 2018-06-28T09:02:52.495697+02:00 sitea-2 stonith-ng[9656]:   notice: stonith-sbd can fence (reboot) siteb-1: dynamic-list
>> 2018-06-28T09:02

[ClusterLabs] corosync/dlm fencing?

2018-07-15 Thread Philipp Achmüller
Hi!

I have a 4-node cluster running on SLES 12 SP3:
- pacemaker-1.1.16-4.8.x86_64
- corosync-2.3.6-9.5.1.x86_64

following configuration:

Stack: corosync
Current DC: sitea-2 (version 1.1.16-4.8-77ea74d) - partition with quorum
Last updated: Sun Jul 15 15:00:55 2018
Last change: Sat Jul 14 18:54:50 2018 by root via crm_resource on sitea-1

4 nodes configured
23 resources configured

Node sitea-1: online
1   (ocf::pacemaker:controld):  Active 
1   (ocf::lvm2:clvmd):  Active 
1   (ocf::pacemaker:SysInfo):   Active 
5   (ocf::heartbeat:VirtualDomain): Active 
1   (ocf::heartbeat:LVM):   Active 
Node siteb-1: online
1   (ocf::pacemaker:controld):  Active 
1   (ocf::lvm2:clvmd):  Active 
1   (ocf::pacemaker:SysInfo):   Active 
1   (ocf::heartbeat:VirtualDomain): Active 
1   (ocf::heartbeat:LVM):   Active 
Node sitea-2: online
1   (ocf::pacemaker:controld):  Active 
1   (ocf::lvm2:clvmd):  Active 
1   (ocf::pacemaker:SysInfo):   Active 
3   (ocf::heartbeat:VirtualDomain): Active 
1   (ocf::heartbeat:LVM):   Active 
Node siteb-2: online
1   (ocf::pacemaker:ClusterMon):Active 
3   (ocf::heartbeat:VirtualDomain): Active 
1   (ocf::pacemaker:SysInfo):   Active 
1   (stonith:external/sbd): Active 
1   (ocf::lvm2:clvmd):  Active 
1   (ocf::heartbeat:LVM):   Active 
1   (ocf::pacemaker:controld):  Active 

and these ordering/colocation constraints:
...
group base-group dlm clvm vg1
clone base-clone base-group \
meta interleave=true target-role=Started ordered=true
colocation colocation-VM-base-clone-INFINITY inf: VM base-clone
order order-base-clone-VM-mandatory base-clone:start VM:start
...

For maintenance I would like to put 1 or 2 nodes from "sitea" into
standby so that all resources move off these 2 machines. Everything
works fine until dlm stops as the last resource on these nodes; then
dlm_controld sends a fence request - sometimes targeting the remaining
online nodes, so that only 1 node is left online in the cluster.
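
The maintenance step is basically just putting the nodes into standby
via crmsh, roughly:

  # crm node standby sitea-1
  # crm node standby sitea-2

and "crm node online ..." again afterwards.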

messages:


2018-07-14T14:38:56.441157+02:00 siteb-1 dlm_controld[39725]: 678 fence request 3 pid 54428 startup time 1531571371 fence_all dlm_stonith
2018-07-14T14:38:56.445284+02:00 siteb-1 dlm_stonith: stonith_api_time: Found 0 entries for 3/(null): 0 in progress, 0 completed
2018-07-14T14:38:56.446033+02:00 siteb-1 stonith-ng[8085]:   notice: Client stonith-api.54428.ee6a7e02 wants to fence (reboot) '3' with device '(any)'
2018-07-14T14:38:56.446294+02:00 siteb-1 stonith-ng[8085]:   notice: Requesting peer fencing (reboot) of sitea-2
...
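
For cross-checking, the fencing history as recorded by pacemaker can be
queried on a surviving node (a sketch, assuming stonith_admin from
pacemaker 1.1.x):

  # stonith_admin --history '*'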

 # dlm_tool dump_config
daemon_debug=0
foreground=0
log_debug=0
timewarn=0
protocol=detect
debug_logfile=0
enable_fscontrol=0
enable_plock=1
plock_debug=0
plock_rate_limit=0
plock_ownership=0
drop_resources_time=1
drop_resources_count=10
drop_resources_age=1
post_join_delay=30
enable_fencing=1
enable_concurrent_fencing=0
enable_startup_fencing=0
repeat_failed_fencing=1
enable_quorum_fencing=1
enable_quorum_lockspace=1
help=-1
version=-1

How can I find out what is happening here?
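
In case additional dlm state is useful, it can be gathered with the dlm
userland tools, e.g. (a sketch):

  # dlm_tool status
  # dlm_tool ls
  # dlm_tool dump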
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org