Hi,

I just found an issue where a node that was scheduled for fencing is silently left unfenced.

This is a fairly large setup (2 cluster nodes and 8 remote ones) with
plenty of slowly starting resources (Lustre filesystem).

Fencing was triggered by a resource stop failure.
Lustre often starts very slowly due to internal recovery, and some Lustre
resources were starting in the same transition in which another resource
failed to stop.
Because that transition neither finished within the time specified by
"failure-timeout" (set to 9 min) nor was aborted, the stop failure expired
and was cleaned up.
There were transition aborts due to attribute changes after the stop failure
happened, but fencing was not initiated for some reason.
The node where the stop failed was the DC.
Pacemaker is 1.1.14-5a6cdd1 (from Fedora, built on EL7).
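
To make the timing concrete, here is a rough sketch of the arithmetic. It assumes the failure-timeout clock starts at the moment the stop failure was recorded (14:58:31 in the log), which is my reading and not something I verified in the code. The year is not in the log, so the 2016 below is only a placeholder.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpt. The year is an assumption,
# syslog does not record it.
stop_failed = datetime(2016, 4, 19, 14, 58, 31)    # "Action 2993 ... failed"
first_cleared = datetime(2016, 4, 19, 15, 7, 32)   # "Clearing expired failcount"

# failure-timeout as configured on this cluster (9 min).
failure_timeout = timedelta(minutes=9)

# Assumed expiry: time of the failure plus failure-timeout.
expires_at = stop_failed + failure_timeout

print(expires_at.time())            # 15:07:31
print(first_cleared >= expires_at)  # True
```

So the first "Clearing expired failcount" message shows up one second after the assumed expiry, while the transition that should have fenced mds1 was still in progress.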

Here is a log excerpt illustrating the above:
Apr 19 14:57:56 mds1 pengine[3452]:   notice: Move    mdt0-es03a-vg        
(Started mds1 -> mds0)
Apr 19 14:58:06 mds1 pengine[3452]:   notice: Move    mdt0-es03a-vg        
(Started mds1 -> mds0)
Apr 19 14:58:10 mds1 crmd[3453]:   notice: Initiating action 81: monitor 
mdt0-es03a-vg_monitor_0 on mds0
Apr 19 14:58:11 mds1 crmd[3453]:   notice: Initiating action 2993: stop 
mdt0-es03a-vg_stop_0 on mds1 (local)
Apr 19 14:58:11 mds1 LVM(mdt0-es03a-vg)[6228]: INFO: Deactivating volume group 
vg_mdt0_es03a
Apr 19 14:58:12 mds1 LVM(mdt0-es03a-vg)[6541]: ERROR: Logical volume 
vg_mdt0_es03a/mdt0 contains a filesystem in use. Can't deactivate volume group 
"vg_mdt0_es03a" with 1 open logical volume(s)
[...]
Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9939]: ERROR: LVM: vg_mdt0_es03a did 
not stop correctly
Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9943]: WARNING: vg_mdt0_es03a still 
Active
Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9947]: INFO: Retry deactivating volume 
group vg_mdt0_es03a
Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ 
ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
[...]
Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ 
ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
Apr 19 14:58:31 mds1 crmd[3453]:   notice: Operation mdt0-es03a-vg_stop_0: 
unknown error (node=mds1, call=324, rc=1, cib-update=1695, confirmed=true)
Apr 19 14:58:31 mds1 crmd[3453]:   notice: mds1-mdt0-es03a-vg_stop_0:324 [ 
ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: 
vg_mdt0_es03a did not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did 
not stop correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctl
Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) 
on mds1 failed (target: 0 vs. rc: 1): Error
Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) 
on mds1 failed (target: 0 vs. rc: 1): Error
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for 
mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for 
mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Node mds1 will be fenced because 
of resource failure(s)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from 
mds1 after 1000000 failures (max=1000000)
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:02:03 mds1 pengine[3452]:   notice: Stop of failed resource 
mdt0-es03a-vg is implicit after mds1 is fenced
Apr 19 15:02:03 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg        
(Started mds1 -> mds0)
[... many of these ]
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for 
mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for 
mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Node mds1 will be fenced because 
of resource failure(s)
Apr 19 15:07:22 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from 
mds1 after 1000000 failures (max=1000000)
Apr 19 15:07:23 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:07:23 mds1 pengine[3452]:   notice: Stop of failed resource 
mdt0-es03a-vg is implicit after mds1 is fenced
Apr 19 15:07:23 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg        
(Started mds1 -> mds0)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Processing failed op stop for 
mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Processing failed op stop for 
mdt0-es03a-vg on mds1: unknown error (1)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Node mds1 will be fenced because 
of resource failure(s)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from 
mds1 after 1000000 failures (max=1000000)
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:07:24 mds1 pengine[3452]:   notice: Stop of failed resource 
mdt0-es03a-vg is implicit after mds1 is fenced
Apr 19 15:07:24 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg        
(Started mds1 -> mds0)
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Clearing expired failcount for 
mdt0-es03a-vg on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Clearing expired failcount for 
mdt0-es03a-vg on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Ignoring expired calculated 
failure mdt0-es03a-vg_stop_0 (rc=1, 
magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Clearing expired failcount for 
mdt0-es03a-vg on mds1
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Ignoring expired calculated 
failure mdt0-es03a-vg_stop_0 (rc=1, 
magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
Apr 19 15:07:33 mds1 crmd[3453]:   notice: Initiating action 2016: monitor 
mdt0-es03a-vg_monitor_60000 on mds1 (local)
Apr 19 15:07:33 mds1 crmd[3453]:   notice: Transition aborted by deletion of 
nvpair[@id='status-2-fail-count-mdt0-es03a-vg']: Transient attribute change 
(cib=0.228.2601, source=abort_unless_down:343, 
path=/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-fail-count-mdt0-es03a-vg'],
 0)
Apr 19 15:10:09 mds1 pengine[3452]:   notice: Ignoring expired calculated 
failure mdt0-es03a-vg_stop_0 (rc=1, 
magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
Apr 19 15:12:40 mds1 pengine[3452]:   notice: Ignoring expired calculated 
failure mdt0-es03a-vg_stop_0 (rc=1, 
magic=0:1;2993:12:0:78064510-7295-489e-a1e2-201618c9f374) on mds1
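
In case somebody wants to pull the same timeline out of a full log, here is a minimal sketch. The patterns are just the message texts seen in the excerpt above, the function and pattern names are made up for illustration, and it expects unwrapped syslog lines (the excerpt in this mail is wrapped by the mail client).

```python
import re

# A few lines from the excerpt, unwrapped.
SAMPLE = """\
Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) on mds1 failed (target: 0 vs. rc: 1): Error
Apr 19 15:02:03 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:07:24 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
Apr 19 15:07:32 mds1 pengine[3452]:   notice: Clearing expired failcount for mdt0-es03a-vg on mds1
"""

# Hypothetical event names mapped to the message texts from the log.
PATTERNS = {
    "stop_failed": re.compile(r"Action \d+ \(\S+_stop_0\) on \S+ failed"),
    "fence_scheduled": re.compile(r"Scheduling Node \S+ for STONITH"),
    "failcount_cleared": re.compile(r"Clearing expired failcount for \S+ on \S+"),
}


def timeline(text):
    """Return (timestamp, event_name) pairs in the order they appear."""
    events = []
    for line in text.splitlines():
        stamp = line[:15]  # syslog prefix, e.g. "Apr 19 14:58:31"
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((stamp, name))
    return events


for stamp, name in timeline(SAMPLE):
    print(stamp, name)
```

On the sample this lists the stop failure, the two scheduling messages, and then the failcount being cleared, with no fencing in between.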

Best,
Vladislav
