Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-06-21 Thread Digimer
On 21/06/16 12:19 PM, Ken Gaillot wrote:
> On 06/17/2016 07:05 AM, Vladislav Bogdanov wrote:
>> 03.05.2016 01:14, Ken Gaillot wrote:
>>> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>>>> Hi,
>>>>
>>>> Just found an issue where a node is silently unfenced.
>>>>
>>>> That is quite a large setup (2 cluster nodes and 8 remote ones) with
>>>> plenty of slowly starting resources (Lustre filesystem).
>>>>
>>>> Fencing was initiated due to a resource stop failure.
>>>> Lustre often starts very slowly due to internal recovery, and some such
>>>> resources were starting in the same transition in which another resource
>>>> failed to stop.
>>>> And, as the transition did not finish within the time specified by
>>>> "failure-timeout" (set to 9 min) and was not aborted, that stop
>>>> failure was successfully cleaned.
>>>> There were transition aborts due to attribute changes after that
>>>> stop failure happened, but fencing
>>>> was not initiated for some reason.
>>>
>>> Unfortunately, that makes sense with the current code. Failure timeout
>>> changes the node attribute, which aborts the transition, which causes a
>>> recalculation based on the new state, and the fencing is no longer needed.
>>
>> Ken, could this one be considered for a fix before 1.1.15 is released?
> 
> I'm planning to release 1.1.15 later today, and this won't make it in.
> 
> We do have several important open issues, including this one, but I
> don't want them to delay the release of the many fixes that are ready to
> go. I would only hold for a significant issue introduced this cycle, and
> none of the known issues appear to qualify.

I wonder if it would be worth appending a "known bugs/TODO" list to the
release announcements? Partly as a "heads-up" and partly as a way to
show folks what might be coming in .x+1.

>> I was just hit by the same issue in a completely different setup.
>> Two-node cluster: one node fails to stop a resource, and is fenced.
>> Right after that, the second node fails to activate a clvm volume (different
>> story, need to investigate) and then fails to stop it. The node is scheduled
>> to be fenced, but it cannot be, because the first node hasn't come up yet.
>> Any cleanup (automatic or manual) of a resource that failed to stop clears
>> the node state, removing the "unclean" state from the node. That is probably
>> not what I would expect (resource cleanup acting as a node unfence)...
>> Honestly, this potentially leads to data corruption...
>>
>> Also (probably not related), there was one more resource stop failure (in
>> that case a timeout) prior to the failed stop mentioned above. And that stop
>> timeout did not lead to fencing by itself.
>>
>> I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so
>> any additional information from them can be easily provided.
>>
>> Best regards,
>> Vladislav


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-06-21 Thread Ken Gaillot
On 06/17/2016 07:05 AM, Vladislav Bogdanov wrote:
> 03.05.2016 01:14, Ken Gaillot wrote:
>> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>>> Hi,
>>>
>>> Just found an issue where a node is silently unfenced.
>>>
>>> That is quite a large setup (2 cluster nodes and 8 remote ones) with
>>> plenty of slowly starting resources (Lustre filesystem).
>>>
>>> Fencing was initiated due to a resource stop failure.
>>> Lustre often starts very slowly due to internal recovery, and some such
>>> resources were starting in the same transition in which another resource
>>> failed to stop.
>>> And, as the transition did not finish within the time specified by
>>> "failure-timeout" (set to 9 min) and was not aborted, that stop
>>> failure was successfully cleaned.
>>> There were transition aborts due to attribute changes after that
>>> stop failure happened, but fencing
>>> was not initiated for some reason.
>>
>> Unfortunately, that makes sense with the current code. Failure timeout
>> changes the node attribute, which aborts the transition, which causes a
>> recalculation based on the new state, and the fencing is no longer needed.
> 
> Ken, could this one be considered for a fix before 1.1.15 is released?

I'm planning to release 1.1.15 later today, and this won't make it in.

We do have several important open issues, including this one, but I
don't want them to delay the release of the many fixes that are ready to
go. I would only hold for a significant issue introduced this cycle, and
none of the known issues appear to qualify.

> I was just hit by the same issue in a completely different setup.
> Two-node cluster: one node fails to stop a resource, and is fenced.
> Right after that, the second node fails to activate a clvm volume (different
> story, need to investigate) and then fails to stop it. The node is scheduled
> to be fenced, but it cannot be, because the first node hasn't come up yet.
> Any cleanup (automatic or manual) of a resource that failed to stop clears
> the node state, removing the "unclean" state from the node. That is probably
> not what I would expect (resource cleanup acting as a node unfence)...
> Honestly, this potentially leads to data corruption...
> 
> Also (probably not related), there was one more resource stop failure (in
> that case a timeout) prior to the failed stop mentioned above. And that stop
> timeout did not lead to fencing by itself.
> 
> I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so
> any additional information from them can be easily provided.
> 
> Best regards,
> Vladislav



Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-06-17 Thread Vladislav Bogdanov

17.06.2016 15:05, Vladislav Bogdanov wrote:
> 03.05.2016 01:14, Ken Gaillot wrote:
>> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>>> Hi,
>>>
>>> Just found an issue where a node is silently unfenced.
>>>
>>> That is quite a large setup (2 cluster nodes and 8 remote ones) with
>>> plenty of slowly starting resources (Lustre filesystem).
>>>
>>> Fencing was initiated due to a resource stop failure.
>>> Lustre often starts very slowly due to internal recovery, and some such
>>> resources were starting in the same transition in which another resource
>>> failed to stop.
>>> And, as the transition did not finish within the time specified by
>>> "failure-timeout" (set to 9 min) and was not aborted, that stop
>>> failure was successfully cleaned.
>>> There were transition aborts due to attribute changes after that
>>> stop failure happened, but fencing
>>> was not initiated for some reason.
>>
>> Unfortunately, that makes sense with the current code. Failure timeout
>> changes the node attribute, which aborts the transition, which causes a
>> recalculation based on the new state, and the fencing is no longer needed.
>
> Ken, could this one be considered for a fix before 1.1.15 is released?

I created https://github.com/ClusterLabs/pacemaker/pull/1072 for this.
It is an RFC, tested only to compile.
I hope it is correct; please tell me if I am doing something damn wrong,
or if there could be a better way.


Best,
Vladislav




Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-06-17 Thread Vladislav Bogdanov

03.05.2016 01:14, Ken Gaillot wrote:
> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>> Hi,
>>
>> Just found an issue where a node is silently unfenced.
>>
>> That is quite a large setup (2 cluster nodes and 8 remote ones) with
>> plenty of slowly starting resources (Lustre filesystem).
>>
>> Fencing was initiated due to a resource stop failure.
>> Lustre often starts very slowly due to internal recovery, and some such
>> resources were starting in the same transition in which another resource
>> failed to stop.
>> And, as the transition did not finish within the time specified by
>> "failure-timeout" (set to 9 min) and was not aborted, that stop failure
>> was successfully cleaned.
>> There were transition aborts due to attribute changes after that stop
>> failure happened, but fencing was not initiated for some reason.
>
> Unfortunately, that makes sense with the current code. Failure timeout
> changes the node attribute, which aborts the transition, which causes a
> recalculation based on the new state, and the fencing is no longer needed.


Ken, could this one be considered for a fix before 1.1.15 is released?
I was just hit by the same issue in a completely different setup.
Two-node cluster: one node fails to stop a resource, and is fenced.
Right after that, the second node fails to activate a clvm volume (different
story, need to investigate) and then fails to stop it. The node is scheduled
to be fenced, but it cannot be, because the first node hasn't come up yet.
Any cleanup (automatic or manual) of a resource that failed to stop clears
the node state, removing the "unclean" state from the node. That is probably
not what I would expect (resource cleanup acting as a node unfence)...

Honestly, this potentially leads to data corruption...
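
To make the effect concrete, here is a minimal toy sketch in plain Python
(this is not Pacemaker source; the record layout and the names "node2" and
"clvm-vg" are assumptions made up for illustration) of why clearing the
failed-stop record also clears the pending fencing:

    # Toy model: a node is treated as unclean only while a failed stop
    # is still recorded in the status section.
    def node_is_unclean(status, node):
        return any(rec["node"] == node and rec["op"] == "stop"
                   for rec in status["failures"])

    # A cleanup (e.g. crm_resource --cleanup, or an expired failure-timeout)
    # is modelled here as simply deleting that record.
    def cleanup(status, resource):
        status["failures"] = [rec for rec in status["failures"]
                              if rec["resource"] != resource]

    status = {"failures": [{"node": "node2", "resource": "clvm-vg", "op": "stop"}]}
    print(node_is_unclean(status, "node2"))  # True  -> fencing still pending
    cleanup(status, "clvm-vg")
    print(node_is_unclean(status, "node2"))  # False -> node silently "unfenced"

The next recalculation sees no failure, so it no longer schedules the fencing.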

Also (probably not related), there was one more resource stop failure (in
that case a timeout) prior to the failed stop mentioned above. And that stop
timeout did not lead to fencing by itself.


I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so 
any additional information from them can be easily provided.


Best regards,
Vladislav




Re: [ClusterLabs] Node is silently unfenced if transition is very long

2016-05-02 Thread Ken Gaillot
On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
> Hi,
> 
> Just found an issue where a node is silently unfenced.
> 
> That is quite a large setup (2 cluster nodes and 8 remote ones) with
> plenty of slowly starting resources (Lustre filesystem).
> 
> Fencing was initiated due to a resource stop failure.
> Lustre often starts very slowly due to internal recovery, and some such
> resources were starting in the same transition in which another resource
> failed to stop.
> And, as the transition did not finish within the time specified by
> "failure-timeout" (set to 9 min) and was not aborted, that stop failure
> was successfully cleaned.
> There were transition aborts due to attribute changes after that stop
> failure happened, but fencing
> was not initiated for some reason.

Unfortunately, that makes sense with the current code. Failure timeout
changes the node attribute, which aborts the transition, which causes a
recalculation based on the new state, and the fencing is no longer
needed. I'll make a note to investigate a fix, but feel free to file a
bug report at bugs.clusterlabs.org for tracking purposes.
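
To spell the sequence out, here is a minimal toy sketch in plain Python (not
Pacemaker source; the status layout is only an assumption for illustration),
where each recalculation derives "needs fencing" solely from unexpired
failure records:

    FAILURE_TIMEOUT = 9 * 60  # seconds, matching failure-timeout=9min above

    # Stateless recalculation: fencing is scheduled only while an unexpired
    # stop failure is still present in the status section.
    def calculate_actions(status, now):
        return [("fence", rec["node"])
                for rec in status["failures"]
                if rec["op"] == "stop" and now - rec["when"] < FAILURE_TIMEOUT]

    status = {"failures": [{"node": "mds1", "op": "stop", "when": 0}]}
    print(calculate_actions(status, now=60))       # [('fence', 'mds1')]
    # The transition keeps running (slow Lustre starts), an attribute change
    # aborts it, and the recalculation happens after failure-timeout expired:
    print(calculate_actions(status, now=10 * 60))  # [] -- fencing silently dropped

Because each run starts from the current status, nothing remembers that
fencing was already scheduled in the aborted transition.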

> The node where the stop failed was the DC.
> pacemaker is 1.1.14-5a6cdd1 (from Fedora, built on EL7)
> 
> Here is log excerpt illustrating the above:
> Apr 19 14:57:56 mds1 pengine[3452]:   notice: Move    mdt0-es03a-vg
> (Started mds1 -> mds0)
> Apr 19 14:58:06 mds1 pengine[3452]:   notice: Move    mdt0-es03a-vg
> (Started mds1 -> mds0)
> Apr 19 14:58:10 mds1 crmd[3453]:   notice: Initiating action 81: monitor 
> mdt0-es03a-vg_monitor_0 on mds0
> Apr 19 14:58:11 mds1 crmd[3453]:   notice: Initiating action 2993: stop 
> mdt0-es03a-vg_stop_0 on mds1 (local)
> Apr 19 14:58:11 mds1 LVM(mdt0-es03a-vg)[6228]: INFO: Deactivating volume 
> group vg_mdt0_es03a
> Apr 19 14:58:12 mds1 LVM(mdt0-es03a-vg)[6541]: ERROR: Logical volume 
> vg_mdt0_es03a/mdt0 contains a filesystem in use. Can't deactivate volume 
> group "vg_mdt0_es03a" with 1 open logical volume(s)
> [...]
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9939]: ERROR: LVM: vg_mdt0_es03a did 
> not stop correctly
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9943]: WARNING: vg_mdt0_es03a still 
> Active
> Apr 19 14:58:30 mds1 LVM(mdt0-es03a-vg)[9947]: INFO: Retry deactivating 
> volume group vg_mdt0_es03a
> Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ 
> ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> [...]
> Apr 19 14:58:31 mds1 lrmd[3450]:   notice: mdt0-es03a-vg_stop_0:5865:stderr [ 
> ocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctly ]
> Apr 19 14:58:31 mds1 crmd[3453]:   notice: Operation mdt0-es03a-vg_stop_0: 
> unknown error (node=mds1, call=324, rc=1, cib-update=1695, confirmed=true)
> Apr 19 14:58:31 mds1 crmd[3453]:   notice: mds1-mdt0-es03a-vg_stop_0:324 [ 
> ocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop 
> correctly\nocf-exit-reason:LVM: vg_mdt0_es03a did not stop correctl
> Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) 
> on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 14:58:31 mds1 crmd[3453]:  warning: Action 2993 (mdt0-es03a-vg_stop_0) 
> on mds1 failed (target: 0 vs. rc: 1): Error
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Node mds1 will be fenced 
> because of resource failure(s)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from 
> mds1 after 100 failures (max=100)
> Apr 19 15:02:03 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
> Apr 19 15:02:03 mds1 pengine[3452]:   notice: Stop of failed resource 
> mdt0-es03a-vg is implicit after mds1 is fenced
> Apr 19 15:02:03 mds1 pengine[3452]:   notice: Recover mdt0-es03a-vg
> (Started mds1 -> mds0)
> [... many of these ]
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Processing failed op stop for 
> mdt0-es03a-vg on mds1: unknown error (1)
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Node mds1 will be fenced 
> because of resource failure(s)
> Apr 19 15:07:22 mds1 pengine[3452]:  warning: Forcing mdt0-es03a-vg away from 
> mds1 after 100 failures (max=100)
> Apr 19 15:07:23 mds1 pengine[3452]:  warning: Scheduling Node mds1 for STONITH
> Apr 19 15:07:23 mds1 pengine[3452]:   notice: Stop of failed resource 
>