17.06.2016 15:05, Vladislav Bogdanov wrote:
03.05.2016 01:14, Ken Gaillot wrote:
On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
Hi,
I just found an issue where a node is silently left unfenced.
It is a fairly large setup (2 cluster nodes and 8 remote ones) with
plenty of slow-starting resources (Lustre filesystems).
Fencing was initiated due to resource stop failure.
Lustre often starts very slowly due to internal recovery, and some such
resources were starting in the same transition in which another resource
failed to stop.
Because the transition neither finished within the time specified by
"failure-timeout" (set to 9 min) nor was aborted, that stop
failure expired and was cleaned up.
There were transition aborts due to attribute changes after that
stop failure happened, but fencing
was not initiated for some reason.
Unfortunately, that makes sense with the current code. Failure timeout
changes the node attribute, which aborts the transition, which causes a
recalculation based on the new state, and the fencing is no longer needed.
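
As a rough illustration of the sequence described above, here is a toy model in Python. It is not Pacemaker code: the names StopFailure, failure_expired and needs_fencing are hypothetical, and the timestamps are made up. It only shows why a recalculation that happens after the failure has expired no longer sees a reason to fence.

```python
from dataclasses import dataclass

# "failure-timeout" from the report: 9 minutes, in seconds.
FAILURE_TIMEOUT = 9 * 60


@dataclass
class StopFailure:
    """Hypothetical record of a failed stop operation on a node."""
    node: str
    at: int  # seconds since the start of the transition


def failure_expired(failure, now, timeout=FAILURE_TIMEOUT):
    """True once the failure is older than the failure timeout."""
    return now - failure.at >= timeout


def needs_fencing(failures, now):
    """Nodes that still have an unexpired stop failure at time `now`."""
    return {f.node for f in failures if not failure_expired(f, now)}


# A stop fails one minute into a long-running transition.
failures = [StopFailure(node="node1", at=60)]

# Recalculation at 5 min: the failure is still active, fencing is wanted.
early = needs_fencing(failures, now=5 * 60)

# Recalculation at 12 min (transition still busy starting slow resources):
# the failure has expired, so nothing asks for fencing any more.
late = needs_fencing(failures, now=12 * 60)
```

In this model, a fix would have to remember that fencing was already decided for node1, independently of whether the failure record has since expired.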
Ken, could a fix for this one be considered before 1.1.15 is released?
I created https://github.com/ClusterLabs/pacemaker/pull/1072 for this.
It is an RFC, only compile-tested.
I hope it is correct; please tell me if I am doing something badly
wrong, or if there is a better way.
Best,
Vladislav