03.05.2016 01:14, Ken Gaillot wrote:
On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
Hi,

Just found an issue where a node is silently unfenced.

This is quite a large setup (2 cluster nodes and 8 remote ones) with
plenty of slowly starting resources (Lustre filesystem).

Fencing was initiated due to a resource stop failure.
Lustre often starts very slowly due to internal recovery, and several such
resources were starting in the same transition in which another resource
failed to stop.
Because the transition did not finish within the time specified by
"failure-timeout" (set to 9 min) and was not aborted, the stop failure was
cleaned up successfully.
There were transition aborts due to attribute changes after the stop failure
happened, but fencing was not initiated for some reason.

Unfortunately, that makes sense with the current code. Failure timeout
changes the node attribute, which aborts the transition, which causes a
recalculation based on the new state, and the fencing is no longer

Ken, could this be considered for fixing before 1.1.15 is released?
I was just hit by the same issue in a completely different setup.
Two-node cluster: one node fails to stop a resource and is fenced. Right after that, the second node fails to activate a clvm volume (a different story, which I need to investigate) and then fails to stop it. That node is scheduled to be fenced, but it cannot be, because the first node has not come up yet. Any cleanup (automatic or manual) of the resource that failed to stop clears the node state, removing the "unclean" state from the node. That is probably not what one would expect (a resource cleanup acting as a node unfence)...
Honestly, this can potentially lead to data corruption...
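
To make the sequence above concrete, here is a toy Python model of it. This is only a sketch and not Pacemaker code: the class, its methods, and the node/resource names ("node2", "clvm") are hypothetical and chosen for illustration. It assumes, as described in this thread, that the need to fence is recalculated purely from the failures currently on record, so an expired or cleaned-up stop failure also removes the pending fencing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model only; this is NOT how Pacemaker is implemented."""

    failure_timeout: int  # seconds, standing in for "failure-timeout"
    # (node, resource) -> time at which the stop failure was recorded
    stop_failures: dict = field(default_factory=dict)

    def record_stop_failure(self, node, resource, now):
        self.stop_failures[(node, resource)] = now

    def expire_failures(self, now):
        """Automatic cleanup: drop records older than failure_timeout."""
        self.stop_failures = {
            key: t
            for key, t in self.stop_failures.items()
            if now - t < self.failure_timeout
        }

    def cleanup(self, node, resource):
        """Manual cleanup of one resource's failure on one node."""
        self.stop_failures.pop((node, resource), None)

    def nodes_to_fence(self):
        """Recalculate from current state only: a node counts as unclean
        while at least one stop failure is still recorded for it."""
        return {node for (node, _resource) in self.stop_failures}


cluster = ToyCluster(failure_timeout=540)  # 9 min, as in the first report
cluster.record_stop_failure("node2", "clvm", now=0)
print(cluster.nodes_to_fence())  # {'node2'}: fencing is pending

# Fencing cannot run yet; meanwhile the failure record expires.
cluster.expire_failures(now=600)
print(cluster.nodes_to_fence())  # set(): the node is silently "unfenced"
```

In this model the pending fencing has no state of its own, which is the problem: once the failure record is gone, nothing remembers that the node was unclean.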

Also (probably not related), there was one more resource stop failure (in that case, a timeout) prior to the failed stop mentioned above, and that stop timeout did not lead to fencing by itself.

I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so any additional information from them can be easily provided.

Best regards,
Vladislav


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
