On 21/06/16 12:19 PM, Ken Gaillot wrote:
> On 06/17/2016 07:05 AM, Vladislav Bogdanov wrote:
>> 03.05.2016 01:14, Ken Gaillot wrote:
>>> On 04/19/2016 10:47 AM, Vladislav Bogdanov wrote:
>>>> Hi,
>>>>
>>>> Just found an issue with node is silently unfenced.
>>>>
>>>> That is quite large setup (2 cluster nodes and 8 remote ones) with
>>>> a plenty of slowly starting resources (lustre filesystem).
>>>>
>>>> Fencing was initiated due to resource stop failure.
>>>> lustre often starts very slowly due to internal recovery, and some such
>>>> resources were starting in that transition where another resource
>>>> failed to stop.
>>>> And, as transition did not finish in time specified by the
>>>> "failure-timeout" (set to 9 min), and was not aborted, that stop
>>>> failure was successfully cleaned.
>>>> There were transition aborts due to attribute changes, after that
>>>> stop failure happened, but fencing
>>>> was not initiated for some reason.
>>>
>>> Unfortunately, that makes sense with the current code. Failure timeout
>>> changes the node attribute, which aborts the transition, which causes a
>>> recalculation based on the new state, and the fencing is no longer
>>
>> Ken, could this one be considered to be fixed before 1.1.15 is released?
>
> I'm planning to release 1.1.15 later today, and this won't make it in.
>
> We do have several important open issues, including this one, but I
> don't want them to delay the release of the many fixes that are ready to
> go. I would only hold for a significant issue introduced this cycle, and
> none of the known issues appear to qualify.

I wonder if it would be worth appending a "known bugs/TODO" list to the
release announcements? Partly as a "heads-up" and partly as a way to show
folks what might be coming in .x+1.

>> I was just hit by the same in the completely different setup.
>> Two-node cluster, one node fails to stop a resource, and is fenced.
>> Right after that second node fails to activate clvm volume (different
>> story, need to investigate) and then fails to stop it. Node is scheduled
>> to be fenced, but it cannot be because first node didn't come up yet.
>> Any cleanup (automatic or manual) of a resource failed to stop clears
>> node state, removing "unclean" state from a node. That is probably not
>> what I could expect (resource cleanup is a node unfence)...
>> Honestly, this potentially leads to a data corruption...
>>
>> Also (probably not related) there was one more resource stop failure (in
>> that case - timeout) prior to failed stop mentioned above. And that stop
>> timeout did not lead to fencing by itself.
>>
>> I have logs (but not pe-inputs/traces/blackboxes) from both nodes, so
>> any additional information from them can be easily provided.
>>
>> Best regards,
>> Vladislav

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?