On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
> On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
> > This problem occurred at one of our users' sites.
> >
> > The following problem occurred in a two-node cluster that does not
> > have STONITH configured.
> >
> > The problem seems to have occurred with the following procedure:
> >
> > Step 1) Configure the cluster with 2 nodes. The DC node is the
> > second node.
> > Step 2) Several resources are running on the first node.
> > Step 3) Both nodes are stopped at almost the same time, the 2nd
> > node first and then the 1st node.
>
> Do I decipher the above correctly that the cluster is scheduled for
> shutdown (fully independently node by node, or through a single
> trigger with a high-level management tool?) and starts proceeding in
> a serial manner, shutting down the 2nd node ~ the original DC first?
>
> > Step 4) After the second node stops, the first node tries to
> > calculate the state transition for the resource stop.
> >
> > However, crmd fails to connect to pengine and does not calculate
> > the state transition.
> >
> > -----
> > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client
> > connection failed, not adding channel to mainloop
> > -----
>
> Sadly, it looks like the details of why this happened would only have
> been retained if debugging/tracing verbosity of the log messages had
> been enabled, which likely wasn't the case.
>
> Anyway, perhaps providing a wider context of the log messages from
> this first node might shed some light on this.
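For context on what that warning means: crmd could not set up its local
(libqb-based) IPC client connection to pengine, so there was nothing to
hand over to its mainloop. Below is a minimal sketch of the kind of
libqb call that sits underneath such a connection attempt; this is
illustrative only, not the actual crmd code path, and the service name
and message size are assumptions:

-----
/* ipc_connect_sketch.c -- illustrative only, not the actual crmd code.
 * Shows the kind of libqb client connect underneath a failure like
 * "Setup of client connection failed, not adding channel to mainloop".
 * Build with: gcc ipc_connect_sketch.c -lqb */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <qb/qbipcc.h>

int main(void)
{
    /* "pengine" and the 128 KiB message size are illustrative values;
     * crmd goes through its own wrappers rather than calling this
     * directly. */
    qb_ipcc_connection_t *conn = qb_ipcc_connect("pengine", 128 * 1024);

    if (conn == NULL) {
        /* On failure, only errno hints at the real cause (e.g. ENOMEM,
         * EACCES, ECONNREFUSED); at default log levels that detail is
         * not recorded, which is why the log line says so little. */
        fprintf(stderr, "connect failed: %s\n", strerror(errno));
        return 1;
    }

    qb_ipcc_disconnect(conn);
    return 0;
}
-----

At default verbosity only the failure itself gets logged, not the errno
behind it, so without debug/trace logs the existing messages are about
all there is to go on.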
Agreed, that's probably the only hope. This would have to be a
low-level issue like an out-of-memory error, or something at the libqb
level.

> > As a result, Pacemaker stops without stopping the resource.
>
> This might have serious consequences in some scenarios, perhaps
> unless some watchdog-based solution (SBD?) was used as the fencing
> of choice, since it would not get defused precisely because the
> resource wasn't stopped, I think...

Yep, this is unavoidable in this situation. If the last node standing
has an unrecoverable problem, there's no other node remaining to fence
it and recover.

> > The problem seems to have occurred in the following environment:
> >
> > - libqb 1.0
> > - corosync 2.4.1
> > - Pacemaker 1.1.15
> >
> > I tried to reproduce this problem, but so far I have not been able
> > to reproduce it.
> >
> > Do you know the cause of this problem?

No idea at this point.
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org