Hi Jan, Hi Ken,

Thanks for your comments.

I am going to look into the libqb problem a little more.
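For reference, my understanding is that crmd reaches pengine through a
libqb IPC client connection, so the "Setup of client connection failed"
warning quoted below would correspond to a failure at that level (for
example qb_ipcc_connect() returning NULL with errno set to something like
ENOMEM or EMFILE). Purely as an illustrative sketch of that API, not the
actual Pacemaker code, and with the server name "pengine" and the buffer
size being my own assumptions:

/* Illustrative libqb IPC client connect; not the actual crmd code. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <qb/qbipcc.h>

int main(void)
{
    /* "pengine" and the 128 KiB buffer size are assumptions for this sketch. */
    qb_ipcc_connection_t *conn = qb_ipcc_connect("pengine", 128 * 1024);

    if (conn == NULL) {
        /* A NULL return here is the kind of low-level failure that could
         * surface as "Setup of client connection failed" in crmd's log. */
        fprintf(stderr, "connect to pengine failed: %s\n", strerror(errno));
        return 1;
    }

    qb_ipcc_disconnect(conn);
    return 0;
}

If that is indeed where it fails, then as Jan says below, only debug/trace
level logging (e.g. PCMK_debug in /etc/sysconfig/pacemaker, if I recall
correctly) would show the underlying errno.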
Many thanks,
Hideo Yamauchi.

----- Original Message -----
> From: Ken Gaillot <kgail...@redhat.com>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> Cc:
> Date: 2019/1/3, Thu 01:26
> Subject: Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
>
> On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
>> On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
>> > This problem occurred with our users.
>> >
>> > The following problem occurred in a two-node cluster that does not
>> > set STONITH.
>> >
>> > The problem seems to have occurred in the following procedure.
>> >
>> > Step 1) Configure the cluster with 2 nodes. The DC node is the
>> > second node.
>> > Step 2) Several resources are running on the first node.
>> > Step 3) It stops almost at the same time in order of 2nd node and
>> > 1st node.
>>
>> Do I decipher the above correctly that the cluster is scheduled for
>> shutdown (fully independently node by node or through a single trigger
>> with a high level management tool?) and starts proceeding in serial
>> manner, shutting 2nd node ~ original DC first?
>>
>> > Step 4) After the second node stops, the first node tries to
>> > calculate the state transition for the resource stop.
>> >
>> > However, crmd fails to connect with pengine and does not calculate
>> > state transitions.
>> >
>> > -----
>> > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client
>> > connection failed, not adding channel to mainloop
>> > -----
>>
>> Sadly, it looks like details of why this happened would only be
>> retained when debugging/tracing verbosity of the log messages
>> was enabled, which likely wasn't the case.
>>
>> Anyway, perhaps providing a wider context of the log messages
>> from this first node might shed some light into this.
>
> Agreed, that's probably the only hope.
>
> This would have to be a low-level issue like an out-of-memory error, or
> something at the libqb level.
>
>> > As a result, Pacemaker will stop without stopping the resource.
>>
>> This might have serious consequences in some scenarios, perhaps
>> unless some watchdog-based solution (SBD?) was used as a fencing
>> of choice since it would not get defused just as the resource
>> wasn't stopped, I think...
>
> Yep, this is unavoidable in this situation. If the last node standing
> has an unrecoverable problem, there's no other node remaining to fence
> it and recover.
>
>> > The problem seems to have occurred in the following environment.
>> >
>> > - libqb 1.0
>> > - corosync 2.4.1
>> > - Pacemaker 1.1.15
>> >
>> > I tried to reproduce this problem, but for now it can not be
>> > reproduced.
>> >
>> > Do you know the cause of this problem?
>>
>> No idea at this point.
> --
> Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org