> Hi,
>
> On 01/08/2022 16:18, john tillman wrote:
>>>>>> "john tillman" <jo...@panix.com> wrote on 29.07.2022 at 22:51 in
>>> message <beb30bf64d4c615aff6034000038118c.squir...@mail.panix.com>:
>>>>>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>>>>>
>>>>>>> I have a two cluster setup with a qdevice. 'pcs quorum status' from a
>>>>>>> cluster node shows the qdevice casting a vote. On the qdevice node
>>>>>>> 'corosync-qnetd-tool -s' says I have 2 connected clients and 1 cluster.
>>>>>>> The vote count looks correct when I shut down either one of the
>>>>>>> cluster nodes or the qdevice. So the voting seems to be working at
>>>>>>> this point.
>>>>>>
>>>>>> Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore
>>>>>> everything still awake knows what's going on.
>>>>>>
>>>>>>> From this state, if I reboot both my cluster nodes at the same time
>>>>>>
>>>>>> Ugh!
>>>>>>
>>>>>>> but leave the qdevice node running, the cluster will not see the
>>>>>>> qdevice when the nodes come back up: 'pcs quorum status' shows 3 votes
>>>>>>> expected but only 2 votes cast (from the cluster nodes).
>>>>>>
>>>>>> I would think this is to be expected, since if you reboot 2 out of 3
>>>>>> nodes, you completely lose quorum, so the single node left has no idea
>>>>>> what to trust when the other nodes return.
>>>>>
>>>>> No, no. I do have quorum after the reboots. It is courtesy of the 2
>>>>> cluster nodes casting their quorum votes. However, the qdevice is not
>>>>> casting a vote so I am down to 2 out of 3 nodes.
>>>>>
>>>>> And the qdevice is not part of the cluster. It will never have any
>>>>> resources running on it. Its job is just to vote.
>>>>>
>>>>> -John
>>>>>
>>>>
>>>> I thought maybe the problem was that the network wasn't ready when
>>>> corosync.service started so I forced a "ExecStartPre=/usr/bin/sleep 10"
>>>> into it but that didn't change anything.
>>>
>>> This type of fix is broken anyway: You are not delaying, you are waiting
>>> for an event (network up).
>>> Basically the OS distribution should have configured it correctly
>>> already.
>>>
>>> In SLES15 there is:
>>> Requires=network-online.target
>>> After=network-online.target
>>>
>>
>> Thank you for the response.
>>
>> Yes, I saw that those values were correctly set in the service
>> configuration file for corosync. The delay was just a test. I just wanted
>> to make sure that it wasn't a race condition of bringing up the bond and
>> trying to connect to the quorum node.
>>
>> I was grep'ing the corosync log for VOTEQ entries and noticed when it
>> works I see consecutively:
>> ... [VOTEQ ] Sending quorum callback, quorate = 0
>> ... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
>> When it does not work I never see the 'Received qdevice...' line in the
>> log. Is there something else I can look for to find this problem? Some
>> other test you can think of? Maybe some configuration of the votequorum
>> service?
>
> Maybe a good start is to get the cluster into the state with a
> "non-working" qdevice and then paste:
> - /var/log/messages of corosync/qdevice
> - output of `corosync-qdevice-tool -sv` (from nodes) and
>   `corosync-qnetd-tool -lv` (from machine where qnetd is running)
>
> "Received qdevice op 1 req from node 1 [QDevice]" means the qdevice is
> registered (= corosync-qdevice was started). If the line is really
> missing, it can mean corosync-qdevice is not running. The log, or running
> `corosync-qdevice -f -d`, should give some insight into why it is not
> running.
>
> Honza
>
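[Editorial sketch] Honza's point above is that the "Received qdevice op" line from VOTEQ indicates corosync-qdevice has registered with corosync. A rough way to automate the check John was doing with grep might look like the following. The marker strings come from the log excerpt quoted in this thread; the function name and the sample lines are illustrative assumptions, and real log lines carry timestamps and other prefixes in place of the "...".

```python
from typing import Iterable

# Marker text taken from the log excerpt quoted in this thread.
REGISTERED_MARKER = "Received qdevice op"


def qdevice_registered(log_lines: Iterable[str]) -> bool:
    """Return True if any VOTEQ line reports a qdevice registration."""
    for line in log_lines:
        if "[VOTEQ" in line and REGISTERED_MARKER in line:
            return True
    return False


# Hypothetical samples modelled on the two lines John quoted.
working = [
    "... [VOTEQ ] Sending quorum callback, quorate = 0",
    "... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]",
]
broken = ["... [VOTEQ ] Sending quorum callback, quorate = 0"]

print(qdevice_registered(working))  # True
print(qdevice_registered(broken))   # False
```

A False result does not say why the registration is missing; as Honza notes, the corosync-qdevice log or a foreground run with `corosync-qdevice -f -d` is the place to look next.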
My corosync-qdevice service was not enabled at boot. Sigh.

Thank you Honza for pointing that out! And thank you all for your patience
and attention.

John

>>>> I could still use some advice with debugging this oddity. Or have I
>>>> used up my quota of questions this year :-)
>>>>
>>>> -John
>>>>
>>>>>> Starting from a situation such as this, your only hope is to rebuild
>>>>>> the cluster from scratch, IMHO.
>>>>>>
>>>>>> Antony.
>>>>>>
>>>>>> --
>>>>>> Police have found a cartoonist dead in his house. They say that
>>>>>> details are currently sketchy.
>>>>>>
>>>>>> Please reply to the list; please *don't* CC me.
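[Editorial sketch] The resolution John reports (corosync-qdevice not enabled at boot) can be checked on a systemd host with `systemctl is-enabled`. The snippet below is a minimal sketch of that check, not something posted in the thread. The unit name `corosync-qdevice.service` is an assumption based on John's wording, the function names are hypothetical, and only the plain "enabled" state is treated as "starts at boot", which is a deliberate simplification.

```python
import subprocess


def starts_at_boot(is_enabled_output: str) -> bool:
    """Interpret the text printed by `systemctl is-enabled <unit>`.

    Simplification: only the plain "enabled" state counts as
    "starts at boot"; every other state is reported as False.
    """
    return is_enabled_output.strip() == "enabled"


def unit_starts_at_boot(unit: str = "corosync-qdevice.service") -> bool:
    """Ask systemd whether the (assumed) unit is enabled; needs systemctl."""
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,  # a disabled unit gives a non-zero exit status
    )
    return starts_at_boot(result.stdout)
```

Calling `unit_starts_at_boot()` on each cluster node would flag the condition John hit. If it returns False, `systemctl enable corosync-qdevice` is the usual systemd way to make a unit start at boot.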
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/