Re: [ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart

> Hi,
>
> On 01/08/2022 16:18, john tillman wrote:
>>> "john tillman" wrote on 29.07.2022 at 22:51 in message:
>>>>>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>>>>>
>>>>>>> I have a two-node cluster setup with a qdevice. 'pcs quorum
>>>>>>> status' from a cluster node shows the qdevice casting a vote.
>>>>>>> On the qdevice node 'corosync-qnetd-tool -s' says I have 2
>>>>>>> connected clients and 1 cluster. The vote count looks correct
>>>>>>> when I shut down either one of the cluster nodes or the
>>>>>>> qdevice. So the voting seems to be working at this point.
>>>>>>
>>>>>> Indeed - shutting down 1 of 3 nodes leaves quorum intact,
>>>>>> therefore everything still awake knows what's going on.
>>>>>>
>>>>>>> From this state, if I reboot both my cluster nodes at the
>>>>>>> same time
>>>>>>
>>>>>> Ugh!
>>>>>>
>>>>>>> but leave the qdevice node running, the cluster will not see
>>>>>>> the qdevice when the nodes come back up: 'pcs quorum status'
>>>>>>> shows 3 votes expected but only 2 votes cast (from the
>>>>>>> cluster nodes).
>>>>>>
>>>>>> I would think this is to be expected, since if you reboot 2
>>>>>> out of 3 nodes, you completely lose quorum, so the single node
>>>>>> left has no idea what to trust when the other nodes return.
>>>>>
>>>>> No, no. I do have quorum after the reboots. It is courtesy of
>>>>> the 2 cluster nodes casting their quorum votes. However, the
>>>>> qdevice is not casting a vote so I am down to 2 out of 3 nodes.
>>>>>
>>>>> And the qdevice is not part of the cluster. It will never have
>>>>> any resources running on it. Its job is just to vote.
>>>>>
>>>>> -John
>>>>
>>>> I thought maybe the problem was that the network wasn't ready
>>>> when corosync.service started so I forced an
>>>> "ExecStartPre=/usr/bin/sleep 10" into it but that didn't change
>>>> anything.
>>>
>>> This type of fix is broken anyway: You are not delaying, you are
>>> waiting for an event (network up).
>>> Basically the OS distribution should have configured it correctly
>>> already.
>>>
>>> In SLES15 there is:
>>> Requires=network-online.target
>>> After=network-online.target
>>
>> Thank you for the response.
>>
>> Yes, I saw that those values were correctly set in the service
>> configuration file for corosync. The delay was just a test. I just
>> wanted to make sure that it wasn't a race condition of bringing up
>> the bond and trying to connect to the quorum node.
>>
>> I was grep'ing the corosync log for VOTEQ entries and noticed when
>> it works I see consecutively:
>> ... [VOTEQ ] Sending quorum callback, quorate = 0
>> ... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
>> When it does not work I never see the 'Received qdevice...' line in
>> the log. Is there something else I can look for to find this
>> problem? Some other test you can think of? Maybe some configuration
>> of the votequorum service?
>
> Maybe a good start is to get the cluster into the state of "non
> working" qdevice and then paste:
> - /var/log/messages of corosync/qdevice
> - output of `corosync-qdevice-tool -sv` (from nodes) and
>   `corosync-qnetd-tool -lv` (from machine where qnetd is running)
>
> "Received qdevice op 1 req from node 1 [QDevice]" means the qdevice
> is registered (= corosync-qdevice was started). If the line is really
> missing, it can mean corosync-qdevice is not running; the log, or
> running `corosync-qdevice -f -d`, should give some insight into why
> it is not running.
>
> Honza

My corosync-qdevice service was not enabled at boot. Sigh.

Thank you Honza for pointing that out! And thank you all for your
patience and attention.

John

>>>> I could still use some advice with debugging this oddity. Or have
>>>> I used up my quota of questions this year :-)
>>>>
>>>> -John
>>>>>>
>>>>>> Starting from a situation such as this, your only hope is to
>>>>>> rebuild the cluster from scratch, IMHO.
>>>>>>
>>>>>> Antony.
>>>>>>
>>>>>> --
>>>>>> Police have found a cartoonist dead in his house. They say
>>>>>> that details are currently sketchy.
>>>>>>
>>>>>> Please reply to the list; please *don't* CC me.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
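
For anyone who lands on this thread with the same symptom, here is a minimal sketch of the check that would have caught it. It relies on `systemctl is-enabled`, which prints a state word such as `enabled` or `disabled`. The unit names are taken from this thread and may differ on your distribution, and the helper `units_not_enabled` is hypothetical. Note that it flags every state other than `enabled` (including `static`), so treat the result as a list of units to look at, not as a verdict.

```python
import subprocess

# Unit names assumed from this thread; adjust to your distribution.
UNITS = ("corosync.service", "corosync-qdevice.service")


def systemctl_is_enabled(unit):
    """Return the state word printed by `systemctl is-enabled UNIT`."""
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True, text=True, check=False,
    )
    # systemctl exits non-zero for disabled units, so only stdout is used.
    return result.stdout.strip() or "unknown"


def units_not_enabled(units=UNITS, query=systemctl_is_enabled):
    """Hypothetical helper: list units whose state is not 'enabled'."""
    return [unit for unit in units if query(unit) != "enabled"]
```

On a cluster node, `units_not_enabled()` returning `corosync-qdevice.service` points at the situation described above; `systemctl enable corosync-qdevice` then makes the service start at boot.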
Re: [ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart

Hi,

On 01/08/2022 16:18, john tillman wrote:
>> "john tillman" wrote on 29.07.2022 at 22:51 in message:
>>>>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>>>>
>>>>>> I have a two-node cluster setup with a qdevice. 'pcs quorum
>>>>>> status' from a cluster node shows the qdevice casting a vote.
>>>>>> On the qdevice node 'corosync-qnetd-tool -s' says I have 2
>>>>>> connected clients and 1 cluster. The vote count looks correct
>>>>>> when I shut down either one of the cluster nodes or the
>>>>>> qdevice. So the voting seems to be working at this point.
>>>>>
>>>>> Indeed - shutting down 1 of 3 nodes leaves quorum intact,
>>>>> therefore everything still awake knows what's going on.
>>>>>
>>>>>> From this state, if I reboot both my cluster nodes at the same
>>>>>> time
>>>>>
>>>>> Ugh!
>>>>>
>>>>>> but leave the qdevice node running, the cluster will not see
>>>>>> the qdevice when the nodes come back up: 'pcs quorum status'
>>>>>> shows 3 votes expected but only 2 votes cast (from the cluster
>>>>>> nodes).
>>>>>
>>>>> I would think this is to be expected, since if you reboot 2 out
>>>>> of 3 nodes, you completely lose quorum, so the single node left
>>>>> has no idea what to trust when the other nodes return.
>>>>
>>>> No, no. I do have quorum after the reboots. It is courtesy of
>>>> the 2 cluster nodes casting their quorum votes. However, the
>>>> qdevice is not casting a vote so I am down to 2 out of 3 nodes.
>>>>
>>>> And the qdevice is not part of the cluster. It will never have
>>>> any resources running on it. Its job is just to vote.
>>>>
>>>> -John
>>>
>>> I thought maybe the problem was that the network wasn't ready
>>> when corosync.service started so I forced an
>>> "ExecStartPre=/usr/bin/sleep 10" into it but that didn't change
>>> anything.
>>
>> This type of fix is broken anyway: You are not delaying, you are
>> waiting for an event (network up).
>> Basically the OS distribution should have configured it correctly
>> already.
>>
>> In SLES15 there is:
>> Requires=network-online.target
>> After=network-online.target
>
> Thank you for the response.
>
> Yes, I saw that those values were correctly set in the service
> configuration file for corosync. The delay was just a test. I just
> wanted to make sure that it wasn't a race condition of bringing up
> the bond and trying to connect to the quorum node.
>
> I was grep'ing the corosync log for VOTEQ entries and noticed when it
> works I see consecutively:
> ... [VOTEQ ] Sending quorum callback, quorate = 0
> ... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
> When it does not work I never see the 'Received qdevice...' line in
> the log. Is there something else I can look for to find this problem?
> Some other test you can think of? Maybe some configuration of the
> votequorum service?

Maybe a good start is to get the cluster into the state of "non
working" qdevice and then paste:
- /var/log/messages of corosync/qdevice
- output of `corosync-qdevice-tool -sv` (from nodes) and
  `corosync-qnetd-tool -lv` (from machine where qnetd is running)

"Received qdevice op 1 req from node 1 [QDevice]" means the qdevice is
registered (= corosync-qdevice was started). If the line is really
missing, it can mean corosync-qdevice is not running; the log, or
running `corosync-qdevice -f -d`, should give some insight into why it
is not running.

Honza

>>> I could still use some advice with debugging this oddity. Or have
>>> I used up my quota of questions this year :-)
>>>
>>> -John
>>>>>
>>>>> Starting from a situation such as this, your only hope is to
>>>>> rebuild the cluster from scratch, IMHO.
>>>>>
>>>>> Antony.
>>>>>
>>>>> --
>>>>> Police have found a cartoonist dead in his house. They say that
>>>>> details are currently sketchy.
>>>>>
>>>>> Please reply to the list; please *don't* CC me.
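
The log check discussed in this message can be sketched as a small script. It looks for the "Received qdevice op 1 req from node N" line quoted in the thread and reports which node IDs registered a qdevice. The function name `qdevice_registrations` and the regular expression are assumptions made for illustration; the exact log prefix and file location depend on your logging setup.

```python
import re

# Matches the registration line quoted in this thread, e.g.
# "... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]".
REGISTERED = re.compile(
    r"\[VOTEQ\s*\]\s+Received qdevice op 1 req from node (\d+)"
)


def qdevice_registrations(log_text):
    """Return node IDs (as ints) that registered a qdevice, in log order."""
    return [int(match.group(1)) for match in REGISTERED.finditer(log_text)]
```

An empty result after a reboot matches the "non working" case: votequorum never heard from corosync-qdevice, so the next thing to look at is whether that daemon is running at all.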
Re: [ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart

> "john tillman" wrote on 29.07.2022 at 22:51 in message:
>>>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>>>
>>>>> I have a two-node cluster setup with a qdevice. 'pcs quorum
>>>>> status' from a cluster node shows the qdevice casting a vote.
>>>>> On the qdevice node 'corosync-qnetd-tool -s' says I have 2
>>>>> connected clients and 1 cluster. The vote count looks correct
>>>>> when I shut down either one of the cluster nodes or the qdevice.
>>>>> So the voting seems to be working at this point.
>>>>
>>>> Indeed - shutting down 1 of 3 nodes leaves quorum intact,
>>>> therefore everything still awake knows what's going on.
>>>>
>>>>> From this state, if I reboot both my cluster nodes at the same
>>>>> time
>>>>
>>>> Ugh!
>>>>
>>>>> but leave the qdevice node running, the cluster will not see the
>>>>> qdevice when the nodes come back up: 'pcs quorum status' shows 3
>>>>> votes expected but only 2 votes cast (from the cluster nodes).
>>>>
>>>> I would think this is to be expected, since if you reboot 2 out
>>>> of 3 nodes, you completely lose quorum, so the single node left
>>>> has no idea what to trust when the other nodes return.
>>>
>>> No, no. I do have quorum after the reboots. It is courtesy of the
>>> 2 cluster nodes casting their quorum votes. However, the qdevice
>>> is not casting a vote so I am down to 2 out of 3 nodes.
>>>
>>> And the qdevice is not part of the cluster. It will never have any
>>> resources running on it. Its job is just to vote.
>>>
>>> -John
>>
>> I thought maybe the problem was that the network wasn't ready when
>> corosync.service started so I forced an
>> "ExecStartPre=/usr/bin/sleep 10" into it but that didn't change
>> anything.
>
> This type of fix is broken anyway: You are not delaying, you are
> waiting for an event (network up).
> Basically the OS distribution should have configured it correctly
> already.
>
> In SLES15 there is:
> Requires=network-online.target
> After=network-online.target

Thank you for the response.

Yes, I saw that those values were correctly set in the service
configuration file for corosync. The delay was just a test. I just
wanted to make sure that it wasn't a race condition of bringing up the
bond and trying to connect to the quorum node.

I was grep'ing the corosync log for VOTEQ entries and noticed when it
works I see consecutively:
... [VOTEQ ] Sending quorum callback, quorate = 0
... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
When it does not work I never see the 'Received qdevice...' line in the
log. Is there something else I can look for to find this problem? Some
other test you can think of? Maybe some configuration of the votequorum
service?

>> I could still use some advice with debugging this oddity. Or have I
>> used up my quota of questions this year :-)
>>
>> -John
>>>>
>>>> Starting from a situation such as this, your only hope is to
>>>> rebuild the cluster from scratch, IMHO.
>>>>
>>>> Antony.
>>>>
>>>> --
>>>> Police have found a cartoonist dead in his house. They say that
>>>> details are currently sketchy.
>>>>
>>>> Please reply to the list; please *don't* CC me.
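
The unit-ordering check mentioned above (confirming that corosync waits for `network-online.target` instead of sleeping) can be sketched as follows. This is a deliberately simple line parser, not a full systemd unit-file parser: it ignores sections, line continuations and drop-in resets. The function names are hypothetical. The input is assumed to be the text of the unit, for example as printed by `systemctl cat corosync.service`.

```python
def unit_dependencies(unit_text, keys=("Requires", "Wants", "After")):
    """Collect the space-separated values of the given unit-file keys."""
    found = {key: [] for key in keys}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers such as "[Unit]".
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in found:
            found[key].extend(value.split())
    return found


def waits_for_network(unit_text, target="network-online.target"):
    """True if the unit both pulls in the target and is ordered after it."""
    deps = unit_dependencies(unit_text)
    pulled_in = target in deps["Requires"] or target in deps["Wants"]
    return pulled_in and target in deps["After"]
```

Both halves matter: `Requires=` (or `Wants=`) pulls the target into the boot transaction, while `After=` is what actually delays the start until the target is reached.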