Re: [ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart

2022-08-01 Thread john tillman
> Hi,
>
> On 01/08/2022 16:18, john tillman wrote:
>> "john tillman" wrote on 29.07.2022 at 22:51:
>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>
>>> I have a two-node cluster setup with a qdevice. 'pcs quorum status' from
>>> a cluster node shows the qdevice casting a vote.  On the qdevice node
>>> 'corosync-qnetd-tool -s' says I have 2 connected clients and 1 cluster.
>>> The vote count looks correct when I shut down either one of the cluster
>>> nodes or the qdevice.  So the voting seems to be working at this point.
>>
>> Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore
>> everything still awake knows what's going on.
>>
>>> From this state, if I reboot both my cluster nodes at the same time
>>
>> Ugh!
>>
>>> but leave the qdevice node running, the cluster will not see the qdevice
>>> when the nodes come back up: 'pcs quorum status' shows 3 votes expected
>>> but only 2 votes cast (from the cluster nodes).
>>
>> I would think this is to be expected, since if you reboot 2 out of 3
>> nodes, you completely lose quorum, so the single node left has no idea
>> what to trust when the other nodes return.
>
> No, no.  I do have quorum after the reboots.  It is courtesy of the 2
> cluster nodes casting their quorum votes.  However, the qdevice is not
> casting a vote so I am down to 2 out of 3 nodes.
>
> And the qdevice is not part of the cluster.  It will never have any
> resources running on it.  Its job is just to vote.
>
> ‑John
>

 I thought maybe the problem was that the network wasn't ready when
 corosync.service started so I forced a "ExecStartPre=/usr/bin/sleep 10"
 into it but that didn't change anything.
>>>
>>> This type of fix is broken anyway: You are not delaying, you are
>>> waiting for an event (network up).
>>> Basically the OS distribution should have configured it correctly
>>> already.
>>>
>>> In SLES15 there is:
>>> Requires=network-online.target
>>> After=network-online.target
>>>
>>
>> Thank you for the response.
>>
>> Yes, I saw that those values were correctly set in the service
>> configuration file for corosync.  The delay was just a test. I just wanted
>> to make sure that it wasn't a race condition of bringing up the bond and
>> trying to connect to the quorum node.
>>
>> I was grep'ing the corosync log for VOTEQ entries and noticed when it
>> works I see consecutively:
>> ... [VOTEQ ] Sending quorum callback, quorate = 0
>> ... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
>> When it does not work I never see the 'Received qdevice...' line in the log.
>> Is there something else I can look for to find this problem?  Some other
>> test you can think of?  Maybe some configuration of the votequorum
>> service?
>
> Maybe a good start is to get the cluster into the "non-working" qdevice
> state and then paste:
> - /var/log/messages entries from corosync/qdevice
> - output of `corosync-qdevice-tool -sv` (from the nodes) and
> `corosync-qnetd-tool -lv` (from the machine where qnetd is running)
>
> "Received qdevice op 1 req from node 1 [QDevice]" means that qdevice is
> registered (= corosync-qdevice was started). If the line is really missing,
> it can mean corosync-qdevice is not running. The log, or running
> `corosync-qdevice -f -d`, should give some insight into why it is not running.
>
> Honza
>
>

My corosync-qdevice service was not enabled at boot.  Sigh.
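
For anyone hitting the same symptom, here is a minimal sketch of checking
this on each cluster node. It assumes systemd and the unit names corosync
and corosync-qdevice; the function name check_enabled_at_boot is made up
for illustration. It only reports and changes nothing on the system.

```shell
#!/bin/sh
# Report whether each given unit is enabled at boot.
# For a unit that is not enabled, print the command that would fix it.
check_enabled_at_boot() {
    rc=0
    for unit in "$@"; do
        # `systemctl is-enabled` exits non-zero for a disabled unit.
        if systemctl is-enabled "$unit" >/dev/null 2>&1; then
            echo "$unit: enabled"
        else
            echo "$unit: NOT enabled (fix: systemctl enable --now $unit)"
            rc=1
        fi
    done
    return "$rc"
}
```

Example call: `check_enabled_at_boot corosync corosync-qdevice`. A non-zero
return status means at least one of the units would not start after a reboot.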

Thank you Honza for pointing that out!  And thank you all for your
patience and attention.

John

>>
>>

 I could still use some advice with debugging this oddity.  Or have I used
 up my quota of questions this year :-)

 ‑John

>>
>> Starting from a situation such as this, your only hope is to rebuild
>> the cluster from scratch, IMHO.
>>
>>
>> Antony.
>>
>> ‑‑
>> Police have found a cartoonist dead in his house.  They say that
>> details are currently sketchy.
>>
>> Please reply to the list;
>>   please *don't* CC me.
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart

2022-08-01 Thread Jan Friesse

Hi,

On 01/08/2022 16:18, john tillman wrote:

"john tillman" wrote on 29.07.2022 at 22:51:

On Thursday 28 July 2022 at 22:17:01, john tillman wrote:


I have a two-node cluster setup with a qdevice. 'pcs quorum status' from a
cluster node shows the qdevice casting a vote.  On the qdevice node
'corosync-qnetd-tool -s' says I have 2 connected clients and 1 cluster.
The vote count looks correct when I shut down either one of the cluster
nodes or the qdevice.  So the voting seems to be working at this point.


Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore
everything still awake knows what's going on.


 From this state, if I reboot both my cluster nodes at the same time


Ugh!


but leave the qdevice node running, the cluster will not see the qdevice
when the nodes come back up: 'pcs quorum status' shows 3 votes expected
but only 2 votes cast (from the cluster nodes).


I would think this is to be expected, since if you reboot 2 out of 3
nodes, you completely lose quorum, so the single node left has no idea
what to trust when the other nodes return.


No, no.  I do have quorum after the reboots.  It is courtesy of the 2
cluster nodes casting their quorum votes.  However, the qdevice is not
casting a vote so I am down to 2 out of 3 nodes.

And the qdevice is not part of the cluster.  It will never have any
resources running on it.  Its job is just to vote.
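
The vote arithmetic behind this can be sketched as follows. It assumes
votequorum's default majority rule (quorum = expected_votes / 2 + 1, with
integer division) and no special options such as two_node; the function
names are illustrative only.

```shell
#!/bin/sh
# Majority threshold for a given number of expected votes.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# usage: is_quorate EXPECTED_VOTES VOTES_CAST
# Succeeds when the votes cast reach the majority threshold.
is_quorate() {
    [ "$2" -ge "$(quorum_threshold "$1")" ]
}
```

With 3 expected votes the threshold is 2, so `is_quorate 3 2` succeeds: the
two cluster nodes alone are quorate, but the vote that should come from the
qdevice is missing and there is no margin for losing a node.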

‑John



I thought maybe the problem was that the network wasn't ready when
corosync.service started so I forced a "ExecStartPre=/usr/bin/sleep 10"
into it but that didn't change anything.


This type of fix is broken anyway: You are not delaying, you are waiting
for an event (network up).
Basically the OS distribution should have configured it correctly already.

In SLES15 there is:
Requires=network-online.target
After=network-online.target



Thank you for the response.

Yes, I saw that those values were correctly set in the service
configuration file for corosync.  The delay was just a test. I just wanted
to make sure that it wasn't a race condition of bringing up the bond and
trying to connect to the quorum node.

I was grep'ing the corosync log for VOTEQ entries and noticed when it
works I see consecutively:
... [VOTEQ ] Sending quorum callback, quorate = 0
... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
When it does not work I never see the 'Received qdevice...' line in the log.
Is there something else I can look for to find this problem?  Some other
test you can think of?  Maybe some configuration of the votequorum
service?


Maybe a good start is to get the cluster into the "non-working" qdevice
state and then paste:

- /var/log/messages entries from corosync/qdevice
- output of `corosync-qdevice-tool -sv` (from the nodes) and
`corosync-qnetd-tool -lv` (from the machine where qnetd is running)


"Received qdevice op 1 req from node 1 [QDevice]" means that qdevice is
registered (= corosync-qdevice was started). If the line is really missing,
it can mean corosync-qdevice is not running. The log, or running
`corosync-qdevice -f -d`, should give some insight into why it is not running.
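
The log check described above could be scripted roughly like this. It is
only a sketch: the function name is hypothetical, the log path is passed in
because it differs between systems, and the match string is taken from the
log line quoted in this thread.

```shell
#!/bin/sh
# usage: check_qdevice_registration LOGFILE
# Succeeds if the log contains the votequorum line that shows
# corosync-qdevice registered with corosync.
check_qdevice_registration() {
    if grep -qF 'Received qdevice op 1 req' "$1"; then
        echo "registered"
    else
        echo "not registered"
        return 1
    fi
}
```

Example call: `check_qdevice_registration /var/log/messages`. If it reports
"not registered" after a reboot, the next thing to look at is whether
corosync-qdevice is running at all.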


Honza







I could still use some advice with debugging this oddity.  Or have I used
up my quota of questions this year :-)

‑John



Starting from a situation such as this, your only hope is to rebuild
the cluster from scratch, IMHO.


Antony.

‑‑
Police have found a cartoonist dead in his house.  They say that
details are currently sketchy.

Please reply to the list;
  please *don't* CC me.

Re: [ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart

2022-08-01 Thread john tillman
> "john tillman" wrote on 29.07.2022 at 22:51:
>>> > On Thursday 28 July 2022 at 22:17:01, john tillman wrote:

> I have a two-node cluster setup with a qdevice. 'pcs quorum status' from a
> cluster node shows the qdevice casting a vote.  On the qdevice node
> 'corosync-qnetd-tool -s' says I have 2 connected clients and 1 cluster.
> The vote count looks correct when I shut down either one of the cluster
> nodes or the qdevice.  So the voting seems to be working at this point.

 Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore
 everything still awake knows what's going on.

> From this state, if I reboot both my cluster nodes at the same time

 Ugh!

> but leave the qdevice node running, the cluster will not see the qdevice
> when the nodes come back up: 'pcs quorum status' shows 3 votes expected
> but only 2 votes cast (from the cluster nodes).

 I would think this is to be expected, since if you reboot 2 out of 3
 nodes, you completely lose quorum, so the single node left has no idea
 what to trust when the other nodes return.
>>>
>>> No, no.  I do have quorum after the reboots.  It is courtesy of the 2
>>> cluster nodes casting their quorum votes.  However, the qdevice is not
>>> casting a vote so I am down to 2 out of 3 nodes.
>>>
>>> And the qdevice is not part of the cluster.  It will never have any
>>> resources running on it.  Its job is just to vote.
>>>
>>> ‑John
>>>
>>
>> I thought maybe the problem was that the network wasn't ready when
>> corosync.service started so I forced a "ExecStartPre=/usr/bin/sleep 10"
>> into it but that didn't change anything.
>
> This type of fix is broken anyway: You are not delaying, you are waiting
> for an event (network up).
> Basically the OS distribution should have configured it correctly already.
>
> In SLES15 there is:
> Requires=network-online.target
> After=network-online.target
>

Thank you for the response.

Yes, I saw that those values were correctly set in the service
configuration file for corosync.  The delay was just a test. I just wanted
to make sure that it wasn't a race condition of bringing up the bond and
trying to connect to the quorum node.
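
A quick way to confirm that ordering on a node, rather than reading the unit
file by eye, is sketched below. It assumes systemd; the function name is
made up for illustration, and it only inspects the After= property that
`systemctl show` reports for the unit.

```shell
#!/bin/sh
# usage: waits_for_network_online UNIT
# Succeeds if UNIT is ordered after network-online.target.
waits_for_network_online() {
    # `systemctl show -p After UNIT` prints one line of the form
    # "After=<space-separated unit names>"; split it into one name per line.
    systemctl show -p After "$1" | tr ' =' '\n\n' \
        | grep -qx 'network-online.target'
}
```

Example call: `waits_for_network_online corosync.service && echo ordered`.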

I was grep'ing the corosync log for VOTEQ entries and noticed when it
works I see consecutively:
... [VOTEQ ] Sending quorum callback, quorate = 0
... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
When it does not work I never see the 'Received qdevice...' line in the log.
Is there something else I can look for to find this problem?  Some other
test you can think of?  Maybe some configuration of the votequorum
service?


>>
>> I could still use some advice with debugging this oddity.  Or have I used
>> up my quota of questions this year :-)
>>
>> ‑John
>>

 Starting from a situation such as this, your only hope is to rebuild
 the cluster from scratch, IMHO.


 Antony.

 ‑‑
 Police have found a cartoonist dead in his house.  They say that
 details are currently sketchy.

 Please reply to the list;
   please *don't* CC me.