Re: [ClusterLabs] Antw: [EXT] Cannot add a node with pcs

2022-07-13 Thread Piotr Szafarczyk

Hi Ulrich,

Thank you. I am perfectly aware that operating without stonith is not a 
good idea :). I am sure I will add it. But first I need to understand 
the current state. I am afraid of introducing something new before I fix 
the current problem.


Best regards,
Piotr

On 13.07.2022 08:00, Ulrich Windl wrote:

Piotr Szafarczyk wrote on 12.07.2022 at 12:34 in message <38ccc24a-7b01-561c-20f8-ec2273a18...@netexpert.pl>:

Hi,

I used to have a working cluster with 3 nodes (and stonith disabled).

The SLES guide says:
Important: No Support Without STONITH
You must have a node fencing mechanism for your cluster.
The global cluster options stonith-enabled and startup-fencing must be
set to true. When you change them, you lose support.

Maybe that helps.
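
To see how the two options named in the guide are currently set, something like the following could be run on a cluster node. This is only a sketch: the function name check_fencing_options is made up for illustration, and it assumes that crm_attribute prints nothing (or fails) when an option has never been set explicitly, which the function reports as "not set explicitly".

```shell
#!/bin/sh
# Print the current value of the two fencing-related cluster options and
# return 1 if either of them is explicitly disabled.
# check_fencing_options is a hypothetical helper name, not a pcs/pacemaker tool.
check_fencing_options() {
    rc=0
    for opt in stonith-enabled startup-fencing; do
        # Query the option from the crm_config section of the CIB.
        # Assumption: an empty result means the option is not set explicitly.
        value=$(crm_attribute --type crm_config --name "$opt" --query --quiet 2>/dev/null) || value=""
        case $(printf '%s' "$value" | tr '[:upper:]' '[:lower:]') in
            "")               echo "$opt is not set explicitly" ;;
            false|off|no|n|0) echo "$opt=$value (fencing-related option disabled)"; rc=1 ;;
            *)                echo "$opt=$value" ;;
        esac
    done
    return "$rc"
}
```

Calling check_fencing_options on n1 should then show at a glance whether stonith-enabled is still false.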


After an unexpected restart of one node, the cluster split. Node #2
started to see the others as unclean. Nodes 1 and 3 were cooperating
with each other, showing #2 as offline. There were no network connection
problems.

I removed #2 (operating from #1) with
pcs cluster node remove n2

I verified that it had removed all configuration from #2, both for
corosync and for pacemaker. The cluster appears to be working correctly
with two nodes (and no traces of #2).

Now I am trying to add the third node back.
pcs cluster node add n2
Disabling SBD service...
n2: sbd disabled
Sending 'corosync authkey', 'pacemaker authkey' to 'n2'
n2: successful distribution of the file 'corosync authkey'
n2: successful distribution of the file 'pacemaker authkey'
Sending updated corosync.conf to nodes...
n3: Succeeded
n2: Succeeded
n1: Succeeded
n3: Corosync configuration reloaded

I am able to start #2 operating from #1

pcs cluster pcsd-status
n2: Online
n3: Online
n1: Online

pcs cluster enable n2
pcs cluster start n2

I can see that corosync's configuration has been updated, but
pacemaker's not.

_Checking from #1:_

pcs config
Cluster Name: n
Corosync Nodes:
   n1 n3 n2
Pacemaker Nodes:
   n1 n3
[...]

pcs status
* 2 nodes configured
Node List:
* Online: [ n1 n3 ]
[...]

pcs cluster cib scope=nodes





_#2 is seeing the state differently:_

pcs config
Cluster Name: n
Corosync Nodes:
   n1 n3 n2
Pacemaker Nodes:
   n1 n2 n3

pcs status
* 3 nodes configured
Node List:
* Online: [ n2 ]
* OFFLINE: [ n1 n3 ]
Full List of Resources:
* No resources
[...]
(there are resources configured on #1 and #3)

pcs cluster cib scope=nodes






Help me diagnose it please. Where should I look for the problem? (I have
already tried a few things more - I see nothing helpful in log files,
pcs --debug shows nothing suspicious, tried even editing the CIB manually)

Best regards,

Piotr Szafarczyk




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



Re: [ClusterLabs] Cannot add a node with pcs

2022-07-13 Thread Piotr Szafarczyk

Hi Tomas,

Thank you very much for the idea. I have played with stonith_admin 
--unfence and --confirm. Whenever I try, pcs status shows my actions 
under Failed Fencing Actions. I see this in the log file:


error: Unfencing of n2 by  failed: No such device

No surprise here, since I have not got any devices registered.

If fencing of n2 were the cause, I would expect pcs status to list it, 
at least as offline or unhealthy. Instead I get:


  * 2 nodes configured

Also I would expect node remove + node clear + node add to make n2 a 
brand new node.


Here are parts of the log when I remove n2 from the cluster

No peers with id=0 and/or uname=n2 to purge from the membership cache
Removing all n2 attributes for peer n3
Removing all n2 attributes for peer n1
Instructing peers to remove references to node n2/0
Completed cib_delete operation for section status: OK

There is nothing in the log file when I add it.

If fencing is the cause, where should I look for what the cluster tries 
to do?


Do you have any other suggestions for what to check?

Best regards,
Piotr

On 12.07.2022 12:50, Tomas Jelinek wrote:

Hi Piotr,

Based on the 'pcs cluster node add n2' and 'pcs config' outputs, pcs added 
the node to your cluster successfully; that is, the corosync config has 
been modified, distributed and loaded.


It looks like the problem is with pacemaker. This is a wild guess, but 
maybe pacemaker wants to fence n2, which is not possible, as you 
disabled stonith. In the meantime, n1 and n3 do not allow n2 to join, 
until it's confirmed fenced. Try looking into / posting 'pcs status 
--full' and pacemaker log.
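
When going through the pacemaker log for this, it may help to narrow it down to fencing-related lines that mention the node. A minimal sketch follows; the function name fencing_lines and the keyword list are only illustrative, and the log path in the usage note is an assumption that may differ per distribution.

```shell
#!/bin/sh
# Read log lines on stdin and print only those that mention the given node
# together with a fencing-related keyword (fence/unfence, stonith, unclean).
# fencing_lines is a hypothetical helper name; the keyword list is a guess
# at what is relevant and may need extending.
fencing_lines() {
    node="$1"
    # First keep fencing-related lines, then keep those naming the node.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iE 'fenc|stonith|unclean' | grep -F -- "$node" || true
}
```

For example: fencing_lines n2 < /var/log/pacemaker/pacemaker.log (path assumed), run on n1 and n3 as well as on n2.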


With stonith disabled, you have a working cluster (seemingly). Until 
you don't, due to an event which requires working stonith for the 
cluster to recover.


Regards,
Tomas


On 12. 07. 22 at 12:34, Piotr Szafarczyk wrote:

Hi,

I used to have a working cluster with 3 nodes (and stonith disabled). 
After an unexpected restart of one node, the cluster split. Node 
#2 started to see the others as unclean. Nodes 1 and 3 were 
cooperating with each other, showing #2 as offline. There were no 
network connection problems.


I removed #2 (operating from #1) with
pcs cluster node remove n2

I verified that it had removed all configuration from #2, both for 
corosync and for pacemaker. The cluster appears to be working correctly 
with two nodes (and no traces of #2).


Now I am trying to add the third node back.
pcs cluster node add n2
Disabling SBD service...
n2: sbd disabled
Sending 'corosync authkey', 'pacemaker authkey' to 'n2'
n2: successful distribution of the file 'corosync authkey'
n2: successful distribution of the file 'pacemaker authkey'
Sending updated corosync.conf to nodes...
n3: Succeeded
n2: Succeeded
n1: Succeeded
n3: Corosync configuration reloaded

I am able to start #2 operating from #1

pcs cluster pcsd-status
   n2: Online
   n3: Online
   n1: Online

pcs cluster enable n2
pcs cluster start n2

I can see that corosync's configuration has been updated, but 
pacemaker's not.


_Checking from #1:_

pcs config
Cluster Name: n
Corosync Nodes:
  n1 n3 n2
Pacemaker Nodes:
  n1 n3
[...]

pcs status
   * 2 nodes configured
Node List:
   * Online: [ n1 n3 ]
[...]

pcs cluster cib scope=nodes

   
   


_#2 is seeing the state differently:_

pcs config
Cluster Name: n
Corosync Nodes:
  n1 n3 n2
Pacemaker Nodes:
  n1 n2 n3

pcs status
   * 3 nodes configured
Node List:
   * Online: [ n2 ]
   * OFFLINE: [ n1 n3 ]
Full List of Resources:
   * No resources
[...]
(there are resources configured on #1 and #3)

pcs cluster cib scope=nodes

   
   
   


Help me diagnose it please. Where should I look for the problem? (I 
have already tried a few things more - I see nothing helpful in log 
files, pcs --debug shows nothing suspicious, tried even editing the 
CIB manually)


Best regards,

Piotr Szafarczyk




[ClusterLabs] Cannot add a node with pcs

2022-07-12 Thread Piotr Szafarczyk

Hi,

I used to have a working cluster with 3 nodes (and stonith disabled). 
After an unexpected restart of one node, the cluster split. Node #2 
started to see the others as unclean. Nodes 1 and 3 were cooperating 
with each other, showing #2 as offline. There were no network connection 
problems.


I removed #2 (operating from #1) with
pcs cluster node remove n2

I verified that it had removed all configuration from #2, both for 
corosync and for pacemaker. The cluster appears to be working correctly 
with two nodes (and no traces of #2).


Now I am trying to add the third node back.
pcs cluster node add n2
Disabling SBD service...
n2: sbd disabled
Sending 'corosync authkey', 'pacemaker authkey' to 'n2'
n2: successful distribution of the file 'corosync authkey'
n2: successful distribution of the file 'pacemaker authkey'
Sending updated corosync.conf to nodes...
n3: Succeeded
n2: Succeeded
n1: Succeeded
n3: Corosync configuration reloaded

I am able to start #2 operating from #1

pcs cluster pcsd-status
  n2: Online
  n3: Online
  n1: Online

pcs cluster enable n2
pcs cluster start n2

I can see that corosync's configuration has been updated, but 
pacemaker's not.


_Checking from #1:_

pcs config
Cluster Name: n
Corosync Nodes:
 n1 n3 n2
Pacemaker Nodes:
 n1 n3
[...]

pcs status
  * 2 nodes configured
Node List:
  * Online: [ n1 n3 ]
[...]

pcs cluster cib scope=nodes

  
  


_#2 is seeing the state differently:_

pcs config
Cluster Name: n
Corosync Nodes:
 n1 n3 n2
Pacemaker Nodes:
 n1 n2 n3

pcs status
  * 3 nodes configured
Node List:
  * Online: [ n2 ]
  * OFFLINE: [ n1 n3 ]
Full List of Resources:
  * No resources
[...]
(there are resources configured on #1 and #3)

pcs cluster cib scope=nodes

  
  
  


Help me diagnose it please. Where should I look for the problem? (I have 
already tried a few things more - I see nothing helpful in log files, 
pcs --debug shows nothing suspicious, tried even editing the CIB manually)
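
The mismatch between the Corosync and Pacemaker node lists shown above can also be checked mechanically on each node. The following is a sketch only: missing_from_pacemaker is a made-up helper name, and it assumes the 'pcs config' layout quoted in this message (a header line followed by one line of space-separated node names).

```shell
#!/bin/sh
# Read `pcs config` output on stdin and print every node that appears under
# "Corosync Nodes:" but not under "Pacemaker Nodes:", one name per line.
# Assumes each header is followed by a single line of node names.
missing_from_pacemaker() {
    awk '
        /^Corosync Nodes:/  { section = "c"; next }
        /^Pacemaker Nodes:/ { section = "p"; next }
        # The line right after a header carries the node names.
        section == "c" { for (i = 1; i <= NF; i++) coro[$i] = 1; section = ""; next }
        section == "p" { for (i = 1; i <= NF; i++) pace[$i] = 1; section = ""; next }
        END { for (n in coro) if (!(n in pace)) print n }
    ' | sort
}
```

Usage: pcs config | missing_from_pacemaker. With the output quoted from #1 it prints n2; with the output quoted from #2 it prints nothing.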


Best regards,

Piotr Szafarczyk