Re: [ClusterLabs] Cannot add a node with pcs
Hi Piotr,

Sorry for the delay.

I'm not a pacemaker expert, so I don't really know how pacemaker behaves in various corner cases. Even if I were, it would be difficult to advise you, since you haven't posted which versions of pacemaker / corosync / pcs you are using.

In any case, the first thing you need to do is configure stonith. Properly configured and working stonith is required for a cluster to operate. There is no way around it.

Regards,
Tomas

On 13. 07. 22 at 18:54, Piotr Szafarczyk wrote:
> [...]

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Cannot add a node with pcs
Hi Tomas,

Thank you very much for the idea.

I have played with stonith_admin --unfence and --confirm. Whenever I try, pcs status shows my actions under Failed Fencing Actions. I see this in the log file:

    error: Unfencing of n2 by failed: No such device

No surprise here, since I have not got any devices registered.

If fencing of n2 were the cause, I would expect pcs status to show it as offline or unhealthy, but at least to show it. Instead I have got:

    * 2 nodes configured

Also, I would expect node remove + node clear + node add to make n2 a brand new node.

Here are parts of the log from when I remove n2 from the cluster:

    No peers with id=0 and/or uname=n2 to purge from the membership cache
    Removing all n2 attributes for peer n3
    Removing all n2 attributes for peer n1
    Instructing peers to remove references to node n2/0
    Completed cib_delete operation for section status: OK

There is nothing in the log file when I add it.

If fencing is the cause, where should I look for what the cluster tries to do? Have you got any other suggestions for what to check?

Best regards,
Piotr

On 12.07.2022 12:50, Tomas Jelinek wrote:
> [...]
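One way to narrow down what the cluster attempted for n2 is to filter the pacemaker log for lines that mention both fencing and the node. The following is only a minimal sketch: the function name `fencing_lines` and the keyword pattern are illustrative assumptions, not part of pacemaker or pcs, and the sample lines are the log excerpts quoted in the message above.

```python
import re

# Assumption: fencing-related log messages contain "fenc" (fence, fencing,
# unfencing) or "stonith". This is a heuristic, not an official message list.
FENCING_PATTERN = re.compile(r"fenc|stonith", re.IGNORECASE)


def fencing_lines(log_lines, node):
    """Return the lines that mention both fencing and the given node name."""
    node_pattern = re.compile(r"\b" + re.escape(node) + r"\b")
    return [
        line.rstrip("\n")
        for line in log_lines
        if FENCING_PATTERN.search(line) and node_pattern.search(line)
    ]


# Log excerpts copied from the message above.
SAMPLE_LOG = [
    "error: Unfencing of n2 by failed: No such device",
    "No peers with id=0 and/or uname=n2 to purge from the membership cache",
    "Removing all n2 attributes for peer n3",
    "Removing all n2 attributes for peer n1",
    "Instructing peers to remove references to node n2/0",
    "Completed cib_delete operation for section status: OK",
]

for line in fencing_lines(SAMPLE_LOG, "n2"):
    print(line)
```

The function accepts any iterable of lines, so an open log file object can be passed in directly; the location of the pacemaker log differs between distributions and versions and is deliberately not hard-coded here.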
Re: [ClusterLabs] Cannot add a node with pcs
Hi Piotr,

Based on the 'pcs cluster node add n2' and 'pcs config' outputs, pcs added the node to your cluster successfully; that is, the corosync config has been modified, distributed and loaded. It looks like the problem is with pacemaker.

This is a wild guess, but maybe pacemaker wants to fence n2, which is not possible, as you disabled stonith. In the meantime, n1 and n3 do not allow n2 to join until it's confirmed fenced. Try looking into / posting 'pcs status --full' and the pacemaker log.

With stonith disabled, you have a working cluster (seemingly). Until you don't, due to an event which requires working stonith for the cluster to recover.

Regards,
Tomas

On 12. 07. 22 at 12:34, Piotr Szafarczyk wrote:
> Hi,
>
> I used to have a working cluster with 3 nodes (and stonith disabled). After an unexpected restart of one node, the cluster split. Node #2 started to see the others as unclean. Nodes #1 and #3 were cooperating with each other, showing #2 as offline. There were no network connection problems.
>
> I removed #2 (operating from #1) with
>
>     pcs cluster node remove n2
>
> I verified that it had removed all configuration from #2, both for corosync and for pacemaker. The cluster looks like it is working correctly with two nodes (and no traces of #2).
>
> Now I am trying to add the third node back.
>
>     pcs cluster node add n2
>     Disabling SBD service...
>     n2: sbd disabled
>     Sending 'corosync authkey', 'pacemaker authkey' to 'n2'
>     n2: successful distribution of the file 'corosync authkey'
>     n2: successful distribution of the file 'pacemaker authkey'
>     Sending updated corosync.conf to nodes...
>     n3: Succeeded
>     n2: Succeeded
>     n1: Succeeded
>     n3: Corosync configuration reloaded
>
> I am able to start #2 operating from #1
>
>     pcs cluster pcsd-status
>     n2: Online
>     n3: Online
>     n1: Online
>     pcs cluster enable n2
>     pcs cluster start n2
>
> I can see that corosync's configuration has been updated, but pacemaker's has not.
>
> _Checking from #1:_
>
>     pcs config
>     Cluster Name: n
>     Corosync Nodes:
>      n1 n3 n2
>     Pacemaker Nodes:
>      n1 n3
>     [...]
>
>     pcs status
>     * 2 nodes configured
>     Node List:
>     * Online: [ n1 n3 ]
>     [...]
>
>     pcs cluster cib scope=nodes
>
> _#2 is seeing the state differently:_
>
>     pcs config
>     Cluster Name: n
>     Corosync Nodes:
>      n1 n3 n2
>     Pacemaker Nodes:
>      n1 n2 n3
>
>     pcs status
>     * 3 nodes configured
>     Node List:
>     * Online: [ n2 ]
>     * OFFLINE: [ n1 n3 ]
>     Full List of Resources:
>     * No resources
>     [...]
>
> (there are resources configured on #1 and #3)
>
>     pcs cluster cib scope=nodes
>
> Help me diagnose it please. Where should I look for the problem?
>
> (I have already tried a few more things: I see nothing helpful in the log files, pcs --debug shows nothing suspicious, and I even tried editing the CIB manually.)
>
> Best regards,
> Piotr Szafarczyk
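The mismatch in the original report (n2 is listed under Corosync Nodes on every node, but missing from Pacemaker Nodes as seen from #1) can be checked mechanically when comparing views from several nodes. The sketch below parses saved 'pcs config' text; the helper names `parse_node_lists` and `missing_from_pacemaker` are hypothetical, and the parser assumes the layout in which node names appear after the header or on indented lines beneath it, which may differ between pcs versions.

```python
def parse_node_lists(pcs_config_text):
    """Extract the Corosync and Pacemaker node lists from `pcs config` text."""
    lists = {"Corosync Nodes": [], "Pacemaker Nodes": []}
    current = None
    for line in pcs_config_text.splitlines():
        stripped = line.strip()
        header, sep, rest = stripped.partition(":")
        if sep and header in lists:
            # A header line; names may follow on the same line.
            current = header
            lists[current].extend(rest.split())
        elif current and line[:1].isspace() and stripped:
            # An indented continuation line carrying node names.
            lists[current].extend(stripped.split())
        else:
            current = None
    return lists


def missing_from_pacemaker(pcs_config_text):
    """Return nodes that corosync knows about but pacemaker does not."""
    lists = parse_node_lists(pcs_config_text)
    known = set(lists["Pacemaker Nodes"])
    return [node for node in lists["Corosync Nodes"] if node not in known]


# The two views quoted in the original report.
VIEW_FROM_N1 = """Cluster Name: n
Corosync Nodes:
 n1 n3 n2
Pacemaker Nodes:
 n1 n3
"""

VIEW_FROM_N2 = """Cluster Name: n
Corosync Nodes:
 n1 n3 n2
Pacemaker Nodes:
 n1 n2 n3
"""

print(missing_from_pacemaker(VIEW_FROM_N1))  # ['n2']
print(missing_from_pacemaker(VIEW_FROM_N2))  # []
```

A non-empty result on one node and an empty result on another matches the symptom described above: corosync membership is consistent, while the pacemaker views differ.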