Hi.

Have you considered using pacemaker-remote instead?
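With pacemaker-remote you keep a small core of full corosync/pacemaker nodes and attach the remaining hosts as remote nodes, so they can run resources without being part of the corosync/totem membership that is struggling at 32 nodes. A rough sketch of the setup (host names are placeholders; package and service names as on RHEL/CentOS):

  # on the host to be attached as a remote node
  yum install pacemaker-remote resource-agents pcs
  # copy /etc/pacemaker/authkey from an existing cluster node and open TCP port 3121
  systemctl enable --now pacemaker_remote

  # on one of the full cluster nodes, integrate it as a remote node
  pcs resource create node10-remote ocf:pacemaker:remote server=node10.example.com reconnect_interval=60

Resources can then be located on node10-remote like on any other node.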


On May 18, 2021 5:55:57 PM S Sathish S <s.s.sath...@ericsson.com> wrote:
Hi Team,

We have set up a 32-node Pacemaker cluster. Each node runs 10 resources, so around 300+ components in total are up and running. The issue below occurs while performing installation/update involving the following tasks.

From the first node we add the remaining 31 nodes into the cluster one by one and add the resources for each node. In some use cases we run pcs commands to stop/start resources in parallel across all nodes. If a network-related change is needed on a node, we put the cluster into maintenance mode and disable maintenance mode once the network change is done. In some cases we also reboot the nodes one by one so that kernel/application changes take effect.
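Roughly, the pcs command sequence involved is the following (node and resource names are placeholders, resource-creation options omitted):

  # from node1, add the remaining nodes one by one
  pcs cluster node add node2
  ...
  pcs cluster node add node32

  # create the ~10 resources per node (agent and options omitted here)
  pcs resource create app1_node2 <agent> ...

  # stop/start resources, in some use cases in parallel across nodes
  pcs resource disable app1_node2
  pcs resource enable app1_node2

  # around a network-related change on a node
  pcs property set maintenance-mode=true
  pcs property set maintenance-mode=false --wait=240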

Up to a 9-node cluster this worked fine for us and we did not see the issue reported below. On the 32-node cluster we hit the following errors whenever we perform an installation/upgrade that executes the tasks above.

Please find the corosync logs from the problematic period, with the following error messages:

May 17 08:08:47 [1978] node1 corosync notice [TOTEM ] A new membership (10.61.78.50:85864) was formed. Members left: 2 16 17 31 15 12 13 14 27 28 29 30 20 32 18 7 22 19 24 25 10 5 6 26 23 21 11 3 4
May 17 08:08:47 [1978] node1 corosync notice [TOTEM ] Failed to receive the leave message. failed: 2 16 17 31 15 12 13 14 27 28 29 30 20 32 18 7 22 19 24 25 10 5 6 26 23 21 11 3 4
May 17 08:08:47 [1978] node1 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 17 08:08:47 [1978] node1  corosync notice  [QUORUM] Members[1]: 1
May 17 08:08:47 [1978] node1 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 17 11:17:30 [1866] node1 corosync notice [MAIN ] Corosync Cluster Engine ('UNKNOWN'): started and ready to provide service.
May 17 11:17:30 [1866] node1 corosync info [MAIN ] Corosync built-in features: pie relro bindnow
May 17 11:17:30 [1866] node1 corosync warning [MAIN ] Could not set SCHED_RR at priority 99: Operation not permitted (1)
May 17 11:17:30 [1866] node1 corosync notice [TOTEM ] Initializing transport (UDP/IP Unicast).
May 17 11:17:30 [1866] node1 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
May 17 11:17:30 [1866] node1 corosync notice [TOTEM ] The network interface [10.61.78.50] is now up.
May 17 11:17:30 [1866] node1 corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cmap
May 17 11:17:30 [1866] node1 corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cfg
May 17 11:17:30 [1866] node1 corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cpg
May 17 11:17:30 [1866] node1 corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
May 17 11:17:30 [1866] node1 corosync notice [QUORUM] Using quorum provider corosync_votequorum
May 17 11:17:30 [1866] node1 corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 17 11:17:30 [1866] node1  corosync info    [QB    ] server name: votequorum
May 17 11:17:30 [1866] node1 corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 17 11:17:30 [1866] node1  corosync info    [QB    ] server name: quorum

Logs from another node:
May 18 16:20:17 [1968] node2 corosync notice [TOTEM ] A new membership (10.223.106.11:104056) was formed. Members left: 2 16 17 31 15 12 1 13 14 27 28 29 30 20 7 22 8 9 19 24 25 10 5 6 26 23 11 3 4
May 18 16:20:17 [1968] node2 corosync notice [TOTEM ] Failed to receive the leave message. failed: 2 16 17 31 15 12 1 13 14 27 28 29 30 20 7 22 8 9 19 24 25 10 5 6 26 23 11 3 4
May 18 16:20:17 [1968] node2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 18 16:20:17 [1968] node2 corosync notice  [QUORUM] Members[1]: 32
May 18 16:20:17 [1968] node2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 18 16:22:20 [1968] node2 corosync notice [TOTEM ] A new membership (10.217.41.26:104104) was formed. Members joined: 27 29 18
May 18 16:22:20 [1968] node2 corosync notice  [QUORUM] Members[4]: 27 29 32 18
May 18 16:22:20 [1968] node2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 18 16:22:45 [1968] node2 corosync notice [TOTEM ] A new membership (10.217.41.26:104112) was formed. Members
May 18 16:22:45 [1968] node2 corosync notice  [QUORUM] Members[4]: 27 29 32 18
May 18 16:22:45 [1968] node2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 18 16:22:46 [1968] node2 corosync notice [TOTEM ] A new membership (10.217.41.26:104116) was formed. Members joined: 30
May 18 16:22:46 [1968] node2 corosync notice [QUORUM] Members[5]: 27 29 30 32 18

Any pcs command then fails with the following error message on all nodes:
[root@node1 online]# pcs property set maintenance-mode=false --wait=240
Error: Unable to update cib
Call cib_replace failed (-62): Timer expired
[root@node1 online]#

Workaround: we power off all nodes and bring them back up one by one to recover from the above problem. Kindly check this error message and provide us with an RCA for the problem.
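For reference, the recovery sequence is roughly the following (assuming the cluster services are not set to start automatically at boot):

  # power off all 32 nodes, then power them on one at a time;
  # on each node once it is back up:
  pcs cluster start

Once all nodes have rejoined, the cluster behaves normally again.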


Current component versions:
pacemaker-2.0.2 -->
https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
corosync-2.4.4 -->
https://github.com/corosync/corosync/tree/v2.4.4
pcs-0.9.169

Thanks and Regards,
S Sathish S
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
