[ClusterLabs] Guest nodes in a pacemaker cluster
Hi - I have a 3-node master-slave-slave MySQL cluster set up using the corosync/pacemaker stack. Now I want to introduce 4 more slaves to the configuration. However, I do not want these to be part of the quorum or to participate in DC election, etc. Could someone guide me on a recommended approach to do this? Thanks! Prasad. ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
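One way to meet this requirement (not discussed further in this thread) is Pacemaker Remote: remote nodes run only the pacemaker_remote daemon, never corosync, so they cannot vote in quorum or stand for DC election. A minimal sketch, assuming crmsh and a new host called newslave1 (the hostname and package names are placeholders/assumptions):

```shell
# On each new slave host (runs pacemaker_remote only, no corosync):
yum install -y pacemaker-remote resource-agents

# Share the cluster's authentication key with the new host
# (/etc/pacemaker/authkey is the default location on both sides)
scp /etc/pacemaker/authkey newslave1:/etc/pacemaker/authkey
ssh newslave1 'service pacemaker_remote start'   # listens on TCP 3121

# On an existing cluster node, integrate the host as a remote node:
crm configure primitive newslave1 ocf:pacemaker:remote \
    params server=newslave1 \
    op monitor interval=30s
```

Once the newslave1 resource starts, the host appears in crm status as a remote node and can run the MySQL slave resource, while quorum remains a 3-vote calculation among the original cluster nodes.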
Re: [ClusterLabs] Regarding Finalization Timer (I_ELECTION) just popped (1800000ms)
Thank you for the response Ken. Will watch for this to be reproduced again. On Thu, May 16, 2019 at 4:10 AM Ken Gaillot wrote: > Resurrecting a really old thread in case anyone had similar questions. > This arrived at a crazy busy time and got neglected. > > The "finalization timer" is a timeout in the DC election process. The > value is the join-finalization-timeout cluster property (formerly crmd- > finalization-timeout), which defaults to 30 minutes. > > The whole election process is undocumented and quite arcane. It would > be nice to document it but that's a bigger project than there is time > for at the moment. > > The controller (crmd) is implemented as a finite state machine, meaning > various inputs move it from one state to another according to fixed > rules. The finalization timer is started when the "finalize join" state > is reached, and stopped whenever that state is left. This state is > achieved once the (possibly newly elected) DC has received join > requests from all nodes, and is left once the DC has sync'd the CIB to > those nodes, ack'd their join requests, and received confirmations back > from them. > > Obviously it's too late to look at this particular incident, but if it > can be reliably reproduced, I can take a new look. > > On Sun, 2018-10-28 at 23:32 +0530, Prasad Nagaraj wrote: > > Hi : > > > > I came across a strange situation in my cluster few days back. I was > > trying to replace one of the nodes in my 3 node cluster and I removed > > and added that node back following due process as per > > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_removing_a_corosync_node.html > > . Updated corosync.conf on all nodes and restarted corosync and > > pacemaker on all the nodes. The cluster didnt elect a DC and also > > didnt report any activities for almost 28 mins as seen from below > > logs. 
Then I saw this message: > > Oct 22 22:35:14 [76417] vm85c4465533 crmd: info: > > crm_timer_popped:Finalization Timer (I_ELECTION) just popped > > (1800000ms) > > after which I could see further activities happening including DC > > election. > > > > I was not able to understand or identify any reasons for this > > behavior and also there is absolutely no documentation on this > > Finalization timer and what it means. Appreciate any help in terms of > > explaining what exactly this timer means and what could be reasons > > for this behavior. I have pasted a snippet of logs during the time > > here. I do have more logs and also logs from other nodes that I can > > share if required. > > > > Thanks in advance for the help! > > Prasad > > > > > > Oct 22 22:07:46 [76412] vm85c4465533cib: info: > > cib_process_request: Completed cib_modify operation for section > > nodes: OK (rc=0, origin=vm46890219c5/crm_attribute/4, > > version=0.100.0) > > Oct 22 22:07:46 [76412] vm85c4465533cib: info: > > cib_file_write_with_digest: Reading cluster configuration file > > /var/lib/pacemaker/cib/cib.QC3Kay (digest: > > /var/lib/pacemaker/cib/cib.c74mjr) > > Oct 22 22:07:47 [76412] vm85c4465533cib: info: > > cib_file_backup: Archived previous version as > > /var/lib/pacemaker/cib/cib-96.raw > > Oct 22 22:07:47 [76412] vm85c4465533cib: info: > > cib_file_write_with_digest: Wrote version 0.100.0 of the CIB to disk > > (digest: 3ab4bffdc9c372985cfe50ad3100131d) > > Oct 22 22:07:47 [76412] vm85c4465533cib: info: > > cib_file_write_with_digest: Reading cluster configuration file > > /var/lib/pacemaker/cib/cib.7NqOEy (digest: > > /var/lib/pacemaker/cib/cib.gamHhs) > > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > > cib_perform_op: Diff: --- 0.100.0 2 > > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > > cib_perform_op: Diff: +++ 0.101.0 (null) > > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > > cib_perform_op: + /cib: @epoch=101 > > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > > 
cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="ms_mysql_member453" rsc="ms_mysql" score="0" > > node="vm46890219c5"/> > > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > > cib_process_request: Completed cib_apply_diff operation for section > > 'all': OK (rc=0, origin=local/cibadmin/2, version=0.101.0) > > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > > cib_file_backup: Archived previous version as > > /var/lib/pacemaker/cib/cib-97.raw > > Oct 22 22:07:48 [76412] vm85c4465533
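The enter/leave behavior Ken describes — a timer armed when the "finalize join" state is entered, cancelled when it is left, and a pop forcing a new election — can be sketched as a toy shell state machine. This is purely illustrative, not Pacemaker code; a 0.2 s timeout stands in for the 30-minute join-finalization-timeout default:

```shell
STATE_FILE=$(mktemp)

enter_finalize_join() {   # $1 = finalization timeout in seconds
    echo "S_FINALIZE_JOIN" > "$STATE_FILE"
    # Arm the timer on state entry; if it pops first, fall back to election
    ( sleep "$1"; echo "S_ELECTION" > "$STATE_FILE" ) &
    TIMER_PID=$!
}

joins_confirmed() {
    # Leaving the state cancels the timer (CIB synced, join requests acked)
    kill "$TIMER_PID" 2>/dev/null
    echo "S_IDLE" > "$STATE_FILE"
}

# Demo: the join phase never completes, so the (shortened) timer pops
enter_finalize_join 0.2
sleep 0.5
cat "$STATE_FILE"    # S_ELECTION
```

This mirrors the log sequence in the thread: S_FINALIZE_JOIN held until the timer pops, then a transition to S_ELECTION with cause=C_TIMER_POPPED.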
Re: [ClusterLabs] Corosync unable to reach consensus for membership
Hello Jan, >Please block both input and output. Corosync isn't able to handle >byzantine faults. Thanks. It results in a clean partition if I block both outgoing and incoming udp traffic to and from a given node. However, could you suggest the best way to handle real-world production scenarios that may result in just one-way traffic loss? Thanks again. Prasad On Tue, Apr 30, 2019 at 5:26 PM Jan Friesse wrote: > Prasad, > > > Hello : > > > > I have a 3 node corosync and pacemaker cluster and the nodes are: > > Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ] > > > > Full list of resources: > > > > Master/Slave Set: ms_mysql [p_mysql] > > Masters: [ SG-azfw2-189 ] > > Slaves: [ SG-azfw2-190 SG-azfw2-191 ] > > > > For my network partition test, I created a firewall rule on Node > > SG-azfw2-190 to block all incoming udp traffic from node SG-azfw2-189 > > /sbin/iptables -I INPUT -p udp -s 172.19.0.13 -j DROP > > Please block both input and output. Corosync isn't able to handle > byzantine faults. > > Honza > > > > > I dont think corosync is correctly detecting the partition as I am > getting > > different membership information from different nodes. > > On node SG-azfw2-189, I still see the members as: > > > > Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ] > > > > Full list of resources: > > > > Master/Slave Set: ms_mysql [p_mysql] > > Masters: [ SG-azfw2-189 ] > > Slaves: [ SG-azfw2-190 SG-azfw2-191 ] > > > > whereas, on the node SG-azfw2-190, I see membership as > > > > Online: [ SG-azfw2-190 SG-azfw2-191 ] > > OFFLINE: [ SG-azfw2-189 ] > > > > Full list of resources: > > > > Master/Slave Set: ms_mysql [p_mysql] > > Slaves: [ SG-azfw2-190 SG-azfw2-191 ] > > Stopped: [ SG-azfw2-189 ] > > > > I expected that on node SG-azfw2-189, it should have detected that other > 2 > > nodes have left. In the corosync logs for this node, I continuously see > the > > below messages: > > Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4. 
> > Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the > > rep. > > Apr 30 11:00:03 corosync [MAIN ] Storing new sequence id for ring 2e64 > > Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state. > > Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state. > > Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4. > > Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the > > rep. > > Apr 30 11:00:33 corosync [MAIN ] Storing new sequence id for ring 2e68 > > Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state. > > Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state. > > > > On the other nodes - I see messages like > > notice: pcmk_peer_update: Transitional membership event on ring 11888: > > memb=2, new=0, lost=0 > > Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: > > SG-azfw2-190 301994924 > > Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: > > SG-azfw2-191 603984812 > > Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1 > > Apr 30 11:06:10 corosync [pcmk ] notice: pcmk_peer_update: Stable > > membership event on ring 11888: memb=2, new=0, lost=0 > > Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: > > SG-azfw2-190 301994924 > > Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: > > SG-azfw2-191 603984812 > > Apr 30 11:06:10 corosync [SYNC ] This node is within the primary > component > > and will provide service. > > Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state. > > > > Can the corosync experts please guide me on probable root cause for this > or > > ways to debug this further ? Help much appreciated. > > > > corosync version: 1.4.8. > > pacemaker version: 1.1.14-8.el6_8.1 > > > > Thanks! 
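Honza's suggestion, spelled out as commands (172.19.0.13 is the peer address from the test above; run as root on the node being partitioned):

```shell
# Simulate a clean partition: drop membership traffic in BOTH directions,
# since corosync cannot tolerate asymmetric (one-way) communication.
iptables -I INPUT  -p udp -s 172.19.0.13 -j DROP
iptables -I OUTPUT -p udp -d 172.19.0.13 -j DROP

# Undo after the test:
iptables -D INPUT  -p udp -s 172.19.0.13 -j DROP
iptables -D OUTPUT -p udp -d 172.19.0.13 -j DROP
```

For real-world one-way traffic loss, the usual safeguard is fencing (STONITH): the partition that retains quorum fences the unreachable node, so an asymmetric fault cannot leave the two sides with conflicting memberships for long.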
[ClusterLabs] Corosync unable to reach consensus for membership
Hello : I have a 3 node corosync and pacemaker cluster and the nodes are: Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ] Full list of resources: Master/Slave Set: ms_mysql [p_mysql] Masters: [ SG-azfw2-189 ] Slaves: [ SG-azfw2-190 SG-azfw2-191 ] For my network partition test, I created a firewall rule on Node SG-azfw2-190 to block all incoming udp traffic from node SG-azfw2-189 /sbin/iptables -I INPUT -p udp -s 172.19.0.13 -j DROP I don't think corosync is correctly detecting the partition, as I am getting different membership information from different nodes. On node SG-azfw2-189, I still see the members as: Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ] Full list of resources: Master/Slave Set: ms_mysql [p_mysql] Masters: [ SG-azfw2-189 ] Slaves: [ SG-azfw2-190 SG-azfw2-191 ] whereas, on the node SG-azfw2-190, I see membership as Online: [ SG-azfw2-190 SG-azfw2-191 ] OFFLINE: [ SG-azfw2-189 ] Full list of resources: Master/Slave Set: ms_mysql [p_mysql] Slaves: [ SG-azfw2-190 SG-azfw2-191 ] Stopped: [ SG-azfw2-189 ] I expected that node SG-azfw2-189 should have detected that the other 2 nodes have left. In the corosync logs for this node, I continuously see the below messages: Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4. Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the rep. Apr 30 11:00:03 corosync [MAIN ] Storing new sequence id for ring 2e64 Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state. Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state. Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4. Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the rep. Apr 30 11:00:33 corosync [MAIN ] Storing new sequence id for ring 2e68 Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state. Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state. 
On the other nodes - I see messages like notice: pcmk_peer_update: Transitional membership event on ring 11888: memb=2, new=0, lost=0 Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-190 301994924 Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-191 603984812 Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1 Apr 30 11:06:10 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11888: memb=2, new=0, lost=0 Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-190 301994924 Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-191 603984812 Apr 30 11:06:10 corosync [SYNC ] This node is within the primary component and will provide service. Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state. Can the corosync experts please guide me on the probable root cause for this, or ways to debug this further? Help much appreciated. corosync version: 1.4.8. pacemaker version: 1.1.14-8.el6_8.1 Thanks!
Re: [ClusterLabs] Regarding Finalization Timer (I_ELECTION) just popped (1800000ms)
Hi - Any help on this will be very much appreciated. Thanks! Prasad On Sun, Oct 28, 2018 at 11:32 PM Prasad Nagaraj wrote: > Hi : > > I came across a strange situation in my cluster few days back. I was > trying to replace one of the nodes in my 3 node cluster and I removed and > added that node back following due process as per > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_removing_a_corosync_node.html. > Updated corosync.conf on all nodes and restarted corosync and pacemaker on > all the nodes. The cluster didnt elect a DC and also didnt report any > activities for almost 28 mins as seen from below logs. Then I saw this > message: > Oct 22 22:35:14 [76417] vm85c4465533 crmd: info: > crm_timer_popped:Finalization Timer (I_ELECTION) just popped (1800000ms) > after which I could see further activities happening including DC election. > > I was not able to understand or identify any reasons for this behavior and > also there is absolutely no documentation on this Finalization timer and > what it means. Appreciate any help in terms of explaining what exactly this > timer means and what could be reasons for this behavior. I have pasted a > snippet of logs during the time here. I do have more logs and also logs > from other nodes that I can share if required. > > Thanks in advance for the help! 
> Prasad > > > Oct 22 22:07:46 [76412] vm85c4465533cib: info: > cib_process_request: Completed cib_modify operation for section nodes: OK > (rc=0, origin=vm46890219c5/crm_attribute/4, version=0.100.0) > Oct 22 22:07:46 [76412] vm85c4465533cib: info: > cib_file_write_with_digest: Reading cluster configuration file > /var/lib/pacemaker/cib/cib.QC3Kay (digest: > /var/lib/pacemaker/cib/cib.c74mjr) > Oct 22 22:07:47 [76412] vm85c4465533cib: info: > cib_file_backup: Archived previous version as > /var/lib/pacemaker/cib/cib-96.raw > Oct 22 22:07:47 [76412] vm85c4465533cib: info: > cib_file_write_with_digest: Wrote version 0.100.0 of the CIB to disk > (digest: 3ab4bffdc9c372985cfe50ad3100131d) > Oct 22 22:07:47 [76412] vm85c4465533cib: info: > cib_file_write_with_digest: Reading cluster configuration file > /var/lib/pacemaker/cib/cib.7NqOEy (digest: > /var/lib/pacemaker/cib/cib.gamHhs) > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_perform_op: Diff: --- 0.100.0 2 > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_perform_op: Diff: +++ 0.101.0 (null) > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_perform_op: + /cib: @epoch=101 > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="ms_mysql_member453" rsc="ms_mysql" score="0" node="vm46890219c5"/> > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_process_request: Completed cib_apply_diff operation for section 'all': > OK (rc=0, origin=local/cibadmin/2, version=0.101.0) > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_file_backup: Archived previous version as > /var/lib/pacemaker/cib/cib-97.raw > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_file_write_with_digest: Wrote version 0.101.0 of the CIB to disk > (digest: 7ffa91a8ca752581d4c1df13d287e467) > Oct 22 22:07:48 [76412] vm85c4465533cib: info: > cib_file_write_with_digest: Reading cluster configuration file > /var/lib/pacemaker/cib/cib.JkMV2C (digest: > 
/var/lib/pacemaker/cib/cib.mLy23A) > Oct 22 22:07:49 [76412] vm85c4465533cib: info: > cib_perform_op: Diff: --- 0.101.0 2 > Oct 22 22:07:49 [76412] vm85c4465533cib: info: > cib_perform_op: Diff: +++ 0.102.0 (null) > Oct 22 22:07:49 [76412] vm85c4465533cib: info: > cib_perform_op: + /cib: @epoch=102 > Oct 22 22:07:49 [76412] vm85c4465533cib: info: > cib_perform_op: + > /cib/configuration/resources/master[@id='ms_mysql']/meta_attributes[@id='ms_mysql-meta_attributes']/nvpair[@id='ms_mysql-meta_attributes-maintenance']: > @value=false > Oct 22 22:07:49 [76412] vm85c4465533cib: info: > cib_process_request: Completed cib_modify operation for section resources: > OK (rc=0, origin=local/crm_resource/6, version=0.102.0) > Oct 22 22:07:49 [76412] vm85c4465533cib: info: > cib_file_backup: Archived previous version as > /var/lib/pacemaker/cib/cib-98.raw > Oct 22 22:07:49 [76412] vm85c4465533cib: info: > cib_file_write_with_digest: Wrote version 0.102.0 of the CIB to disk > (digest: dd6263dc226d652721b09ec702e37742) > Oct 22 22:07:49 [76412] vm85c4465533
[ClusterLabs] Regarding Finalization Timer (I_ELECTION) just popped (1800000ms)
Hi : I came across a strange situation in my cluster a few days back. I was trying to replace one of the nodes in my 3 node cluster, and I removed and added that node back following due process as per https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_removing_a_corosync_node.html. Updated corosync.conf on all nodes and restarted corosync and pacemaker on all the nodes. The cluster didn't elect a DC and also didn't report any activities for almost 28 mins, as seen from the logs below. Then I saw this message: Oct 22 22:35:14 [76417] vm85c4465533 crmd: info: crm_timer_popped:Finalization Timer (I_ELECTION) just popped (1800000ms) after which I could see further activities happening, including DC election. I was not able to understand or identify any reasons for this behavior, and there is also absolutely no documentation on this Finalization timer and what it means. Appreciate any help in terms of explaining what exactly this timer means and what could be the reasons for this behavior. I have pasted a snippet of logs during the time here. I do have more logs, and also logs from other nodes, that I can share if required. Thanks in advance for the help! 
Prasad Oct 22 22:07:46 [76412] vm85c4465533cib: info: cib_process_request: Completed cib_modify operation for section nodes: OK (rc=0, origin=vm46890219c5/crm_attribute/4, version=0.100.0) Oct 22 22:07:46 [76412] vm85c4465533cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.QC3Kay (digest: /var/lib/pacemaker/cib/cib.c74mjr) Oct 22 22:07:47 [76412] vm85c4465533cib: info: cib_file_backup: Archived previous version as /var/lib/pacemaker/cib/cib-96.raw Oct 22 22:07:47 [76412] vm85c4465533cib: info: cib_file_write_with_digest: Wrote version 0.100.0 of the CIB to disk (digest: 3ab4bffdc9c372985cfe50ad3100131d) Oct 22 22:07:47 [76412] vm85c4465533cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.7NqOEy (digest: /var/lib/pacemaker/cib/cib.gamHhs) Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op: Diff: --- 0.100.0 2 Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op: Diff: +++ 0.101.0 (null) Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op: + /cib: @epoch=101 Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="ms_mysql_member453" rsc="ms_mysql" score="0" node="vm46890219c5"/> Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=local/cibadmin/2, version=0.101.0) Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_file_backup: Archived previous version as /var/lib/pacemaker/cib/cib-97.raw Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_file_write_with_digest: Wrote version 0.101.0 of the CIB to disk (digest: 7ffa91a8ca752581d4c1df13d287e467) Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.JkMV2C (digest: /var/lib/pacemaker/cib/cib.mLy23A) Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op: Diff: --- 0.101.0 2 Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op: 
Diff: +++ 0.102.0 (null) Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op: + /cib: @epoch=102 Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op: + /cib/configuration/resources/master[@id='ms_mysql']/meta_attributes[@id='ms_mysql-meta_attributes']/nvpair[@id='ms_mysql-meta_attributes-maintenance']: @value=false Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_process_request: Completed cib_modify operation for section resources: OK (rc=0, origin=local/crm_resource/6, version=0.102.0) Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_file_backup: Archived previous version as /var/lib/pacemaker/cib/cib-98.raw Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_file_write_with_digest: Wrote version 0.102.0 of the CIB to disk (digest: dd6263dc226d652721b09ec702e37742) Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.DOunDE (digest: /var/lib/pacemaker/cib/cib.MYBrjE) Oct 22 22:35:14 [76417] vm85c4465533 crmd: info: crm_timer_popped:Finalization Timer (I_ELECTION) just popped (1800000ms) Oct 22 22:35:14 [76417] vm85c4465533 crmd: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_ELECTION [ input=I_ELECTION cause=C_TIMER_POPPED origin=crm_timer_popped ] Oct 22 22:35:14 [76417] vm85c4465533 crmd: info: update_dc: Unset DC. Was vm85c4465533 Oct 22 22:35:14 [76417] vm85c4465533 crmd: info:
Re: [ClusterLabs] Understanding the behavior of pacemaker crash
Hi Ken - Only if I turn off corosync on the node [ where I crashed pacemaker] other nodes are able to detect and put the node as OFFLINE. Do you have any other guidance or insights into this ? Thanks Prasad On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj wrote: > Hi Ken - Thanks for the response. Pacemaker is still not running on that > node. So I am still wondering what could be the issue ? Any other > configurations or logs should I be sharing to understand this more ? > > Thanks! > > On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot wrote: > >> On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote: >> > Hello - I was trying to understand the behavior or cluster when >> > pacemaker crashes on one of the nodes. So I hard killed pacemakerd >> > and its related processes. >> > >> > --- >> > - >> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker >> > root 74022 1 0 07:53 pts/000:00:00 pacemakerd >> > 189 74028 74022 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/cib >> > root 74029 74022 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/stonithd >> > root 74030 74022 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/lrmd >> > 189 74031 74022 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/attrd >> > 189 74032 74022 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/pengine >> > 189 74033 74022 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/crmd >> > >> > root 75228 50092 0 07:54 pts/000:00:00 grep pacemaker >> > [root@SG-mysqlold-907 azureuser]# kill -9 74022 >> > >> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker >> > root 74030 1 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/lrmd >> > 189 74032 1 0 07:53 ?00:00:00 >> > /usr/libexec/pacemaker/pengine >> > >> > root 75303 50092 0 07:55 pts/000:00:00 grep pacemaker >> > [root@SG-mysqlold-907 azureuser]# kill -9 74030 >> > [root@SG-mysqlold-907 azureuser]# kill -9 74032 >> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker >> > root 75332 50092 0 07:55 pts/000:00:00 grep pacemaker >> > >> > 
[root@SG-mysqlold-907 azureuser]# crm satus >> > ERROR: status: crm_mon (rc=107): Connection to cluster failed: >> > Transport endpoint is not connected >> > --- >> > -- >> > >> > However, this does not seem to be having any effect on the cluster >> > status from other nodes >> > --- >> > >> > >> > [root@SG-mysqlold-909 azureuser]# crm status >> > Last updated: Thu Sep 27 07:56:17 2018 Last change: Thu Sep >> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909 >> > Stack: classic openais (with plugin) >> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - >> > partition with quorum >> > 3 nodes and 3 resources configured, 3 expected votes >> > >> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ] >> >> It most definitely would make the node offline, and if fencing were >> configured, the rest of the cluster would fence the node to make sure >> it's safely down. >> >> I see you're using the old corosync 1 plugin. I suspect what happened >> in this case is that corosync noticed the plugin died and restarted it >> quickly enough that it had rejoined by the time you checked the status >> elsewhere. >> >> > >> > Full list of resources: >> > >> > Master/Slave Set: ms_mysql [p_mysql] >> > Masters: [ SG-mysqlold-909 ] >> > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ] >> > >> > >> > [root@SG-mysqlold-908 azureuser]# crm status >> > Last updated: Thu Sep 27 07:56:08 2018 Last change: Thu Sep >> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909 >> > Stack: classic openais (with plugin) >> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - >> > partition with quorum >> > 3 nodes and 3 resources configured, 3 expected votes >> > >> >
Re: [ClusterLabs] Understanding the behavior of pacemaker crash
Hi Ken - Thanks for the response. Pacemaker is still not running on that node. So I am still wondering what could be the issue ? Any other configurations or logs should I be sharing to understand this more ? Thanks! On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot wrote: > On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote: > > Hello - I was trying to understand the behavior or cluster when > > pacemaker crashes on one of the nodes. So I hard killed pacemakerd > > and its related processes. > > > > --- > > - > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker > > root 74022 1 0 07:53 pts/000:00:00 pacemakerd > > 189 74028 74022 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/cib > > root 74029 74022 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/stonithd > > root 74030 74022 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/lrmd > > 189 74031 74022 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/attrd > > 189 74032 74022 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/pengine > > 189 74033 74022 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/crmd > > > > root 75228 50092 0 07:54 pts/000:00:00 grep pacemaker > > [root@SG-mysqlold-907 azureuser]# kill -9 74022 > > > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker > > root 74030 1 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/lrmd > > 189 74032 1 0 07:53 ?00:00:00 > > /usr/libexec/pacemaker/pengine > > > > root 75303 50092 0 07:55 pts/000:00:00 grep pacemaker > > [root@SG-mysqlold-907 azureuser]# kill -9 74030 > > [root@SG-mysqlold-907 azureuser]# kill -9 74032 > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker > > root 75332 50092 0 07:55 pts/000:00:00 grep pacemaker > > > > [root@SG-mysqlold-907 azureuser]# crm satus > > ERROR: status: crm_mon (rc=107): Connection to cluster failed: > > Transport endpoint is not connected > > --- > > -- > > > > However, this does not seem to be having any effect on the cluster > > status from other nodes > > --- > > > > > > [root@SG-mysqlold-909 azureuser]# crm status 
> > Last updated: Thu Sep 27 07:56:17 2018 Last change: Thu Sep > > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909 > > Stack: classic openais (with plugin) > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - > > partition with quorum > > 3 nodes and 3 resources configured, 3 expected votes > > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ] > > It most definitely would make the node offline, and if fencing were > configured, the rest of the cluster would fence the node to make sure > it's safely down. > > I see you're using the old corosync 1 plugin. I suspect what happened > in this case is that corosync noticed the plugin died and restarted it > quickly enough that it had rejoined by the time you checked the status > elsewhere. > > > > > Full list of resources: > > > > Master/Slave Set: ms_mysql [p_mysql] > > Masters: [ SG-mysqlold-909 ] > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ] > > > > > > [root@SG-mysqlold-908 azureuser]# crm status > > Last updated: Thu Sep 27 07:56:08 2018 Last change: Thu Sep > > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909 > > Stack: classic openais (with plugin) > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - > > partition with quorum > > 3 nodes and 3 resources configured, 3 expected votes > > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ] > > > > Full list of resources: > > > > Master/Slave Set: ms_mysql [p_mysql] > > Masters: [ SG-mysqlold-909 ] > > Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ] > > > > --- > > --- > > > > I am bit surprised that other nodes are not able to detect that > > pacemaker is down on one of the nodes - SG-mysqlold-907 > > > > Even if I kill pacemaker on the node which is a DC -
[ClusterLabs] Understanding the behavior of pacemaker crash
Hello - I was trying to understand the behavior of the cluster when pacemaker crashes on one of the nodes. So I hard killed pacemakerd and its related processes. [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker root 74022 1 0 07:53 pts/0 00:00:00 pacemakerd 189 74028 74022 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/cib root 74029 74022 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/stonithd root 74030 74022 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/lrmd 189 74031 74022 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/attrd 189 74032 74022 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/pengine 189 74033 74022 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/crmd root 75228 50092 0 07:54 pts/0 00:00:00 grep pacemaker [root@SG-mysqlold-907 azureuser]# kill -9 74022 [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker root 74030 1 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/lrmd 189 74032 1 0 07:53 ? 00:00:00 /usr/libexec/pacemaker/pengine root 75303 50092 0 07:55 pts/0 00:00:00 grep pacemaker [root@SG-mysqlold-907 azureuser]# kill -9 74030 [root@SG-mysqlold-907 azureuser]# kill -9 74032 [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker root 75332 50092 0 07:55 pts/0 00:00:00 grep pacemaker [root@SG-mysqlold-907 azureuser]# crm satus ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected - However, this does not seem to be having any effect on the cluster status from other nodes --- [root@SG-mysqlold-909 azureuser]# crm status Last updated: Thu Sep 27 07:56:17 2018 Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909 Stack: classic openais (with plugin) Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum 3 nodes and 3 resources configured, 3 expected votes Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ] Full list of resources: Master/Slave Set: ms_mysql [p_mysql] Masters: [ SG-mysqlold-909 ] Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ] [root@SG-mysqlold-908 azureuser]# 
crm status Last updated: Thu Sep 27 07:56:08 2018 Last change: Thu Sep 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909 Stack: classic openais (with plugin) Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum 3 nodes and 3 resources configured, 3 expected votes Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ] Full list of resources: Master/Slave Set: ms_mysql [p_mysql] Masters: [ SG-mysqlold-909 ] Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ] -- I am a bit surprised that the other nodes are not able to detect that pacemaker is down on one of the nodes - SG-mysqlold-907. Even if I kill pacemaker on the node which is the DC, I observe the same behavior, with the rest of the nodes not detecting that the DC is down. Could someone explain what the expected behavior is in these cases? I am using corosync 1.4.7 and pacemaker 1.1.14. Thanks in advance Prasad ___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] 'crm node standby' command failing with "Error performing operation: Communication error on send . Return code is 70"
Hi - Yesterday I noticed that when I tried to execute the 'crm node standby' command on one of my cluster nodes, it failed with "Error performing operation: Communication error on send . Return code is 70". My corosync logs had these entries during that time:

Sep 20 22:14:54 [4454] vm5c336912f1  crmd: notice: throttle_handle_load: High CPU load detected: 1.85
Sep 20 22:14:57 [4449] vm5c336912f1   cib:   info: cib_process_ping: Reporting our current digest to vmb546073338: 8fe67fcfcd20515c246c225a124a8902 for 0.481.2 (0x2742230 0)
Sep 20 22:15:09 [4449] vm5c336912f1   cib:   info: cib_process_request: Forwarding cib_modify operation for section nodes to master (origin=local/crm_attribute/4)
Sep 20 22:15:24 [4454] vm5c336912f1  crmd: notice: throttle_handle_load: High CPU load detected: 1.64
Sep 20 22:15:54 [4454] vm5c336912f1  crmd:   info: throttle_handle_load: Moderate CPU load detected: 0.99
Sep 20 22:15:54 [4454] vm5c336912f1  crmd:   info: throttle_send_command: New throttle mode: 0010 (was 0100)
Sep 20 22:16:24 [4454] vm5c336912f1  crmd:   info: throttle_send_command: New throttle mode: 0001 (was 0010)
Sep 20 22:16:54 [4454] vm5c336912f1  crmd:   info: throttle_send_command: New throttle mode: (was 0001)
Sep 20 22:17:09 [4449] vm5c336912f1   cib:   info: cib_process_request: Forwarding cib_modify operation for section nodes to master (origin=local/crm_attribute/4)
Sep 20 22:19:10 [4449] vm5c336912f1   cib:   info: cib_process_request: Forwarding cib_modify operation for section nodes to master (origin=local/crm_attribute/4)
Sep 20 22:23:08 [4449] vm5c336912f1   cib:   info: cib_perform_op: Diff: --- 0.481.2 2
Sep 20 22:23:08 [4449] vm5c336912f1   cib:   info: cib_perform_op: Diff: +++ 0.482.0 9bacc862b8713430c81ea91694942a41
Sep 20 22:23:08 [4449] vm5c336912f1   cib:   info: cib_perform_op: + /cib: @epoch=482, @num_updates=0

Is the above behavior due to pacemaker thinking that the cluster is highly loaded and throttling the execution of commands?
What is the best way to resolve or work around such problems? We do have high I/O load on our cluster, which hosts a MySQL database.

Also, in the thread https://lists.clusterlabs.org/pipermail/users/2017-May/005702.html it was asked:

> There is not much detail about "load-threshold".
> Please can someone share steps or any commands to modify "load-threshold".

Could someone advise whether this is the way to control the throttling of cluster operations, and how to set this parameter?

Thanks in advance,
Prasad
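For what it's worth (this is an assumption based on the Pacemaker 1.1 documentation, not an answer given in the thread): `load-threshold` is a crmd cluster option stored in `crm_config`, so it should be settable like any other cluster property. A hedged, untested sketch:

```sh
# Sketch: raise the load ceiling crmd uses before throttling its own work.
# The documented default is 80%; 90% below is an illustrative value, not a
# recommendation from the thread.
crm configure property load-threshold="90%"

# Equivalent low-level form:
crm_attribute --type crm_config --name load-threshold --update "90%"
```

Note that throttling is a symptom here, not the cause; reducing the I/O pressure on the node is the more durable fix.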
Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster
Hi - My systems are single-core CPU VMs running on the Azure platform. I am running MySQL on the nodes, which does generate high I/O load. And my bad, I meant to say 'High CPU load detected' is logged by crmd and not corosync. Corosync logs messages like 'Corosync main process was not scheduled for...', which in turn sometimes makes the pacemaker monitor action fail. Is increasing the token timeout a solution for this, or are there other ways?

Thanks for the help
Prasad

On Wed, 22 Aug 2018, 11:55 am Jan Friesse, wrote: > Prasad, > > > Thanks Ken and Ulrich. There is definitely high IO on the system with > > sometimes IOWAIT s of upto 90% > > I have come across some previous posts that IOWAIT is also considered as > > CPU load by Corosync. Is this true ? Does having high IO may lead > corosync > > complain as in " Corosync main process was not scheduled for..." or "Hi > > CPU load detected.." ? > > Yes it can. > > Corosync never logs "Hi CPU load detected...". > > > > > I will surely monitor the system more. > > Is that system VM or physical machine? Because " Corosync main process > was not scheduled for..." is usually happening on VMs where hosts are > highly overloaded. > > Honza > > > > > Thanks for your help. > > Prasad > > > > > > > > On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot > wrote: > > > >> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote: > >>>>>> Prasad Nagaraj schrieb am > >>>>>> 21.08.2018 um 11:42 in > >>> > >>> Nachricht > >>> : > >>>> Hi Ken - Thanks for you response. > >>>> > >>>> We do have seen messages in other cases like > >>>> corosync [MAIN ] Corosync main process was not scheduled for > >>>> 17314.4746 ms > >>>> (threshold is 8000. ms). Consider token timeout increase. > >>>> corosync [TOTEM ] A processor failed, forming new configuration. > >>>> > >>>> Is this the indication of a failure due to CPU load issues and will > >>>> this > >>>> get resolved if I upgrade to Corosync 2.x series ? > >> > >> Yes, most definitely this is a CPU issue.
It means corosync isn't > >> getting enough CPU cycles to handle the cluster token before the > >> timeout is reached. > >> > >> Upgrading may indeed help, as recent versions ensure that corosync runs > >> with real-time priority in the kernel, and thus are more likely to get > >> CPU time when something of lower priority is consuming all the CPU. > >> > >> But of course, there is some underlying problem that should be > >> identified and addressed. Figure out what's maxing out the CPU or I/O. > >> Ulrich's monitoring suggestion is a good start. > >> > >>> Hi! > >>> > >>> I'd strongly recommend starting monitoring on your nodes, at least > >>> until you know what's going on. The good old UNIX sa (sysstat > >>> package) could be a starting point. I'd monitor CPU idle > >>> specifically. Then go for 100% device utilization, then look for > >>> network bottlenecks... > >>> > >>> A new corosync release cannot fix those, most likely. > >>> > >>> Regards, > >>> Ulrich > >>> > >>>> > >>>> In any case, for the current scenario, we did not see any > >>>> scheduling > >>>> related messages. > >>>> > >>>> Thanks for your help. > >>>> Prasad > >>>> > >>>> On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot > >>>> wrote: > >>>> > >>>>> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote: > >>>>>> Hi: > >>>>>> > >>>>>> One of these days, I saw a spurious node loss on my 3-node > >>>>>> corosync > >>>>>> cluster with following logged in the corosync.log of one of the > >>>>>> nodes. > >>>>>> > >>>>>> Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: > >>>>>> Transitional membership event on ring 32: memb=2, new=0, lost=1 > >>>>>> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: > >>>>>> vm02d780875f 67114156 > >>>>>> Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: > >>>>
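Ken's point about scheduling priority can be checked directly on a node. A hedged sketch (my addition, not from the thread) using `chrt` from util-linux; on a corosync 2.x node that has taken real-time priority you would expect a policy such as SCHED_RR here, while under heavy-load symptoms like the above you may see SCHED_OTHER:

```shell
# Sketch: show the kernel scheduling policy of the running corosync process.
# Assumes pidof and chrt (util-linux) are available.
pid="$(pidof corosync || true)"
if [ -n "$pid" ]; then
    chrt -p "$pid"
else
    echo "corosync is not running on this host"
fi
```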
Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster
Thanks Ken and Ulrich. There is definitely high I/O on the system, with IOWAITs sometimes of up to 90%. I have come across some previous posts saying that IOWAIT is also considered CPU load by Corosync. Is this true? Can high I/O lead corosync to complain as in "Corosync main process was not scheduled for..." or "High CPU load detected.."? I will surely monitor the system more.

Thanks for your help.
Prasad

On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot wrote: > On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote: > > > > > Prasad Nagaraj schrieb am > > > > > 21.08.2018 um 11:42 in > > > > Nachricht > > : > > > Hi Ken - Thanks for you response. > > > > > > We do have seen messages in other cases like > > > corosync [MAIN ] Corosync main process was not scheduled for > > > 17314.4746 ms > > > (threshold is 8000. ms). Consider token timeout increase. > > > corosync [TOTEM ] A processor failed, forming new configuration. > > > > > > Is this the indication of a failure due to CPU load issues and will > > > this > > > get resolved if I upgrade to Corosync 2.x series ? > > Yes, most definitely this is a CPU issue. It means corosync isn't > getting enough CPU cycles to handle the cluster token before the > timeout is reached. > > Upgrading may indeed help, as recent versions ensure that corosync runs > with real-time priority in the kernel, and thus are more likely to get > CPU time when something of lower priority is consuming all the CPU. > > But of course, there is some underlying problem that should be > identified and addressed. Figure out what's maxing out the CPU or I/O. > Ulrich's monitoring suggestion is a good start. > > > Hi! > > > > I'd strongly recommend starting monitoring on your nodes, at least > > until you know what's going on. The good old UNIX sa (sysstat > > package) could be a starting point. I'd monitor CPU idle > > specifically. Then go for 100% device utilization, then look for > > network bottlenecks...
> > > > A new corosync release cannot fix those, most likely. > > > > Regards, > > Ulrich > > > > > > > > In any case, for the current scenario, we did not see any > > > scheduling > > > related messages. > > > > > > Thanks for your help. > > > Prasad > > > > > > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot > > > wrote: > > > > > > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote: > > > > > Hi: > > > > > > > > > > One of these days, I saw a spurious node loss on my 3-node > > > > > corosync > > > > > cluster with following logged in the corosync.log of one of the > > > > > nodes. > > > > > > > > > > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: > > > > > Transitional membership event on ring 32: memb=2, new=0, lost=1 > > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: > > > > > vm02d780875f 67114156 > > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: > > > > > vmfa2757171f 151000236 > > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: > > > > > vm728316982d 201331884 > > > > > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: > > > > > Stable > > > > > membership event on ring 32: memb=2, new=0, lost=0 > > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: > > > > > vm02d780875f 67114156 > > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: > > > > > vmfa2757171f 151000236 > > > > > Aug 18 12:40:25 corosync [pcmk ] info: > > > > > ais_mark_unseen_peer_dead: > > > > > Node vm728316982d was not seen in the previous transition > > > > > Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node > > > > > 201331884/vm728316982d is now: lost > > > > > Aug 18 12:40:25 corosync [pcmk ] info: > > > > > send_member_notification: > > > > > Sending membership update 32 to 3 children > > > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left > > > > > the > > > > > membership and a new membership was formed. 
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info: > > > > > plugin_handle_membership: Membersh
[ClusterLabs] Spurious node loss in corosync cluster
Hi: One of these days, I saw a spurious node loss on my 3-node corosync cluster, with the following logged in the corosync.log of one of the nodes.

Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=1
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: vm728316982d 201331884
Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node vm728316982d was not seen in the previous transition
Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: lost
Aug 18 12:40:25 corosync [pcmk ] info: send_member_notification: Sending membership update 32 to 3 children
Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   info: plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   info: plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4548] vmfa2757171f       crmd: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   info: peer_update_callback: vm728316982d is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f       crmd: warning: match_down_event: No match for shutdown action on vm728316982d
Aug 18 12:40:25 [4548] vmfa2757171f       crmd: notice: peer_update_callback: Stonith/shutdown of vm728316982d not matched
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   info: crm_update_peer_join: peer_update_callback: Node vm728316982d[201331884] - join-6 phase 4 -> 0
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
Aug 18 12:40:25 [4543] vmfa2757171f        cib:   info: plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4543] vmfa2757171f        cib: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4543] vmfa2757171f        cib: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
Aug 18 12:40:25 [4543] vmfa2757171f        cib: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache

However, within seconds, the node was able to join back:

Aug 18 12:40:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 36: memb=3, new=1, lost=0
Aug 18 12:40:34 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: member
Aug 18 12:40:34 corosync [pcmk ] info: pcmk_peer_update: NEW: vm728316982d 201331884

But this was enough time for the cluster to get into a split-brain kind of situation, with a resource on the node vm728316982d being stopped because of this node-loss detection. Could anyone help me understand whether this could happen due to a transient network distortion? Are there any configuration settings that can be applied in corosync.conf so that the cluster is more resilient to such temporary distortions? Currently my corosync.conf looks like this:

compatibility: whitetank

totem {
    version: 2
    secauth: on
    threads: 0
    interface {
        member {
            memberaddr: 172.20.0.4
        }
        member {
            memberaddr: 172.20.0.9
        }
        member {
            memberaddr: 172.20.0.12
        }
        bindnetaddr: 172.20.0.12
        ringnumber: 0
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
    token: 1
    token_retransmits_before_loss_const: 10
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

service {
    name: pacemaker
    ver: 1
}

amf {
    mode: disabled
}

Thanks in advance for the help.
Prasad
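Echoing the token-timeout advice given elsewhere in this digest (and not an answer from this particular thread): the usual knob for riding out brief network or scheduling hiccups is the totem token timeout. A hedged sketch of the relevant fragment; the 10000 ms value is an assumed example, and the same file must be applied on all nodes followed by a corosync restart:

```
totem {
    # Assumed example values - tune for your environment.
    # token: milliseconds to wait for the token before declaring a node lost.
    token: 10000
    token_retransmits_before_loss_const: 10
}
```

The trade-off is that a larger token timeout also delays detection of genuine node failures by the same amount.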
Re: [ClusterLabs] corosync not able to form cluster
Hi Christine - Thanks for looking into the logs. I also see that the node eventually comes out of the GATHER state here:

Jun 07 16:56:10 corosync [TOTEM ] entering GATHER state from 0.
Jun 07 16:56:10 corosync [TOTEM ] Creating commit token because I am the rep.

Does it mean it has timed out or given up, and then came out? Second point: I did see some unexpected entries when I ran tcpdump on the node coro.4. [It is also pasted in one of the earlier threads.] You can see that it was receiving messages like:

10:23:17.117347 IP 172.22.0.13.50468 > 172.22.0.4.netsupport: UDP, length 332
10:23:17.140960 IP 172.22.0.8.50438 > 172.22.0.4.netsupport: UDP, length 82
10:23:17.141319 IP 172.22.0.6.38535 > 172.22.0.4.netsupport: UDP, length 156

Please note that 172.22.0.8 and 172.22.0.6 are not part of my group, and I was wondering why these messages are arriving.

Thanks!

On Fri, Jun 8, 2018 at 2:34 PM, Christine Caulfield wrote: > On 07/06/18 18:32, Prasad Nagaraj wrote: > > Hi Christine - Got it:) > > > > I have collected few seconds of debug logs from all nodes after startup. > > Please find them attached. > > Please let me know if this will help us to identify rootcause. > > > > The problem is on the node coro.4 - it never gets out of the JOIN > > "Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11." > > process so something is wrong on that node, either a rogue routing table > entry, dangling iptables rule or even a broken NIC. > > Chrissie
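As a side note (my addition, not from the thread): the "who is sending to us that shouldn't be" question can be answered mechanically from tcpdump output rather than by scanning it visually. A hedged sketch:

```python
import re

# Matches tcpdump lines of the form:
#   10:23:17.117347 IP 172.22.0.13.50468 > 172.22.0.4.netsupport: UDP, length 332
# Group 1 is the source IP, group 2 the destination IP.
PKT_RE = re.compile(r"IP (\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.")

def unexpected_senders(tcpdump_lines, members):
    """Source IPs seen in tcpdump output that are not configured cluster members."""
    seen = set()
    for line in tcpdump_lines:
        m = PKT_RE.search(line)
        if m:
            seen.add(m.group(1))
    return seen - set(members)
```

Run against the capture above with members {172.22.0.4, 172.22.0.11, 172.22.0.13}, it would surface 172.22.0.8 and 172.22.0.6 as the stray senders.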
Re: [ClusterLabs] corosync not able to form cluster
Hi - As you can see in the corosync.conf details, I have already set debug: on. Thanks! On Thu, 7 Jun 2018, 8:03 pm Christine Caulfield, wrote: > On 07/06/18 15:24, Prasad Nagaraj wrote: > > > > No iptables or otherwise firewalls are setup on these nodes. > > > > One observation is that each node sends messages on with its own ring > > sequence number which is not converging.. I have seen that in a good > > cluster, when nodes respond with same sequence number, the membership is > > automatically formed. But in our case, that is not the case. > > > > That's just a side-effect of the cluster not forming. It's not causing > it. Can you enable full corosync debugging (just add debug:on to the end > of the logging {} stanza) and see if that has any more useful > information (I only need the corosync bits, not the pcmk ones) > > Chrissie > > > Example: we can see that one node sends > > Jun 07 07:55:04 corosync [pcmk ] notice: pcmk_peer_update: Transitional > > membership event on ring 71084: memb=1, new=0, lost=0 > > . > > Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Transitional > > membership event on ring 71096: memb=1, new=0, lost=0 > > Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Stable > > membership event on ring 71096: memb=1, new=0, lost=0 > > > > other node sends messages with its own numbers > > Jun 07 07:55:12 corosync [pcmk ] notice: pcmk_peer_update: Transitional > > membership event on ring 71088: memb=1, new=0, lost=0 > > Jun 07 07:55:12 corosync [pcmk ] notice: pcmk_peer_update: Stable > > membership event on ring 71088: memb=1, new=0, lost=0 > > ... > > Jun 07 07:55:24 corosync [pcmk ] notice: pcmk_peer_update: Transitional > > membership event on ring 71100: memb=1, new=0, lost=0 > > Jun 07 07:55:24 corosync [pcmk ] notice: pcmk_peer_update: Stable > > membership event on ring 71100: memb=1, new=0, lost=0 > > > > Any idea why this happens, and why the seq. numbers from different nodes > > are not converging ?
> > Thanks!
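To make Chrissie's suggestion concrete for readers: a sketch of a logging stanza with full debugging enabled. The values mirror the corosync.conf shown earlier in this digest and are an assumption, not the poster's exact file:

```
logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log
    timestamp: on
    debug: on
}
```

Remember to turn `debug: on` back off afterwards; full debug logging is verbose enough to add I/O load of its own.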
Re: [ClusterLabs] corosync not able to form cluster
No iptables or other firewalls are set up on these nodes.

One observation is that each node sends messages with its own ring sequence number, which is not converging. I have seen that in a good cluster, when nodes respond with the same sequence number, the membership is automatically formed. But in our case, that is not happening.

Example: we can see that one node sends

Jun 07 07:55:04 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71084: memb=1, new=0, lost=0
.
Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71096: memb=1, new=0, lost=0
Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71096: memb=1, new=0, lost=0

while the other node sends messages with its own numbers:

Jun 07 07:55:12 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71088: memb=1, new=0, lost=0
Jun 07 07:55:12 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71088: memb=1, new=0, lost=0
...
Jun 07 07:55:24 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71100: memb=1, new=0, lost=0
Jun 07 07:55:24 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71100: memb=1, new=0, lost=0

Any idea why this happens, and why the sequence numbers from different nodes are not converging?

Thanks!
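The "stuck at memb=1" symptom described above can also be checked mechanically. A hedged sketch (mine, not from the thread) that scans a corosync log for pcmk_peer_update membership events and reports whether the node ever saw a peer:

```python
import re

# Matches both "Transitional" and "Stable" pcmk_peer_update events, e.g.:
#   ... pcmk_peer_update: Stable membership event on ring 71096: memb=1, new=0, lost=0
EVENT_RE = re.compile(r"membership event on ring (\d+): memb=(\d+)")

def rings_and_members(log_lines):
    """Extract (ring, memb) pairs from pcmk_peer_update log lines."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in (EVENT_RE.search(line) for line in log_lines) if m]

def stuck_alone(log_lines):
    """True if every membership event shows memb=1, i.e. the node
    never sees a peer - the non-converging symptom described above."""
    events = rings_and_members(log_lines)
    return bool(events) and all(memb == 1 for _, memb in events)
```

Comparing the ring numbers it extracts across nodes also makes the non-convergence visible at a glance.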
Re: [ClusterLabs] corosync not able to form cluster
joined or left the membership and a new membership was formed.
Jun 07 10:49:40 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 78152: memb=1, new=0, lost=0
Jun 07 10:49:40 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 10:49:40 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 78152: memb=1, new=0, lost=0
Jun 07 10:49:40 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 10:49:40 corosync [pcmk ] info: update_member: 0x1576f20 Node 184555180 ((null)) born on: 78140
Jun 07 10:49:40 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 10:49:40 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.11) ; members(old:0 left:0)
Jun 07 10:49:40 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 07 10:49:52 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 78160: memb=1, new=0, lost=0
Jun 07 10:49:52 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 10:49:52 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 78160: memb=1, new=0, lost=0
Jun 07 10:49:52 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 10:49:52 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.

On Thu, Jun 7, 2018 at 4:01 PM, Prasad Nagaraj wrote:
> Hi Christine -
>
> Thanks for looking into this and here are the details.
> All the nodes are pingable from each other and are actively exchanging
> corosync packets with each other, as seen from tcpdump.
>
> Here is the ifconfig output from each of the nodes:
>
> # ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:0D:3A:03:35:64
>           inet addr:172.22.0.4  Bcast:172.22.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::20d:3aff:fe03:3564/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:3721169 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2780455 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:1229505889 (1.1 GiB)  TX bytes:982021535 (936.5 MiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:1367018 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1367018 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:459591075 (438.3 MiB)  TX bytes:459591075 (438.3 MiB)
>
> -
> # ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:0D:3A:03:38:D7
>           inet addr:172.22.0.11  Bcast:172.22.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::20d:3aff:fe03:38d7/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:4052226 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:3744671 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:1930027786 (1.7 GiB)  TX bytes:1180930029 (1.0 GiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:1394930 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1394930 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:508170210 (484.6 MiB)  TX bytes:508170210 (484.6 MiB)
>
> --
> # ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:0D:3A:04:06:F6
>           inet addr:172.22.0.13  Bcast:172.22.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::20d:3aff:fe04:6f6/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:3974698 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:3891546 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:1903617077 (1.7 GiB)  TX bytes:1234961001 (1.1 GiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:1503643 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1503643 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:541177718 (516.1
Re: [ClusterLabs] corosync not able to form cluster
22.0.4.netsupport: UDP, length 332
10:25:30.827563 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length 332
10:25:30.850832 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length 376
10:25:30.863531 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP, length 332
10:25:30.886664 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP, length 332
10:25:30.886691 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length 332
10:25:30.910820 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length 376
10:25:30.923403 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP, length 332
10:25:30.946507 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP, length 332
10:25:30.946531 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length 332
10:25:30.970931 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length 376
10:25:30.983055 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP, length 332
10:25:31.006306 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP, length 332
10:25:31.006339 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length 332
10:25:31.030207 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length 376

And here is the lsof output for each node.
lsof -i | grep corosync
corosync 47873 root 10u IPv4 1193147 0t0 UDP 172.22.0.4:netsupport
corosync 47873 root 13u IPv4 1193151 0t0 UDP 172.22.0.4:45846
corosync 47873 root 14u IPv4 1193152 0t0 UDP 172.22.0.4:34060
corosync 47873 root 15u IPv4 1193153 0t0 UDP 172.22.0.4:40755

lsof -i | grep corosync
corosync 11039 root 10u IPv4 54862 0t0 UDP 172.22.0.13:netsupport
corosync 11039 root 13u IPv4 54869 0t0 UDP 172.22.0.13:50468
corosync 11039 root 14u IPv4 54870 0t0 UDP 172.22.0.13:57332
corosync 11039 root 15u IPv4 54871 0t0 UDP 172.22.0.13:46460

lsof -i | grep corosync
corosync 75188 root 10u IPv4 1582737 0t0 UDP 172.22.0.11:netsupport
corosync 75188 root 13u IPv4 1582741 0t0 UDP 172.22.0.11:54545
corosync 75188 root 14u IPv4 1582742 0t0 UDP 172.22.0.11:53213
corosync 75188 root 15u IPv4 1582743 0t0 UDP 172.22.0.11:44864

Thanks!

On Thu, Jun 7, 2018 at 3:33 PM, Christine Caulfield wrote: > On 07/06/18 09:21, Prasad Nagaraj wrote: > > Hi - I am running corosync on 3 nodes of CentOS release 6.9 (Final). > > Corosync version is corosync-1.4.7. > > The nodes are not seeing each other and not able to form memberships. > > What I see is continuous message about " A processor joined or left the > > membership and a new membership was formed." > > For example:on node: vm2883711991 > > > > I can't draw any conclusions from the logs, we'd need to see what > corosync though it was binding to and the IP addresses of the hosts. > > Have a look at the start of the logs and see if they match what you'd > expect (ie are similar to the ones on the working clusters), Also check > using lsof, to see what addresses corosync is bound to. tcpdump on port > 5405 will show you if traffic is leaving the nodes and being received. > > Also check firewall settings and make sure the nodes can ping each other.
> > If you're still stumped them feel free to post more info here for us to > look at, though if you have that configuration working on other nodes it > might be something in your environment > > Chrissie > > > > > > Jun 07 07:54:52 corosync [pcmk ] info: pcmk_peer_update: MEMB: > > vm2883711991 184555180 > > Jun 07 07:54:52 corosync [TOTEM ] A processor joined or left the > > membership and a new membership was formed. > > Jun 07 07:54:52 corosync [CPG ] chosen downlist: sender r(0) > > ip(172.22.0.11) ; members(old:1 left:0) > > Jun 07 07:54:52 corosync [MAIN ] Completed service synchronization, > > ready to provide service. > > Jun 07 07:55:04 corosync [pcmk ] notice: pcmk_peer_update: Transitional > > membership event on ring 71084: memb=1, new=0, lost=0 > > Jun 07 07:55:04 corosync [pcmk ] info: pcmk_peer_update: memb: > > vm2883711991 184555180 > > Jun 07 07:55:04 corosync [pcmk ] notice: pcmk_peer_update: Stable > > membership event on ring 71084: memb=1, new=0, lost=0 > > Jun 07 07:55:04 corosync [pcmk ] info: pcmk_peer_update: MEMB: > > vm2883711991 184555180 > > Jun 07 07:55:04 corosync [TOTEM ] A processor joined or left the > > membership and a new membership was formed. > > Jun 07 07:55:04 corosync [CPG ] chosen downlist: sender r(0) > > ip(172.22.0.11) ; members(old:1 left:0) > > Jun 07 07:55:04 corosync [MAIN ] Completed service synchronization, > > ready to
[ClusterLabs] corosync not able to form cluster
Hi - I am running corosync on 3 nodes of CentOS release 6.9 (Final). The corosync version is corosync-1.4.7. The nodes are not seeing each other and are not able to form a membership. What I see is a continuous stream of messages like "A processor joined or left the membership and a new membership was formed."

For example, on node vm2883711991:

Jun 07 07:54:52 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 07:54:52 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:54:52 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:54:52 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 07 07:55:04 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71084: memb=1, new=0, lost=0
Jun 07 07:55:04 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 07:55:04 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71084: memb=1, new=0, lost=0
Jun 07 07:55:04 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 07:55:04 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:04 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:04 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71096: memb=1, new=0, lost=0
Jun 07 07:55:16 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71096: memb=1, new=0, lost=0
Jun 07 07:55:16 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 07:55:16 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:16 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:16 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 07 07:55:28 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71108: memb=1, new=0, lost=0
Jun 07 07:55:28 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 07:55:28 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71108: memb=1, new=0, lost=0
Jun 07 07:55:28 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 07:55:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:28 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:28 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 07 07:55:40 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71120: memb=1, new=0, lost=0
Jun 07 07:55:40 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 07:55:40 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71120: memb=1, new=0, lost=0
Jun 07 07:55:40 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 07:55:40 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:40 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:40 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 07 07:55:52 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71132: memb=1, new=0, lost=0
Jun 07 07:55:52 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 07:55:52 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71132: memb=1, new=0, lost=0
Jun 07 07:55:52 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 07:55:52 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:52 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:52 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 07 07:56:04 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71144: memb=1, new=0, lost=0
Jun 07 07:56:04 corosync [pcmk ] info: pcmk_peer_update: memb: vm2883711991 184555180
Jun 07 07:56:04 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71144: memb=1, new=0, lost=0
Jun 07 07:56:04 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm2883711991 184555180
Jun 07 07:56:04 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:56:17 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on
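The churn in the log above is periodic: a new single-node membership is formed every 12 seconds. One way to confirm that the pattern is a fixed cycle (rather than random flapping) is to pull the timestamps out of the repeating [TOTEM ] lines and compute the gaps between them. A minimal sketch, assuming the log text is pasted into a Python string (the sample lines below are copied from the excerpt above):

```python
from datetime import datetime

# A few of the repeating [TOTEM ] lines from the log excerpt above.
log = """\
Jun 07 07:54:52 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:04 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:16 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 07:55:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
"""

def membership_intervals(text, year=2019):
    """Return the seconds between successive TOTEM membership messages.

    The syslog-style timestamps have no year, so one is supplied
    (arbitrary here; only the deltas matter)."""
    stamps = []
    for line in text.splitlines():
        if "[TOTEM ]" in line and "new membership was formed" in line:
            # The first three whitespace-separated fields are "Mon DD HH:MM:SS".
            stamp = " ".join(line.split()[:3])
            stamps.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

print(membership_intervals(log))  # → [12, 12, 12]
```

A constant interval like this points at a deterministic cause (e.g. peers that are never reachable, so each failed join cycle restarts on the same timeouts) rather than intermittent packet loss.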
[ClusterLabs] Continuous membership events in Corosync
Hi - I am trying to set up a 3 node cluster. I have got corosync up and running on all the nodes and all nodes have joined the cluster. However, in the corosync logs I am seeing continuous, repeated messages about membership events on rings. They come every 12 seconds on all nodes. Here is a sample:

May 03 10:46:29 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2264: memb=3, new=0, lost=0
May 03 10:46:29 corosync [pcmk ] info: pcmk_peer_update: memb: vme6c794899e 83891884
May 03 10:46:29 corosync [pcmk ] info: pcmk_peer_update: memb: vmc9d15655fe 151000748
May 03 10:46:29 corosync [pcmk ] info: pcmk_peer_update: memb: vm5e42438470 184555180
May 03 10:46:29 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2264: memb=3, new=0, lost=0
May 03 10:46:29 corosync [pcmk ] info: pcmk_peer_update: MEMB: vme6c794899e 83891884
May 03 10:46:29 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmc9d15655fe 151000748
May 03 10:46:29 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm5e42438470 184555180
May 03 10:46:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 03 10:46:29 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:46:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
May 03 10:46:41 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2268: memb=3, new=0, lost=0
May 03 10:46:41 corosync [pcmk ] info: pcmk_peer_update: memb: vme6c794899e 83891884
May 03 10:46:41 corosync [pcmk ] info: pcmk_peer_update: memb: vmc9d15655fe 151000748
May 03 10:46:41 corosync [pcmk ] info: pcmk_peer_update: memb: vm5e42438470 184555180
May 03 10:46:41 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2268: memb=3, new=0, lost=0
May 03 10:46:41 corosync [pcmk ] info: pcmk_peer_update: MEMB: vme6c794899e 83891884
May 03 10:46:41 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmc9d15655fe 151000748
May 03 10:46:41 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm5e42438470 184555180
May 03 10:46:41 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 03 10:46:41 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:46:41 corosync [MAIN ] Completed service synchronization, ready to provide service.
May 03 10:46:54 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2272: memb=3, new=0, lost=0
May 03 10:46:54 corosync [pcmk ] info: pcmk_peer_update: memb: vme6c794899e 83891884
May 03 10:46:54 corosync [pcmk ] info: pcmk_peer_update: memb: vmc9d15655fe 151000748
May 03 10:46:54 corosync [pcmk ] info: pcmk_peer_update: memb: vm5e42438470 184555180
May 03 10:46:54 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2272: memb=3, new=0, lost=0
May 03 10:46:54 corosync [pcmk ] info: pcmk_peer_update: MEMB: vme6c794899e 83891884
May 03 10:46:54 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmc9d15655fe 151000748
May 03 10:46:54 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm5e42438470 184555180
May 03 10:46:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 03 10:46:54 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:46:54 corosync [MAIN ] Completed service synchronization, ready to provide service.
May 03 10:47:06 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2276: memb=3, new=0, lost=0
May 03 10:47:06 corosync [pcmk ] info: pcmk_peer_update: memb: vme6c794899e 83891884
May 03 10:47:06 corosync [pcmk ] info: pcmk_peer_update: memb: vmc9d15655fe 151000748
May 03 10:47:06 corosync [pcmk ] info: pcmk_peer_update: memb: vm5e42438470 184555180
May 03 10:47:06 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 2276: memb=3, new=0, lost=0
May 03 10:47:06 corosync [pcmk ] info: pcmk_peer_update: MEMB: vme6c794899e 83891884
May 03 10:47:06 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmc9d15655fe 151000748
May 03 10:47:06 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm5e42438470 184555180
May 03 10:47:06 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 03 10:47:06 corosync [CPG ] chosen downlist: sender r(0) ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:47:06 corosync [MAIN ] Completed service synchronization, ready to provide service.

My corosync version is corosync-1.4.7-5.el6.x86_64 and my corosync.conf is:

compatibility: whitetank
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        member {
            memberaddr: 172.22.0.5
        }
        member {
            memberaddr: 172.22.0.9
        }
        member {
            memberaddr: 172.22.0.11
        }
        bindnetaddr: 172.22.0.5
        ringnumber: 0
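One thing worth checking about the fragment above (it is cut off, so this may already be handled further down): on corosync 1.x, a member list like this only takes effect when the totem section also sets transport: udpu. Without it, corosync falls back to multicast, which is frequently filtered in virtualized networks and can produce exactly this kind of repeated membership churn. A sketch of the relevant section, assuming unicast UDP is what was intended:

```
totem {
    version: 2
    secauth: off
    threads: 0
    # Required for the member{} list below to be used;
    # otherwise corosync 1.x uses multicast.
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 172.22.0.5
        member { memberaddr: 172.22.0.5 }
        member { memberaddr: 172.22.0.9 }
        member { memberaddr: 172.22.0.11 }
    }
}
```

Note that bindnetaddr must match each node's own local address or network, so it normally differs per node when set to a host address as here.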