[ClusterLabs] Guest nodes in a pacemaker cluster

2020-01-14 Thread Prasad Nagaraj
Hi - I have a 3-node master-slave-slave MySQL cluster set up using the
corosync/pacemaker stack.

Now I want to introduce 4 more slaves to the configuration. However, I do
not want these to be part of the quorum or to participate in DC election, etc.
Could someone guide me on a recommended approach to do this?

Thanks!
Prasad.
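
A common way to achieve this is Pacemaker Remote: remote nodes run resources
but hold no quorum vote and take no part in DC election. A minimal crm shell
sketch, assuming pacemaker_remote is installed on each new slave and
/etc/pacemaker/authkey is distributed to it; the node name and address here
are illustrative only:

crm configure primitive slave4 ocf:pacemaker:remote \
    params server=slave4.example.com \
    op monitor interval=30s

Ordinary location constraints can then allow the ms_mysql slaves onto the
remote nodes.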
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Regarding Finalization Timer (I_ELECTION) just popped (1800000ms)

2019-05-18 Thread Prasad Nagaraj
Thank you for the response Ken. Will watch for this to be reproduced again.

On Thu, May 16, 2019 at 4:10 AM Ken Gaillot  wrote:

> Resurrecting a really old thread in case anyone had similar questions.
> This arrived at a crazy busy time and got neglected.
>
> The "finalization timer" is a timeout in the DC election process. The
> value is the join-finalization-timeout cluster property (formerly crmd-
> finalization-timeout), which defaults to 30 minutes.
>
> The whole election process is undocumented and quite arcane. It would
> be nice to document it but that's a bigger project than there is time
> for at the moment.
>
> The controller (crmd) is implemented as a finite state machine, meaning
> various inputs move it from one state to another according to fixed
> rules. The finalization timer is started when the "finalize join" state
> is reached, and stopped whenever that state is left. This state is
> achieved once the (possibly newly elected) DC has received join
> requests from all nodes, and is left once the DC has sync'd the CIB to
> those nodes, ack'd their join requests, and received confirmations back
> from them.
>
> Obviously it's too late to look at this particular incident, but if it
> can be reliably reproduced, I can take a new look.
>
> On Sun, 2018-10-28 at 23:32 +0530, Prasad Nagaraj wrote:
> > Hi :
> >
> > I came across  a strange situation in my cluster few days back. I was
> > trying to replace one of the nodes in my 3 node cluster and I removed
> > and added that node back following due process as per
> >
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_removing_a_corosync_node.html
> > . Updated corosync.conf on all nodes and restarted corosync and
> > pacemaker on all the nodes. The cluster didnt elect a DC and also
> > didnt report any activities for almost 28 mins as seen from below
> > logs. Then I saw this message:
> > Oct 22 22:35:14 [76417] vm85c4465533   crmd: info:
> > crm_timer_popped:Finalization Timer (I_ELECTION) just popped
> > (1800000ms)
> > after which I could see further activities happening including DC
> > election.
> >
> > I was not able to understand or identify any reasons for this
> > behavior and also there is absolutely no documentation on this
> > Finalization timer and what it means. Appreciate any help in terms of
> > explaining what exactly this timer means and what could be reasons
> > for this behavior. I have pasted a snippet of logs during the time
> > here. I do have more logs and also logs from other nodes that I can
> > share if required.
> >
> > Thanks in advance for the help!
> > Prasad
> >
> >
> > Oct 22 22:07:46 [76412] vm85c4465533cib: info:
> > cib_process_request: Completed cib_modify operation for section
> > nodes: OK (rc=0, origin=vm46890219c5/crm_attribute/4,
> > version=0.100.0)
> > Oct 22 22:07:46 [76412] vm85c4465533cib: info:
> > cib_file_write_with_digest:  Reading cluster configuration file
> > /var/lib/pacemaker/cib/cib.QC3Kay (digest:
> > /var/lib/pacemaker/cib/cib.c74mjr)
> > Oct 22 22:07:47 [76412] vm85c4465533cib: info:
> > cib_file_backup: Archived previous version as
> > /var/lib/pacemaker/cib/cib-96.raw
> > Oct 22 22:07:47 [76412] vm85c4465533cib: info:
> > cib_file_write_with_digest:  Wrote version 0.100.0 of the CIB to disk
> > (digest: 3ab4bffdc9c372985cfe50ad3100131d)
> > Oct 22 22:07:47 [76412] vm85c4465533cib: info:
> > cib_file_write_with_digest:  Reading cluster configuration file
> > /var/lib/pacemaker/cib/cib.7NqOEy (digest:
> > /var/lib/pacemaker/cib/cib.gamHhs)
> > Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> > cib_perform_op:  Diff: --- 0.100.0 2
> > Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> > cib_perform_op:  Diff: +++ 0.101.0 (null)
> > Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> > cib_perform_op:  +  /cib:  @epoch=101
> > Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> > cib_perform_op:  ++ /cib/configuration/constraints:
> > <rsc_location id="ms_mysql_member453" rsc="ms_mysql" score="0" node="vm46890219c5"/>
> > Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> > cib_process_request: Completed cib_apply_diff operation for section
> > 'all': OK (rc=0, origin=local/cibadmin/2, version=0.101.0)
> > Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> > cib_file_backup: Archived previous version as
> > /var/lib/pacemaker/cib/cib-97.raw
> > Oct 22 22:07:48 [76412] vm85c4465533  
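
For reference, the join-finalization-timeout property described above can be
inspected and tuned like any cluster property; a crm_attribute sketch (the
value is illustrative only, and on the 1.1.x versions in these threads the
property is still named crmd-finalization-timeout):

crm_attribute --type crm_config --name join-finalization-timeout --query
crm_attribute --type crm_config --name join-finalization-timeout --update 10min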

Re: [ClusterLabs] Corosync unable to reach consensus for membership

2019-05-01 Thread Prasad Nagaraj
Hello Jan,

>Please block both input and output. Corosync isn't able to handle
>byzantine faults.

Thanks. It results in a clean partition if I block both outgoing and incoming
UDP traffic to and from a given node.
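
Concretely, the two-way block looks like this (a sketch that mirrors the
one-way rule quoted below):

/sbin/iptables -I INPUT  -p udp -s 172.19.0.13 -j DROP
/sbin/iptables -I OUTPUT -p udp -d 172.19.0.13 -j DROP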

However, could you suggest the best way to handle real-world production
scenarios that may result in just one-way traffic loss?

Thanks again.
Prasad
On Tue, Apr 30, 2019 at 5:26 PM Jan Friesse  wrote:

> Prasad,
>
> > Hello :
> >
> > I have a 3 node corosync and pacemaker cluster and the nodes are:
> > Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
> >
> > Full list of resources:
> >
> >   Master/Slave Set: ms_mysql [p_mysql]
> >   Masters: [ SG-azfw2-189 ]
> >   Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> >
> > For my network partition test, I created a firewall rule on Node
> > SG-azfw2-190   to block all incoming udp traffic from node SG-azfw2-189
> > /sbin/iptables -I  INPUT -p udp -s 172.19.0.13 -j DROP
>
> Please block both input and output. Corosync isn't able to handle
> byzantine faults.
>
> Honza
>
> >
> > I dont think corosync is correctly detecting the partition as I am
> getting
> > different membership information from different nodes.
> > On node  SG-azfw2-189, I still see the members as:
> >
> > Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
> >
> > Full list of resources:
> >
> >   Master/Slave Set: ms_mysql [p_mysql]
> >   Masters: [ SG-azfw2-189 ]
> >   Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> >
> > whereas, on the node SG-azfw2-190, I see membership as
> >
> > Online: [ SG-azfw2-190 SG-azfw2-191 ]
> > OFFLINE: [ SG-azfw2-189 ]
> >
> > Full list of resources:
> >
> >   Master/Slave Set: ms_mysql [p_mysql]
> >   Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> >   Stopped: [ SG-azfw2-189 ]
> >
> > I expected that on node SG-azfw2-189, it should have detected that other
> 2
> > nodes have left. In the corosync logs for this node, I continuously see
> the
> > below messages:
> > Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.
> > Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the
> > rep.
> > Apr 30 11:00:03 corosync [MAIN  ] Storing new sequence id for ring 2e64
> > Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.
> > Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.
> > Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.
> > Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the
> > rep.
> > Apr 30 11:00:33 corosync [MAIN  ] Storing new sequence id for ring 2e68
> > Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.
> > Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.
> >
> > On the other nodes - I see messages like
> >   notice: pcmk_peer_update: Transitional membership event on ring 11888:
> > memb=2, new=0, lost=0
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > SG-azfw2-190 301994924
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > SG-azfw2-191 603984812
> > Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1
> > Apr 30 11:06:10 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 11888: memb=2, new=0, lost=0
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > SG-azfw2-190 301994924
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > SG-azfw2-191 603984812
> > Apr 30 11:06:10 corosync [SYNC  ] This node is within the primary
> component
> > and will provide service.
> > Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.
> >
> > Can the corosync experts please guide me on probable root cause for this
> or
> > ways to debug this further ? Help much appreciated.
> >
> > corosync version: 1.4.8.
> > pacemaker version:  1.1.14-8.el6_8.1
> >
> > Thanks!
> >
> >
> >

[ClusterLabs] Corosync unable to reach consensus for membership

2019-04-30 Thread Prasad Nagaraj
Hello :

I have a 3 node corosync and pacemaker cluster and the nodes are:
Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
 Masters: [ SG-azfw2-189 ]
 Slaves: [ SG-azfw2-190 SG-azfw2-191 ]

For my network partition test, I created a firewall rule on Node
SG-azfw2-190   to block all incoming udp traffic from node SG-azfw2-189
/sbin/iptables -I  INPUT -p udp -s 172.19.0.13 -j DROP

I dont think corosync is correctly detecting the partition as I am getting
different membership information from different nodes.
On node  SG-azfw2-189, I still see the members as:

Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
 Masters: [ SG-azfw2-189 ]
 Slaves: [ SG-azfw2-190 SG-azfw2-191 ]

whereas, on the node SG-azfw2-190, I see membership as

Online: [ SG-azfw2-190 SG-azfw2-191 ]
OFFLINE: [ SG-azfw2-189 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
 Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
 Stopped: [ SG-azfw2-189 ]

I expected that on node SG-azfw2-189, it should have detected that other 2
nodes have left. In the corosync logs for this node, I continuously see the
below messages:
Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.
Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the
rep.
Apr 30 11:00:03 corosync [MAIN  ] Storing new sequence id for ring 2e64
Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.
Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.
Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.
Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the
rep.
Apr 30 11:00:33 corosync [MAIN  ] Storing new sequence id for ring 2e68
Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.
Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.

On the other nodes - I see messages like
 notice: pcmk_peer_update: Transitional membership event on ring 11888:
memb=2, new=0, lost=0
Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:
SG-azfw2-190 301994924
Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:
SG-azfw2-191 603984812
Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1
Apr 30 11:06:10 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 11888: memb=2, new=0, lost=0
Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
SG-azfw2-190 301994924
Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
SG-azfw2-191 603984812
Apr 30 11:06:10 corosync [SYNC  ] This node is within the primary component
and will provide service.
Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.

Can the corosync experts please guide me on probable root cause for this or
ways to debug this further ? Help much appreciated.

corosync version: 1.4.8.
pacemaker version:  1.1.14-8.el6_8.1

Thanks!

Re: [ClusterLabs] Regarding Finalization Timer (I_ELECTION) just popped (1800000ms)

2018-10-30 Thread Prasad Nagaraj
Hi - Any help on this will be very much appreciated.

Thanks!
Prasad

On Sun, Oct 28, 2018 at 11:32 PM Prasad Nagaraj 
wrote:

> Hi :
>
> I came across  a strange situation in my cluster few days back. I was
> trying to replace one of the nodes in my 3 node cluster and I removed and
> added that node back following due process as per
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_removing_a_corosync_node.html.
> Updated corosync.conf on all nodes and restarted corosync and pacemaker on
> all the nodes. The cluster didnt elect a DC and also didnt report any
> activities for almost 28 mins as seen from below logs. Then I saw this
> message:
> Oct 22 22:35:14 [76417] vm85c4465533   crmd: info:
> crm_timer_popped:Finalization Timer (I_ELECTION) just popped (1800000ms)
> after which I could see further activities happening including DC election.
>
> I was not able to understand or identify any reasons for this behavior and
> also there is absolutely no documentation on this Finalization timer and
> what it means. Appreciate any help in terms of explaining what exactly this
> timer means and what could be reasons for this behavior. I have pasted a
> snippet of logs during the time here. I do have more logs and also logs
> from other nodes that I can share if required.
>
> Thanks in advance for the help!
> Prasad
>
>
> Oct 22 22:07:46 [76412] vm85c4465533cib: info:
> cib_process_request: Completed cib_modify operation for section nodes: OK
> (rc=0, origin=vm46890219c5/crm_attribute/4, version=0.100.0)
> Oct 22 22:07:46 [76412] vm85c4465533cib: info:
> cib_file_write_with_digest:  Reading cluster configuration file
> /var/lib/pacemaker/cib/cib.QC3Kay (digest:
> /var/lib/pacemaker/cib/cib.c74mjr)
> Oct 22 22:07:47 [76412] vm85c4465533cib: info:
> cib_file_backup: Archived previous version as
> /var/lib/pacemaker/cib/cib-96.raw
> Oct 22 22:07:47 [76412] vm85c4465533cib: info:
> cib_file_write_with_digest:  Wrote version 0.100.0 of the CIB to disk
> (digest: 3ab4bffdc9c372985cfe50ad3100131d)
> Oct 22 22:07:47 [76412] vm85c4465533cib: info:
> cib_file_write_with_digest:  Reading cluster configuration file
> /var/lib/pacemaker/cib/cib.7NqOEy (digest:
> /var/lib/pacemaker/cib/cib.gamHhs)
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_perform_op:  Diff: --- 0.100.0 2
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_perform_op:  Diff: +++ 0.101.0 (null)
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_perform_op:  +  /cib:  @epoch=101
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_perform_op:  ++ /cib/configuration/constraints: <rsc_location id="ms_mysql_member453" rsc="ms_mysql" score="0" node="vm46890219c5"/>
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_process_request: Completed cib_apply_diff operation for section 'all':
> OK (rc=0, origin=local/cibadmin/2, version=0.101.0)
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_file_backup: Archived previous version as
> /var/lib/pacemaker/cib/cib-97.raw
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_file_write_with_digest:  Wrote version 0.101.0 of the CIB to disk
> (digest: 7ffa91a8ca752581d4c1df13d287e467)
> Oct 22 22:07:48 [76412] vm85c4465533cib: info:
> cib_file_write_with_digest:  Reading cluster configuration file
> /var/lib/pacemaker/cib/cib.JkMV2C (digest:
> /var/lib/pacemaker/cib/cib.mLy23A)
> Oct 22 22:07:49 [76412] vm85c4465533cib: info:
> cib_perform_op:  Diff: --- 0.101.0 2
> Oct 22 22:07:49 [76412] vm85c4465533cib: info:
> cib_perform_op:  Diff: +++ 0.102.0 (null)
> Oct 22 22:07:49 [76412] vm85c4465533cib: info:
> cib_perform_op:  +  /cib:  @epoch=102
> Oct 22 22:07:49 [76412] vm85c4465533cib: info:
> cib_perform_op:  +
> /cib/configuration/resources/master[@id='ms_mysql']/meta_attributes[@id='ms_mysql-meta_attributes']/nvpair[@id='ms_mysql-meta_attributes-maintenance']:
> @value=false
> Oct 22 22:07:49 [76412] vm85c4465533cib: info:
> cib_process_request: Completed cib_modify operation for section resources:
> OK (rc=0, origin=local/crm_resource/6, version=0.102.0)
> Oct 22 22:07:49 [76412] vm85c4465533cib: info:
> cib_file_backup: Archived previous version as
> /var/lib/pacemaker/cib/cib-98.raw
> Oct 22 22:07:49 [76412] vm85c4465533cib: info:
> cib_file_write_with_digest:  Wrote version 0.102.0 of the CIB to disk
> (digest: dd6263dc226d652721b09ec702e37742)
> Oct 22 22:07:49 [76412] vm85c4465533  

[ClusterLabs] Regarding Finalization Timer (I_ELECTION) just popped (1800000ms)

2018-10-28 Thread Prasad Nagaraj
Hi :

I came across a strange situation in my cluster a few days back. I was
trying to replace one of the nodes in my 3-node cluster, and I removed and
added that node back following due process as per
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_removing_a_corosync_node.html.
I updated corosync.conf on all nodes and restarted corosync and pacemaker on
all the nodes. The cluster didn't elect a DC and also didn't report any
activity for almost 28 minutes, as seen from the logs below. Then I saw this
message:
Oct 22 22:35:14 [76417] vm85c4465533   crmd: info:
crm_timer_popped:Finalization Timer (I_ELECTION) just popped (1800000ms)
after which I could see further activities happening, including the DC election.

I was not able to understand or identify any reason for this behavior, and
there is also absolutely no documentation on this Finalization timer and
what it means. I would appreciate any help explaining what exactly this
timer means and what could cause this behavior. I have pasted a snippet of
the logs from that time here. I have more logs, as well as logs from the
other nodes, that I can share if required.

Thanks in advance for the help!
Prasad


Oct 22 22:07:46 [76412] vm85c4465533cib: info:
cib_process_request: Completed cib_modify operation for section nodes: OK
(rc=0, origin=vm46890219c5/crm_attribute/4, version=0.100.0)
Oct 22 22:07:46 [76412] vm85c4465533cib: info:
cib_file_write_with_digest:  Reading cluster configuration file
/var/lib/pacemaker/cib/cib.QC3Kay (digest:
/var/lib/pacemaker/cib/cib.c74mjr)
Oct 22 22:07:47 [76412] vm85c4465533cib: info:
cib_file_backup: Archived previous version as
/var/lib/pacemaker/cib/cib-96.raw
Oct 22 22:07:47 [76412] vm85c4465533cib: info:
cib_file_write_with_digest:  Wrote version 0.100.0 of the CIB to disk
(digest: 3ab4bffdc9c372985cfe50ad3100131d)
Oct 22 22:07:47 [76412] vm85c4465533cib: info:
cib_file_write_with_digest:  Reading cluster configuration file
/var/lib/pacemaker/cib/cib.7NqOEy (digest:
/var/lib/pacemaker/cib/cib.gamHhs)
Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op:
Diff: --- 0.100.0 2
Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op:
Diff: +++ 0.101.0 (null)
Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op:
+  /cib:  @epoch=101
Oct 22 22:07:48 [76412] vm85c4465533cib: info: cib_perform_op:
++ /cib/configuration/constraints:  <rsc_location id="ms_mysql_member453" rsc="ms_mysql" score="0" node="vm46890219c5"/>
Oct 22 22:07:48 [76412] vm85c4465533cib: info:
cib_process_request: Completed cib_apply_diff operation for section 'all':
OK (rc=0, origin=local/cibadmin/2, version=0.101.0)
Oct 22 22:07:48 [76412] vm85c4465533cib: info:
cib_file_backup: Archived previous version as
/var/lib/pacemaker/cib/cib-97.raw
Oct 22 22:07:48 [76412] vm85c4465533cib: info:
cib_file_write_with_digest:  Wrote version 0.101.0 of the CIB to disk
(digest: 7ffa91a8ca752581d4c1df13d287e467)
Oct 22 22:07:48 [76412] vm85c4465533cib: info:
cib_file_write_with_digest:  Reading cluster configuration file
/var/lib/pacemaker/cib/cib.JkMV2C (digest:
/var/lib/pacemaker/cib/cib.mLy23A)
Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op:
Diff: --- 0.101.0 2
Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op:
Diff: +++ 0.102.0 (null)
Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op:
+  /cib:  @epoch=102
Oct 22 22:07:49 [76412] vm85c4465533cib: info: cib_perform_op:
+
/cib/configuration/resources/master[@id='ms_mysql']/meta_attributes[@id='ms_mysql-meta_attributes']/nvpair[@id='ms_mysql-meta_attributes-maintenance']:
@value=false
Oct 22 22:07:49 [76412] vm85c4465533cib: info:
cib_process_request: Completed cib_modify operation for section resources:
OK (rc=0, origin=local/crm_resource/6, version=0.102.0)
Oct 22 22:07:49 [76412] vm85c4465533cib: info:
cib_file_backup: Archived previous version as
/var/lib/pacemaker/cib/cib-98.raw
Oct 22 22:07:49 [76412] vm85c4465533cib: info:
cib_file_write_with_digest:  Wrote version 0.102.0 of the CIB to disk
(digest: dd6263dc226d652721b09ec702e37742)
Oct 22 22:07:49 [76412] vm85c4465533cib: info:
cib_file_write_with_digest:  Reading cluster configuration file
/var/lib/pacemaker/cib/cib.DOunDE (digest:
/var/lib/pacemaker/cib/cib.MYBrjE)
Oct 22 22:35:14 [76417] vm85c4465533   crmd: info:
crm_timer_popped:Finalization Timer (I_ELECTION) just popped (1800000ms)
Oct 22 22:35:14 [76417] vm85c4465533   crmd: info:
do_state_transition: State transition S_FINALIZE_JOIN -> S_ELECTION [
input=I_ELECTION cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct 22 22:35:14 [76417] vm85c4465533   crmd: info: update_dc:
 Unset DC. Was vm85c4465533
Oct 22 22:35:14 [76417] vm85c4465533   crmd: info:

Re: [ClusterLabs] Understanding the behavior of pacemaker crash

2018-09-28 Thread Prasad Nagaraj
Hi Ken - Only if I turn off corosync on the node [where I crashed
pacemaker] are the other nodes able to detect this and mark the node as OFFLINE.
Do you have any other guidance or insights into this?

Thanks
Prasad

On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj 
wrote:

> Hi Ken - Thanks for the response. Pacemaker is still not running on that
> node. So I am still wondering what could be the issue ? Any other
> configurations or logs should I be sharing to understand this more ?
>
> Thanks!
>
> On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot  wrote:
>
>> On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
>> > Hello - I was trying to understand the behavior or cluster when
>> > pacemaker crashes on one of the nodes. So I hard killed pacemakerd
>> > and its related processes.
>> >
>> > ---
>> > -
>> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root  74022  1  0 07:53 pts/000:00:00 pacemakerd
>> > 189   74028  74022  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/cib
>> > root  74029  74022  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/stonithd
>> > root  74030  74022  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/lrmd
>> > 189   74031  74022  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/attrd
>> > 189   74032  74022  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/pengine
>> > 189   74033  74022  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/crmd
>> >
>> > root  75228  50092  0 07:54 pts/000:00:00 grep pacemaker
>> > [root@SG-mysqlold-907 azureuser]# kill -9 74022
>> >
>> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root  74030  1  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/lrmd
>> > 189   74032  1  0 07:53 ?00:00:00
>> > /usr/libexec/pacemaker/pengine
>> >
>> > root  75303  50092  0 07:55 pts/000:00:00 grep pacemaker
>> > [root@SG-mysqlold-907 azureuser]# kill -9 74030
>> > [root@SG-mysqlold-907 azureuser]# kill -9 74032
>> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > root  75332  50092  0 07:55 pts/000:00:00 grep pacemaker
>> >
>> > [root@SG-mysqlold-907 azureuser]# crm satus
>> > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
>> > Transport endpoint is not connected
>> > ---
>> > --
>> >
>> > However, this does not seem to be having any effect on the cluster
>> > status from other nodes
>> > ---
>> > 
>> >
>> > [root@SG-mysqlold-909 azureuser]# crm status
>> > Last updated: Thu Sep 27 07:56:17 2018  Last change: Thu Sep
>> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > Stack: classic openais (with plugin)
>> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
>> > partition with quorum
>> > 3 nodes and 3 resources configured, 3 expected votes
>> >
>> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>>
>> It most definitely would make the node offline, and if fencing were
>> configured, the rest of the cluster would fence the node to make sure
>> it's safely down.
>>
>> I see you're using the old corosync 1 plugin. I suspect what happened
>> in this case is that corosync noticed the plugin died and restarted it
>> quickly enough that it had rejoined by the time you checked the status
>> elsewhere.
>>
>> >
>> > Full list of resources:
>> >
>> >  Master/Slave Set: ms_mysql [p_mysql]
>> >  Masters: [ SG-mysqlold-909 ]
>> >  Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> >
>> >
>> > [root@SG-mysqlold-908 azureuser]# crm status
>> > Last updated: Thu Sep 27 07:56:08 2018  Last change: Thu Sep
>> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > Stack: classic openais (with plugin)
>> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
>> > partition with quorum
>> > 3 nodes and 3 resources configured, 3 expected votes
>> >
>> >

Re: [ClusterLabs] Understanding the behavior of pacemaker crash

2018-09-27 Thread Prasad Nagaraj
Hi Ken - Thanks for the response. Pacemaker is still not running on that
node. So I am still wondering what could be the issue? Are there any other
configurations or logs I should be sharing to understand this more?

Thanks!

On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot  wrote:

> On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
> > Hello - I was trying to understand the behavior or cluster when
> > pacemaker crashes on one of the nodes. So I hard killed pacemakerd
> > and its related processes.
> >
> > ---
> > -
> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root  74022  1  0 07:53 pts/000:00:00 pacemakerd
> > 189   74028  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/cib
> > root  74029  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/stonithd
> > root  74030  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/lrmd
> > 189   74031  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/attrd
> > 189   74032  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/pengine
> > 189   74033  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/crmd
> >
> > root  75228  50092  0 07:54 pts/000:00:00 grep pacemaker
> > [root@SG-mysqlold-907 azureuser]# kill -9 74022
> >
> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root  74030  1  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/lrmd
> > 189   74032  1  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/pengine
> >
> > root  75303  50092  0 07:55 pts/000:00:00 grep pacemaker
> > [root@SG-mysqlold-907 azureuser]# kill -9 74030
> > [root@SG-mysqlold-907 azureuser]# kill -9 74032
> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root  75332  50092  0 07:55 pts/000:00:00 grep pacemaker
> >
> > [root@SG-mysqlold-907 azureuser]# crm satus
> > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
> > Transport endpoint is not connected
> > ---
> > --
> >
> > However, this does not seem to be having any effect on the cluster
> > status from other nodes
> > ---
> > 
> >
> > [root@SG-mysqlold-909 azureuser]# crm status
> > Last updated: Thu Sep 27 07:56:17 2018  Last change: Thu Sep
> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > Stack: classic openais (with plugin)
> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> > partition with quorum
> > 3 nodes and 3 resources configured, 3 expected votes
> >
> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>
> It most definitely would make the node offline, and if fencing were
> configured, the rest of the cluster would fence the node to make sure
> it's safely down.
>
> I see you're using the old corosync 1 plugin. I suspect what happened
> in this case is that corosync noticed the plugin died and restarted it
> quickly enough that it had rejoined by the time you checked the status
> elsewhere.
>
> >
> > Full list of resources:
> >
> >  Master/Slave Set: ms_mysql [p_mysql]
> >  Masters: [ SG-mysqlold-909 ]
> >  Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> >
> >
> > [root@SG-mysqlold-908 azureuser]# crm status
> > Last updated: Thu Sep 27 07:56:08 2018  Last change: Thu Sep
> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > Stack: classic openais (with plugin)
> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> > partition with quorum
> > 3 nodes and 3 resources configured, 3 expected votes
> >
> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> >
> > Full list of resources:
> >
> >  Master/Slave Set: ms_mysql [p_mysql]
> >  Masters: [ SG-mysqlold-909 ]
> >  Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> >
> > ---
> > ---
> >
> > I am bit surprised that other nodes are not able to detect that
> > pacemaker is down on one of the nodes - SG-mysqlold-907
> >
> > Even if I kill pacemaker on the node which is a DC -

[ClusterLabs] Understanding the behavior of pacemaker crash

2018-09-27 Thread Prasad Nagaraj
Hello - I was trying to understand the behavior of the cluster when pacemaker
crashes on one of the nodes. So I hard-killed pacemakerd and its related
processes.


[root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
root  74022  1  0 07:53 pts/000:00:00 pacemakerd
189   74028  74022  0 07:53 ?00:00:00 /usr/libexec/pacemaker/cib
root  74029  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/stonithd
root  74030  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/lrmd
189   74031  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/attrd
189   74032  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/pengine
189   74033  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/crmd

root  75228  50092  0 07:54 pts/000:00:00 grep pacemaker
[root@SG-mysqlold-907 azureuser]# kill -9 74022

[root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
root  74030  1  0 07:53 ?00:00:00
/usr/libexec/pacemaker/lrmd
189   74032  1  0 07:53 ?00:00:00
/usr/libexec/pacemaker/pengine

root  75303  50092  0 07:55 pts/000:00:00 grep pacemaker
[root@SG-mysqlold-907 azureuser]# kill -9 74030
[root@SG-mysqlold-907 azureuser]# kill -9 74032
[root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
root  75332  50092  0 07:55 pts/000:00:00 grep pacemaker

[root@SG-mysqlold-907 azureuser]# crm satus
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport
endpoint is not connected
-

However, this does not seem to have any effect on the cluster status
as seen from the other nodes
---

[root@SG-mysqlold-909 azureuser]# crm status
Last updated: Thu Sep 27 07:56:17 2018  Last change: Thu Sep 27
07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
Stack: classic openais (with plugin)
Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition
with quorum
3 nodes and 3 resources configured, 3 expected votes

Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
 Masters: [ SG-mysqlold-909 ]
 Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]


[root@SG-mysqlold-908 azureuser]# crm status
Last updated: Thu Sep 27 07:56:08 2018  Last change: Thu Sep 27
07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
Stack: classic openais (with plugin)
Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition
with quorum
3 nodes and 3 resources configured, 3 expected votes

Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
 Masters: [ SG-mysqlold-909 ]
 Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]

--

I am a bit surprised that the other nodes are not able to detect that pacemaker
is down on one of the nodes - SG-mysqlold-907.

Even if I kill pacemaker on the node which is the DC, I observe the same
behavior, with the rest of the nodes not detecting that the DC is down.

Could someone explain what the expected behavior is in these cases?

I am using corosync 1.4.7 and pacemaker 1.1.14
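
For what it's worth, a quick way to cross-check what each layer believes on a
given node (a sketch; both commands are available with corosync 1.4 /
pacemaker 1.1):

corosync-cfgtool -s     # ring status as corosync sees it locally
crm_mon -1              # one-shot cluster status as pacemaker sees it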

Thanks in advance
Prasad


[ClusterLabs] 'crm node standby' command failing with "Error performing operation: Communication error on send . Return code is 70"

2018-09-21 Thread Prasad Nagaraj
Hi -

Yesterday, I noticed that when I was trying to execute the 'crm node standby'
command on one of my cluster nodes, it was failing with:

"Error performing operation: Communication error on send . Return code is
70"

My corosync logs had these entries during that time:

Sep 20 22:14:54 [4454] vm5c336912f1   crmd:   notice:
throttle_handle_load: High CPU load detected: 1.85
Sep 20 22:14:57 [4449] vm5c336912f1cib: info:
cib_process_ping: Reporting our current digest to vmb546073338:
8fe67fcfcd20515c246c225a124a8902 for 0.481.2 (0x2742230 0)
Sep 20 22:15:09 [4449] vm5c336912f1cib: info:
cib_process_request:  Forwarding cib_modify operation for section nodes to
master (origin=local/crm_attribute/4)
Sep 20 22:15:24 [4454] vm5c336912f1   crmd:   notice:
throttle_handle_load: High CPU load detected: 1.64
Sep 20 22:15:54 [4454] vm5c336912f1   crmd: info:
throttle_handle_load: Moderate CPU load detected: 0.99
Sep 20 22:15:54 [4454] vm5c336912f1   crmd: info:
throttle_send_command:New throttle mode: 0010 (was 0100)
Sep 20 22:16:24 [4454] vm5c336912f1   crmd: info:
throttle_send_command:New throttle mode: 0001 (was 0010)
Sep 20 22:16:54 [4454] vm5c336912f1   crmd: info:
throttle_send_command:New throttle mode:  (was 0001)
Sep 20 22:17:09 [4449] vm5c336912f1cib: info:
cib_process_request:  Forwarding cib_modify operation for section nodes to
master (origin=local/crm_attribute/4)
Sep 20 22:19:10 [4449] vm5c336912f1cib: info:
cib_process_request:  Forwarding cib_modify operation for section nodes to
master (origin=local/crm_attribute/4)
Sep 20 22:23:08 [4449] vm5c336912f1cib: info:
cib_perform_op:   Diff: --- 0.481.2 2
Sep 20 22:23:08 [4449] vm5c336912f1cib: info:
cib_perform_op:   Diff: +++ 0.482.0 9bacc862b8713430c81ea91694942a41
Sep 20 22:23:08 [4449] vm5c336912f1cib: info:
cib_perform_op:   +  /cib:  @epoch=482, @num_updates=0


Is the above behavior due to pacemaker thinking that the cluster is highly
loaded and therefore throttling the execution of commands? What is the best
way to resolve or work around such problems? We do have high I/O load on our
cluster, which hosts a MySQL database.

Also from the thread,
https://lists.clusterlabs.org/pipermail/users/2017-May/005702.html

it was asked:

> There is not much detail about “load-threshold”.
> Please can someone share steps or any commands to modify “load-threshold”.

Could someone advise whether this is the way to control the throttling of
cluster operations and how to set this parameter?
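
To the best of my knowledge, load-threshold is an ordinary cluster (crmd)
option, defaulting to 80%, so it can be read and changed like any other
cluster property; a sketch using crm_attribute:

crm_attribute --type crm_config --name load-threshold --query
crm_attribute --type crm_config --name load-threshold --update 90%

Raising it only relaxes the throttling, though; it does not address the
underlying CPU/I/O saturation.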

Thanks in advance,
Prasad


Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-22 Thread Prasad Nagaraj
Hi - My systems are single-core CPU VMs running on the Azure platform. I am
running MySQL on the nodes, which does generate high I/O load. And my bad, I
meant to say 'High CPU load detected' is logged by crmd and not corosync.
Corosync logs messages like 'Corosync main process was not scheduled
for...', which in turn sometimes cause the pacemaker monitor action to
fail. Is increasing the token timeout a solution for this, or are there
other ways?

Thanks for the help
Prasad
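
One way to check whether corosync itself is being starved of CPU time (a
sketch, assuming the sysstat pidstat tool is available):

pidstat -p $(pidof corosync) 5                      # per-process CPU usage every 5 seconds
ps -o pid,cls,rtprio,ni,comm -p $(pidof corosync)   # scheduling class/priority of corosync

Corosync 2.x runs at real-time priority, which is why it tolerates load
better; on 1.4.x a normal-priority corosync can miss the token under heavy
I/O wait.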

On Wed, 22 Aug 2018, 11:55 am Jan Friesse,  wrote:

> Prasad,
>
> > Thanks Ken and Ulrich. There is definitely high IO on the system with
> > sometimes IOWAIT s of upto 90%
> > I have come across some previous posts that IOWAIT is also considered as
> > CPU load by Corosync. Is this true ? Does having high IO may lead
> corosync
> > complain as in " Corosync main process was not scheduled for..." or "Hi
> > CPU load detected.." ?
>
> Yes it can.
>
> Corosync never logs "Hi CPU load detected...".
>
> >
> > I will surely monitor the system more.
>
> Is that system VM or physical machine? Because " Corosync main process
> was not scheduled for..." is usually happening on VMs where hosts are
> highly overloaded.
>
> Honza
>
> >
> > Thanks for your help.
> > Prasad
> >
> >
> >
> > On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot 
> wrote:
> >
> >> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> >>>>>> Prasad Nagaraj  schrieb am
> >>>>>> 21.08.2018 um 11:42 in
> >>>
> >>> Nachricht
> >>> :
> >>>> Hi Ken - Thanks for you response.
> >>>>
> >>>> We do have seen messages in other cases like
> >>>> corosync [MAIN  ] Corosync main process was not scheduled for
> >>>> 17314.4746 ms
> >>>> (threshold is 8000. ms). Consider token timeout increase.
> >>>> corosync [TOTEM ] A processor failed, forming new configuration.
> >>>>
> >>>> Is this the indication of a failure due to CPU load issues and will
> >>>> this
> >>>> get resolved if I upgrade to Corosync 2.x series ?
> >>
> >> Yes, most definitely this is a CPU issue. It means corosync isn't
> >> getting enough CPU cycles to handle the cluster token before the
> >> timeout is reached.
> >>
> >> Upgrading may indeed help, as recent versions ensure that corosync runs
> >> with real-time priority in the kernel, and thus are more likely to get
> >> CPU time when something of lower priority is consuming all the CPU.
> >>
> >> But of course, there is some underlying problem that should be
> >> identified and addressed. Figure out what's maxing out the CPU or I/O.
> >> Ulrich's monitoring suggestion is a good start.
> >>
> >>> Hi!
> >>>
> >>> I'd strongly recommend starting monitoring on your nodes, at least
> >>> until you know what's going on. The good old UNIX sa (sysstat
> >>> package) could be a starting point. I'd monitor CPU idle
> >>> specifically. Then go for 100% device utilization, then look for
> >>> network bottlenecks...
> >>>
> >>> A new corosync release cannot fix those, most likely.
> >>>
> >>> Regards,
> >>> Ulrich
> >>>
> >>>>
> >>>> In any case, for the current scenario, we did not see any
> >>>> scheduling
> >>>> related messages.
> >>>>
> >>>> Thanks for your help.
> >>>> Prasad
> >>>>
> >>>> On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
> >>>> wrote:
> >>>>
> >>>>> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> >>>>>> Hi:
> >>>>>>
> >>>>>> One of these days, I saw a spurious node loss on my 3-node
> >>>>>> corosync
> >>>>>> cluster with following logged in the corosync.log of one of the
> >>>>>> nodes.
> >>>>>>
> >>>>>> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> >>>>>> Transitional membership event on ring 32: memb=2, new=0, lost=1
> >>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> >>>>>> vm02d780875f 67114156
> >>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> >>>>

Re: [ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

2018-08-21 Thread Prasad Nagaraj
Thanks Ken and Ulrich. There is definitely high I/O on the system, with
IOWAIT sometimes up to 90%.
I have come across some previous posts saying that IOWAIT is also counted as
CPU load by Corosync. Is this true? Can high I/O lead corosync to
complain as in "Corosync main process was not scheduled for..." or "High
CPU load detected..."?

I will surely monitor the system more.
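
A minimal way to start that monitoring, along the lines of Ulrich's sysstat
suggestion (a sketch, assuming the sysstat package is installed):

sar -u 5        # CPU utilization, including %iowait, every 5 seconds
iostat -x 5     # extended per-device I/O statistics (watch for %util near 100)
sar -n DEV 5    # per-interface network statistics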

Thanks for your help.
Prasad



On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot  wrote:

> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> > > > > Prasad Nagaraj  schrieb am
> > > > > 21.08.2018 um 11:42 in
> >
> > Nachricht
> > :
> > > Hi Ken - Thanks for you response.
> > >
> > > We do have seen messages in other cases like
> > > corosync [MAIN  ] Corosync main process was not scheduled for
> > > 17314.4746 ms
> > > (threshold is 8000. ms). Consider token timeout increase.
> > > corosync [TOTEM ] A processor failed, forming new configuration.
> > >
> > > Is this the indication of a failure due to CPU load issues and will
> > > this
> > > get resolved if I upgrade to Corosync 2.x series ?
>
> Yes, most definitely this is a CPU issue. It means corosync isn't
> getting enough CPU cycles to handle the cluster token before the
> timeout is reached.
>
> Upgrading may indeed help, as recent versions ensure that corosync runs
> with real-time priority in the kernel, and thus are more likely to get
> CPU time when something of lower priority is consuming all the CPU.
>
> But of course, there is some underlying problem that should be
> identified and addressed. Figure out what's maxing out the CPU or I/O.
> Ulrich's monitoring suggestion is a good start.
>
> > Hi!
> >
> > I'd strongly recommend starting monitoring on your nodes, at least
> > until you know what's going on. The good old UNIX sa (sysstat
> > package) could be a starting point. I'd monitor CPU idle
> > specifically. Then go for 100% device utilization, then look for
> > network bottlenecks...
> >
> > A new corosync release cannot fix those, most likely.
> >
> > Regards,
> > Ulrich
> >
> > >
> > > In any case, for the current scenario, we did not see any
> > > scheduling
> > > related messages.
> > >
> > > Thanks for your help.
> > > Prasad
> > >
> > > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot 
> > > wrote:
> > >
> > > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> > > > > Hi:
> > > > >
> > > > > One of these days, I saw a spurious node loss on my 3-node
> > > > > corosync
> > > > > cluster with following logged in the corosync.log of one of the
> > > > > nodes.
> > > > >
> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > > Transitional membership event on ring 32: memb=2, new=0, lost=1
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > > vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > > vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> > > > > vm728316982d 201331884
> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > > Stable
> > > > > membership event on ring 32: memb=2, new=0, lost=0
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > > vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > > vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > > ais_mark_unseen_peer_dead:
> > > > > Node vm728316982d was not seen in the previous transition
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> > > > > 201331884/vm728316982d is now: lost
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > > send_member_notification:
> > > > > Sending membership update 32 to 3 children
> > > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
> > > > > the
> > > > > membership and a new membership was formed.
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
> > > > > plugin_handle_membership: Membersh

[ClusterLabs] Spurious node loss in corosync cluster

2018-08-19 Thread Prasad Nagaraj
Hi:

A few days ago, I saw a spurious node loss on my 3-node corosync cluster,
with the following logged in the corosync.log of one of the nodes.

Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 32: memb=2, new=0, lost=1
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
vm728316982d 201331884
Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 32: memb=2, new=0, lost=0
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info: ais_mark_unseen_peer_dead: Node
vm728316982d was not seen in the previous transition
Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
201331884/vm728316982d is now: lost
Aug 18 12:40:25 corosync [pcmk  ] info: send_member_notification: Sending
membership update 32 to 3 children
Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
peer_update_callback: vm728316982d is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f   crmd:  warning:
match_down_event: No match for shutdown action on vm728316982d
Aug 18 12:40:25 [4548] vmfa2757171f   crmd:   notice:
peer_update_callback: Stonith/shutdown of vm728316982d not matched
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
crm_update_peer_join: peer_update_callback: Node vm728316982d[201331884] -
join-6 phase 4 -> 0
Aug 18 12:40:25 [4548] vmfa2757171f   crmd: info:
abort_transition_graph:   Transition aborted: Node failure
(source=peer_update_callback:240, 1)
Aug 18 12:40:25 [4543] vmfa2757171fcib: info:
plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4543] vmfa2757171fcib:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4543] vmfa2757171fcib:   notice:
crm_reap_dead_member: Removing vm728316982d/201331884 from the membership
list
Aug 18 12:40:25 [4543] vmfa2757171fcib:   notice: reap_crm_member:
Purged 1 peers with id=201331884 and/or uname=vm728316982d from the
membership cache
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
crm_reap_dead_member: Removing vm728316982d/201331884 from the membership
list
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice: reap_crm_member:
Purged 1 peers with id=201331884 and/or uname=vm728316982d from the
membership cache

However, within seconds, the node was able to join back.

Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 36: memb=3, new=1, lost=0
Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
201331884/vm728316982d is now: member
Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
vm728316982d 201331884


But this was enough time for the cluster to get into a split-brain kind of
situation, with a resource on the node vm728316982d being stopped because
of this node-loss detection.

Could anyone help with whether this could happen due to some transient network
distortion or the like?
Are there any configuration settings that can be applied in corosync.conf
so that the cluster is more resilient to such temporary distortions?

Currently my corosync.conf looks like this:

compatibility: whitetank

totem {
    version: 2
    secauth: on
    threads: 0
    interface {
        member {
            memberaddr: 172.20.0.4
        }
        member {
            memberaddr: 172.20.0.9
        }
        member {
            memberaddr: 172.20.0.12
        }

        bindnetaddr: 172.20.0.12
        ringnumber: 0
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
    token: 1
    token_retransmits_before_loss_const: 10
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

service {
    name: pacemaker
    ver: 1
}

amf {
    mode: disabled
}
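
On the resilience question above: the usual knobs are the totem timeouts. A
sketch with purely illustrative values (corosync 1.x defaults are token: 1000
and consensus: 1.2 * token):

totem {
    ...
    token: 10000                             # ms before the token is declared lost
    token_retransmits_before_loss_const: 10
    consensus: 12000                         # ms; keep it at >= 1.2 * token
}

Larger values let the cluster ride out short network or scheduling hiccups,
at the cost of slower detection of real failures.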

Thanks in advance for the help.
Prasad

Re: [ClusterLabs] corosync not able to form cluster

2018-06-08 Thread Prasad Nagaraj
Hi Christine - Thanks for looking into the logs.
I also see that the node eventually comes out of GATHER state here:

Jun 07 16:56:10 corosync [TOTEM ] entering GATHER state from 0.
Jun 07 16:56:10 corosync [TOTEM ] Creating commit token because I am the rep.

Does that mean it has timed out or given up and then come out?

Second point: I did see some unexpected entries when I ran tcpdump on the
node coro.4 [it's also pasted in one of the earlier threads]. You can see
that it was receiving messages like:

10:23:17.117347 IP 172.22.0.13.50468 > 172.22.0.4.netsupport: UDP, length
332
10:23:17.140960 IP 172.22.0.8.50438 > 172.22.0.4.netsupport: UDP, length 82
10:23:17.141319 IP 172.22.0.6.38535 > 172.22.0.4.netsupport: UDP, length 156

Please note that 172.22.0.8 and 172.22.0.6 are not part of my group, and I
was wondering why these messages are arriving.
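
One way to see exactly what those unexpected hosts are sending (a sketch;
5405 is corosync's default mcastport and the port tcpdump is printing as
"netsupport"):

tcpdump -ni eth0 'udp port 5405 and (host 172.22.0.8 or host 172.22.0.6)'

That helps confirm whether another corosync instance on the subnet is
addressing these nodes.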

Thanks!

On Fri, Jun 8, 2018 at 2:34 PM, Christine Caulfield 
wrote:

> On 07/06/18 18:32, Prasad Nagaraj wrote:
> > Hi Christine - Got it:)
> >
> > I have collected few seconds of debug logs from all nodes after startup.
> > Please find them attached.
> > Please let me know if this will help us to identify rootcause.
> >
>
> The problem is on the node coro.4 - it never gets out of the JOIN
>
> "Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11."
>
> process so something is wrong on that node, either a rogue routing table
> entry, dangling iptables rule or even a broken NIC.
>
> Chrissie
>
>


Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Prasad Nagaraj
Hi - As you can see in the corosync.conf details, I have already set
debug: on.

Thanks!

On Thu, 7 Jun 2018, 8:03 pm Christine Caulfield, 
wrote:

> On 07/06/18 15:24, Prasad Nagaraj wrote:
> >
> > No iptables or otherwise firewalls are setup on these nodes.
> >
> > One observation is that each node sends messages on with its own ring
> > sequence number which is not converging.. I have seen that in a good
> > cluster, when nodes respond with same sequence number, the membership is
> > automatically formed. But in our case, that is not the case.
> >
>
> That's just a side-effect of the cluster not forming. It's not causing
> it. Can you enable full corosync debugging (just add debug:on to the end
> of the logging {} stanza) and see if that has any more useful
> information (I only need the corosync bits, not the pcmk ones)
>
> Chrissie
>
> > Example: we can see that one node sends
> > Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> > membership event on ring 71084: memb=1, new=0, lost=0
> > .
> > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> > membership event on ring 71096: memb=1, new=0, lost=0
> > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 71096: memb=1, new=0, lost=0
> >
> > other node sends messages with its own numbers
> > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> > membership event on ring 71088: memb=1, new=0, lost=0
> > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 71088: memb=1, new=0, lost=0
> > ...
> > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> > membership event on ring 71100: memb=1, new=0, lost=0
> > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 71100: memb=1, new=0, lost=0
> >
> > Any idea why this happens, and why the seq. numbers from different nodes
> > are not converging ?
> >
> > Thanks!
> >
> >
> >
> >
> >


Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Prasad Nagaraj
No iptables or other firewall rules are set up on these nodes.

One observation is that each node sends messages with its own ring
sequence number, which is not converging. I have seen that in a good
cluster, when nodes respond with the same sequence number, the membership
is formed automatically. But in our case, that is not happening.

Example: we can see that one node sends
Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71084: memb=1, new=0, lost=0
.
Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71096: memb=1, new=0, lost=0
Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71096: memb=1, new=0, lost=0

The other node sends messages with its own numbers:
Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71088: memb=1, new=0, lost=0
Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71088: memb=1, new=0, lost=0
...
Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71100: memb=1, new=0, lost=0
Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71100: memb=1, new=0, lost=0

Any idea why this happens, and why the sequence numbers from different nodes
are not converging?

Thanks!


Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Prasad Nagaraj
 joined or left the membership
and a new membership was formed.
Jun 07 10:49:40 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 78152: memb=1, new=0, lost=0
Jun 07 10:49:40 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 10:49:40 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 78152: memb=1, new=0, lost=0
Jun 07 10:49:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 10:49:40 corosync [pcmk  ] info: update_member: 0x1576f20 Node
184555180 ((null)) born on: 78140
Jun 07 10:49:40 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 10:49:40 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.11) ; members(old:0 left:0)
Jun 07 10:49:40 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 07 10:49:52 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 78160: memb=1, new=0, lost=0
Jun 07 10:49:52 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 10:49:52 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 78160: memb=1, new=0, lost=0
Jun 07 10:49:52 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 10:49:52 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.



On Thu, Jun 7, 2018 at 4:01 PM, Prasad Nagaraj 
wrote:

> Hi Christine -
>
> Thanks for looking into this and here are the details.
> All the nodes are pingable from each other and actively exchanging
> corosync packets from each other as seen from tcpdump
>
> Here is the ifconfig out from each of the node
> # ifconfig
> eth0  Link encap:Ethernet  HWaddr 00:0D:3A:03:35:64
>   inet addr:172.22.0.4  Bcast:172.22.0.255  Mask:255.255.255.0
>   inet6 addr: fe80::20d:3aff:fe03:3564/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:3721169 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:2780455 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:1229505889 (1.1 GiB)  TX bytes:982021535 (936.5 MiB)
>
> loLink encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:65536  Metric:1
>   RX packets:1367018 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:1367018 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:459591075 (438.3 MiB)  TX bytes:459591075 (438.3 MiB)
> 
> -
> # ifconfig
> eth0  Link encap:Ethernet  HWaddr 00:0D:3A:03:38:D7
>   inet addr:172.22.0.11  Bcast:172.22.0.255  Mask:255.255.255.0
>   inet6 addr: fe80::20d:3aff:fe03:38d7/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:4052226 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:3744671 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:1930027786 (1.7 GiB)  TX bytes:1180930029 (1.0 GiB)
>
> loLink encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:65536  Metric:1
>   RX packets:1394930 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:1394930 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:508170210 (484.6 MiB)  TX bytes:508170210 (484.6 MiB)
> 
> --
>
> # ifconfig
> eth0  Link encap:Ethernet  HWaddr 00:0D:3A:04:06:F6
>   inet addr:172.22.0.13  Bcast:172.22.0.255  Mask:255.255.255.0
>   inet6 addr: fe80::20d:3aff:fe04:6f6/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:3974698 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:3891546 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:1903617077 (1.7 GiB)  TX bytes:1234961001 (1.1 GiB)
>
> loLink encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:65536  Metric:1
>   RX packets:1503643 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:1503643 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:541177718 (516.1 

Re: [ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Prasad Nagaraj
22.0.4.netsupport: UDP, length
332
10:25:30.827563 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length
332
10:25:30.850832 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length
376
10:25:30.863531 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP, length
332
10:25:30.886664 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP, length
332
10:25:30.886691 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length
332
10:25:30.910820 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length
376
10:25:30.923403 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP, length
332
10:25:30.946507 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP, length
332
10:25:30.946531 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length
332
10:25:30.970931 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length
376
10:25:30.983055 IP 172.22.0.13.57332 > 172.22.0.11.netsupport: UDP, length
332
10:25:31.006306 IP 172.22.0.11.54545 > 172.22.0.4.netsupport: UDP, length
332
10:25:31.006339 IP 172.22.0.11.44864 > 172.22.0.13.netsupport: UDP, length
332
10:25:31.030207 IP 172.22.0.4.34060 > 172.22.0.11.netsupport: UDP, length
376


And here is the lsof output for each node.
lsof -i | grep corosync
corosync  47873  root   10u  IPv4 1193147  0t0  UDP 172.22.0.4:
netsupport
corosync  47873  root   13u  IPv4 1193151  0t0  UDP 172.22.0.4:45846
corosync  47873  root   14u  IPv4 1193152  0t0  UDP 172.22.0.4:34060
corosync  47873  root   15u  IPv4 1193153  0t0  UDP 172.22.0.4:40755

lsof -i | grep corosync
corosync  11039  root   10u  IPv4   54862  0t0  UDP 172.22.0.13:
netsupport
corosync  11039  root   13u  IPv4   54869  0t0  UDP
172.22.0.13:50468
corosync  11039  root   14u  IPv4   54870  0t0  UDP
172.22.0.13:57332
corosync  11039  root   15u  IPv4   54871  0t0  UDP
172.22.0.13:46460

 lsof -i | grep corosync
corosync  75188  root   10u  IPv4 1582737  0t0  UDP 172.22.0.11:
netsupport
corosync  75188  root   13u  IPv4 1582741  0t0  UDP
172.22.0.11:54545
corosync  75188  root   14u  IPv4 1582742  0t0  UDP
172.22.0.11:53213
corosync  75188  root   15u  IPv4 1582743  0t0  UDP
172.22.0.11:44864
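
("netsupport" above is simply the /etc/services name for UDP port 5405,
corosync's default mcastport, so each daemon is listening where expected.)
A quick sanity check that the port and bind address are configured
consistently on all three nodes - assuming the stock configuration path -
is:

grep -E 'bindnetaddr|mcastaddr|mcastport' /etc/corosync/corosync.conf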


Thanks!



On Thu, Jun 7, 2018 at 3:33 PM, Christine Caulfield wrote:

> On 07/06/18 09:21, Prasad Nagaraj wrote:
> > Hi - I am running corosync on  3 nodes of CentOS release 6.9 (Final).
> > Corosync version is  corosync-1.4.7.
> > The nodes are not seeing each other and not able to form memberships.
> > What I see is continuous message about " A processor joined or left the
> > membership and a new membership was formed."
> > For example:on node:  vm2883711991
> >
>
> I can't draw any conclusions from the logs; we'd need to see what
> corosync thought it was binding to and the IP addresses of the hosts.
>
> Have a look at the start of the logs and see if they match what you'd
> expect (i.e. are similar to the ones on the working clusters). Also check,
> using lsof, what addresses corosync is bound to. tcpdump on port 5405
> will show you if traffic is leaving the nodes and being received.
>
> Also check firewall settings and make sure the nodes can ping each other.
>
> If you're still stumped then feel free to post more info here for us to
> look at, though if you have that configuration working on other nodes it
> might be something in your environment.
>
> Chrissie
>
>
> >
> > Jun 07 07:54:52 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > vm2883711991 184555180
> > Jun 07 07:54:52 corosync [TOTEM ] A processor joined or left the
> > membership and a new membership was formed.
> > Jun 07 07:54:52 corosync [CPG   ] chosen downlist: sender r(0)
> > ip(172.22.0.11) ; members(old:1 left:0)
> > Jun 07 07:54:52 corosync [MAIN  ] Completed service synchronization,
> > ready to provide service.
> > Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> > membership event on ring 71084: memb=1, new=0, lost=0
> > Jun 07 07:55:04 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > vm2883711991 184555180
> > Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 71084: memb=1, new=0, lost=0
> > Jun 07 07:55:04 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > vm2883711991 184555180
> > Jun 07 07:55:04 corosync [TOTEM ] A processor joined or left the
> > membership and a new membership was formed.
> > Jun 07 07:55:04 corosync [CPG   ] chosen downlist: sender r(0)
> > ip(172.22.0.11) ; members(old:1 left:0)
> > Jun 07 07:55:04 corosync [MAIN  ] Completed service synchronization,
> > ready to 

[ClusterLabs] corosync not able to form cluster

2018-06-07 Thread Prasad Nagaraj
Hi - I am running corosync on 3 nodes of CentOS release 6.9 (Final).
The corosync version is corosync-1.4.7.
The nodes are not seeing each other and are not able to form a membership.
What I see are continuous messages about "A processor joined or left the
membership and a new membership was formed."
For example, on node vm2883711991:


Jun 07 07:54:52 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 07:54:52 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 07:54:52 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:54:52 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71084: memb=1, new=0, lost=0
Jun 07 07:55:04 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71084: memb=1, new=0, lost=0
Jun 07 07:55:04 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 07:55:04 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 07:55:04 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:04 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71096: memb=1, new=0, lost=0
Jun 07 07:55:16 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71096: memb=1, new=0, lost=0
Jun 07 07:55:16 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 07:55:16 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 07:55:16 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:16 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 07 07:55:28 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71108: memb=1, new=0, lost=0
Jun 07 07:55:28 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 07:55:28 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71108: memb=1, new=0, lost=0
Jun 07 07:55:28 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 07:55:28 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 07:55:28 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:28 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 07 07:55:40 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71120: memb=1, new=0, lost=0
Jun 07 07:55:40 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 07:55:40 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71120: memb=1, new=0, lost=0
Jun 07 07:55:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 07:55:40 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 07:55:40 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:40 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 07 07:55:52 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71132: memb=1, new=0, lost=0
Jun 07 07:55:52 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 07:55:52 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71132: memb=1, new=0, lost=0
Jun 07 07:55:52 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 07:55:52 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 07:55:52 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.11) ; members(old:1 left:0)
Jun 07 07:55:52 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Jun 07 07:56:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 71144: memb=1, new=0, lost=0
Jun 07 07:56:04 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm2883711991 184555180
Jun 07 07:56:04 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 71144: memb=1, new=0, lost=0
Jun 07 07:56:04 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm2883711991 184555180
Jun 07 07:56:04 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Jun 07 07:56:17 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on 

[ClusterLabs] Continuous membership events in Corosync

2018-05-03 Thread Prasad Nagaraj
Hi - I am trying to set up a 3-node cluster. I have corosync up and
running on all the nodes and all nodes have joined the cluster. However, in
the corosync logs I am seeing continuous, repeated messages about
membership events on the rings. They come every 12 seconds on all nodes.
Here is a sample:

May 03 10:46:29 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 2264: memb=3, new=0, lost=0
May 03 10:46:29 corosync [pcmk  ] info: pcmk_peer_update: memb:
vme6c794899e 83891884
May 03 10:46:29 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmc9d15655fe 151000748
May 03 10:46:29 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm5e42438470 184555180
May 03 10:46:29 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 2264: memb=3, new=0, lost=0
May 03 10:46:29 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vme6c794899e 83891884
May 03 10:46:29 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmc9d15655fe 151000748
May 03 10:46:29 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm5e42438470 184555180
May 03 10:46:29 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
May 03 10:46:29 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:46:29 corosync [MAIN  ] Completed service synchronization, ready
to provide service.

May 03 10:46:41 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 2268: memb=3, new=0, lost=0
May 03 10:46:41 corosync [pcmk  ] info: pcmk_peer_update: memb:
vme6c794899e 83891884
May 03 10:46:41 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmc9d15655fe 151000748
May 03 10:46:41 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm5e42438470 184555180
May 03 10:46:41 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 2268: memb=3, new=0, lost=0
May 03 10:46:41 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vme6c794899e 83891884
May 03 10:46:41 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmc9d15655fe 151000748
May 03 10:46:41 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm5e42438470 184555180
May 03 10:46:41 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
May 03 10:46:41 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:46:41 corosync [MAIN  ] Completed service synchronization, ready
to provide service.

May 03 10:46:54 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 2272: memb=3, new=0, lost=0
May 03 10:46:54 corosync [pcmk  ] info: pcmk_peer_update: memb:
vme6c794899e 83891884
May 03 10:46:54 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmc9d15655fe 151000748
May 03 10:46:54 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm5e42438470 184555180
May 03 10:46:54 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 2272: memb=3, new=0, lost=0
May 03 10:46:54 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vme6c794899e 83891884
May 03 10:46:54 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmc9d15655fe 151000748
May 03 10:46:54 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm5e42438470 184555180
May 03 10:46:54 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
May 03 10:46:54 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:46:54 corosync [MAIN  ] Completed service synchronization, ready
to provide service.

May 03 10:47:06 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 2276: memb=3, new=0, lost=0
May 03 10:47:06 corosync [pcmk  ] info: pcmk_peer_update: memb:
vme6c794899e 83891884
May 03 10:47:06 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmc9d15655fe 151000748
May 03 10:47:06 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm5e42438470 184555180
May 03 10:47:06 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 2276: memb=3, new=0, lost=0
May 03 10:47:06 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vme6c794899e 83891884
May 03 10:47:06 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmc9d15655fe 151000748
May 03 10:47:06 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm5e42438470 184555180
May 03 10:47:06 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
May 03 10:47:06 corosync [CPG   ] chosen downlist: sender r(0)
ip(172.22.0.5) ; members(old:3 left:0)
May 03 10:47:06 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
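
One way to see what corosync itself currently believes the membership to
be - this assumes the corosync 1.4 objctl tooling is installed - is to dump
the runtime member entries on each node:

corosync-objctl | grep 'runtime.totem.pg.mrp.srp.members'

A ring number that keeps advancing while the member list stays the same
usually means the totem ring is being repeatedly re-formed (for example
after token loss) rather than nodes genuinely joining and leaving.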

My corosync version is: corosync-1.4.7-5.el6.x86_64

my corosync.conf is:

compatibility: whitetank
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        member {
            memberaddr: 172.22.0.5
        }
        member {
            memberaddr: 172.22.0.9
        }
        member {
            memberaddr: 172.22.0.11
        }

        bindnetaddr: 172.22.0.5

        ringnumber: 0
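
For comparison only - not the configuration above, which is cut off - a
complete member-list setup of this shape on corosync 1.x would normally
also declare transport: udpu in the totem block and an mcastport in the
interface section, roughly:

totem {
    version: 2
    secauth: off
    threads: 0
    transport: udpu
    interface {
        ringnumber: 0
        # a network address or a local interface address both work here
        bindnetaddr: 172.22.0.0
        mcastport: 5405
        member {
            memberaddr: 172.22.0.5
        }
        member {
            memberaddr: 172.22.0.9
        }
        member {
            memberaddr: 172.22.0.11
        }
    }
}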