Prasad,

Hi - My systems are single-core CPU VMs running on the Azure platform. I am

Ok, now it makes sense. I don't think you get many guarantees in a cloud environment, so quite a large scheduling pause can simply happen. Also, a single-core CPU is kind of "unsupported" today.

running MySQL on the nodes, which does generate high I/O load. And my bad, I
meant to say 'High CPU load detected' is logged by crmd and not corosync.
Corosync logs messages like 'Corosync main process was not scheduled
for.....', which in turn sometimes causes pacemaker monitor actions to
fail. Is increasing the token timeout a solution for this, or are there
other ways?

Yes. Actually, increasing the token timeout to quite a big value seems to be the only "solution". Please keep in mind that it's not free - when a real problem happens, corosync does not detect it before the token timeout expires, so with a large token timeout, detection takes longer.
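For example - and this value is purely illustrative, pick whatever matches how long your VMs can realistically be paused - the change goes into the totem section of corosync.conf:

totem {
        # ... keep your existing totem settings ...
        # example value only: declare token loss after 30 s instead of 10 s
        token: 30000
}

Just remember the flip side described above: with token: 30000, a genuinely dead node will also not be declared lost for 30 seconds.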

Regards,
  Honza


Thanks for the help
Prasad

On Wed, 22 Aug 2018, 11:55 am Jan Friesse, <jfrie...@redhat.com> wrote:

Prasad,

Thanks Ken and Ulrich. There is definitely high I/O on the system, with
iowait sometimes as high as 90%.
I have come across some previous posts saying that iowait is also
counted as CPU load by corosync. Is this true? Can high I/O lead
corosync to complain as in "Corosync main process was not scheduled
for..." or "High CPU load detected.."?

Yes it can.

Corosync never logs "High CPU load detected...".


I will surely monitor the system more.

Is that system a VM or a physical machine? Because "Corosync main process
was not scheduled for..." usually happens on VMs whose hosts are
highly overloaded.

Honza


Thanks for your help.
Prasad



On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot <kgail...@redhat.com>
wrote:

On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
Prasad Nagaraj <prasad.nagara...@gmail.com> wrote on 21.08.2018 at
11:42 in message
<cahbcuj0zdvpyalcr7tbnggb8qrzhh8udje+rsnkoewvmfb8...@mail.gmail.com>:
Hi Ken - Thanks for your response.

We have seen messages in other cases like
corosync [MAIN  ] Corosync main process was not scheduled for
17314.4746 ms
(threshold is 8000.0000 ms). Consider token timeout increase.
corosync [TOTEM ] A processor failed, forming new configuration.

Is this an indication of a failure due to CPU load issues, and will it
get resolved if I upgrade to the Corosync 2.x series?

Yes, most definitely this is a CPU issue. It means corosync isn't
getting enough CPU cycles to handle the cluster token before the
timeout is reached.

Upgrading may indeed help, as recent versions ensure that corosync runs
with real-time priority in the kernel, and thus are more likely to get
CPU time when something of lower priority is consuming all the CPU.
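
If you want to check what priority your corosync is currently running
at, something like the following should show it (chrt comes from
util-linux; the exact output format varies by distribution):

# show corosync's scheduling class and real-time priority
chrt -p $(pidof corosync)
ps -o pid,cls,rtprio,comm -p $(pidof corosync)

A recent corosync should report a real-time class such as SCHED_RR with
a nonzero rtprio; a default class with no rtprio means corosync is
competing on equal terms with everything else on the box.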

But of course, there is some underlying problem that should be
identified and addressed. Figure out what's maxing out the CPU or I/O.
Ulrich's monitoring suggestion is a good start.

Hi!

I'd strongly recommend starting monitoring on your nodes, at least
until you know what's going on. The good old UNIX sa (sysstat
package) could be a starting point. I'd monitor CPU idle
specifically. Then check for 100% device utilization, then look for
network bottlenecks...
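
For example (assuming the sysstat package is installed), running
something like this on each node until the next incident should narrow
down which resource is being exhausted:

sar -u 5        # CPU utilization, including %iowait, every 5 seconds
iostat -x 5     # extended per-device I/O statistics (%util column)
sar -n DEV 5    # per-interface network traffic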

A new corosync release cannot fix those, most likely.

Regards,
Ulrich


In any case, for the current scenario, we did not see any
scheduling-related messages.

Thanks for your help.
Prasad

On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgail...@redhat.com>
wrote:

On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
Hi:

One of these days, I saw a spurious node loss on my 3-node
corosync
cluster with following logged in the corosync.log of one of the
nodes.

Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Transitional membership event on ring 32: memb=2, new=0, lost=1
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
vm728316982d 201331884
Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
Stable
membership event on ring 32: memb=2, new=0, lost=0
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk  ] info:
ais_mark_unseen_peer_dead:
Node vm728316982d was not seen in the previous transition
Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
201331884/vm728316982d is now: lost
Aug 18 12:40:25 corosync [pcmk  ] info:
send_member_notification:
Sending membership update 32 to 3 children
Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
the
membership and a new membership was formed.
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:
plugin_handle_membership:     Membership 32: quorum retained
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
plugin_handle_membership:     Membership 32: quorum retained
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
peer_update_callback: vm728316982d is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  warning:
match_down_event:     No match for shutdown action on
vm728316982d
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
peer_update_callback: Stonith/shutdown of vm728316982d not
matched
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
crm_update_peer_join: peer_update_callback: Node
vm728316982d[201331884] - join-6 phase 4 -> 0
Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
abort_transition_graph:       Transition aborted: Node failure
(source=peer_update_callback:240, 1)
Aug 18 12:40:25 [4543] vmfa2757171f        cib:     info:
plugin_handle_membership:     Membership 32: quorum retained
Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
crm_update_peer_state_iter:   plugin_handle_membership: Node
vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
crm_reap_dead_member: Removing vm728316982d/201331884 from the
membership list
Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
reap_crm_member:      Purged 1 peers with id=201331884 and/or
uname=vm728316982d from the membership cache
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
crm_reap_dead_member: Removing vm728316982d/201331884 from the
membership list
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
reap_crm_member:      Purged 1 peers with id=201331884 and/or
uname=vm728316982d from the membership cache

However, within seconds, the node was able to join back.

Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update:
Stable
membership event on ring 36: memb=3, new=1, lost=0
Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
201331884/vm728316982d is now: member
Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
vm728316982d 201331884


But this was enough time for the cluster to get into a split-brain
kind of situation, with a resource on the node vm728316982d being
stopped because of this node-loss detection.

Could anyone help me understand whether this could happen due to a
transient network disruption or similar?
Are there any configuration settings that can be applied in
corosync.conf so that the cluster is more resilient to such temporary
disruptions?

Your corosync sensitivity of a 10-second token timeout and 10
retransmissions is already very lengthy -- likely the node was already
already
unresponsive for more than 10 seconds before the first message
above,
so it was more than 18 seconds before it rejoined.

It's rarely a good idea to change
token_retransmits_before_loss_const;
changing token is generally enough to deal with transient network
unreliability. However 18 seconds is a really long time to raise
the
token to, and it's uncertain from the information here whether
the root
cause was networking or something on the host.

I notice your configuration is corosync 1 with the pacemaker
plugin;
that is a long-deprecated setup, and corosync 3 is about to come
out,
so you may want to consider upgrading to at least corosync 2 and
a
reasonably recent pacemaker. That would give you some reliability
improvements, including real-time priority scheduling of
corosync,
which could have been the issue here if CPU load rather than
networking
was the root cause.
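
For reference, a rough sketch of what an equivalent corosync 2.x
configuration could look like - the nodeid values here are invented for
the example, so double-check everything against the corosync.conf(5)
man page of whatever version you install:

totem {
        version: 2
        secauth: on
        transport: udpu
        token: 10000
        interface {
                ringnumber: 0
                bindnetaddr: 172.20.0.12
                mcastport: 5405
        }
}
nodelist {
        node {
                ring0_addr: 172.20.0.4
                nodeid: 1
        }
        node {
                ring0_addr: 172.20.0.9
                nodeid: 2
        }
        node {
                ring0_addr: 172.20.0.12
                nodeid: 3
        }
}
quorum {
        provider: corosync_votequorum
}

The member{} entries move into a nodelist, the service{} plugin section
goes away entirely (pacemaker runs as its own daemons on top), and
quorum is handled by votequorum instead of the plugin.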


Currently my corosync.conf looks like this:

compatibility: whitetank

totem {
        version: 2
        secauth: on
        threads: 0
        interface {
                member {
                        memberaddr: 172.20.0.4
                }
                member {
                        memberaddr: 172.20.0.9
                }
                member {
                        memberaddr: 172.20.0.12
                }
                bindnetaddr: 172.20.0.12
                ringnumber: 0
                mcastport: 5405
                ttl: 1
        }
        transport: udpu
        token: 10000
        token_retransmits_before_loss_const: 10
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        to_syslog: no
        logfile: /var/log/cluster/corosync.log
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

service {
        name: pacemaker
        ver: 1
}

amf {
        mode: disabled
}

Thanks in advance for the help.
Prasad

--
Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
