Thanks Ken and Ulrich. There is definitely high I/O on the system, with iowait sometimes up to 90%. I have come across some previous posts suggesting that iowait is also counted as CPU load by corosync. Is this true? Can high I/O lead corosync to complain, as in "Corosync main process was not scheduled for..." or "High CPU load detected..."?
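
In the meantime, I am checking corosync's scheduling priority and watching the disks with something like the following (a rough sketch -- iostat is from the sysstat package and chrt from util-linux; exact options may vary by version). I also notice that the 8000.0000 ms threshold in our earlier log is exactly 80% of our 10000 ms token, so the warning threshold does seem tied to the token timeout.

    # Per-device utilization; %util near 100 means a saturated disk.
    iostat -x 5
    # Show the scheduling policy and priority of the corosync process.
    chrt -p $(pidof corosync)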
I will surely monitor the system more. Thanks for your help.
Prasad

On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot <kgail...@redhat.com> wrote:
> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> > Prasad Nagaraj <prasad.nagara...@gmail.com> wrote on 21.08.2018 at
> > 11:42 in message
> > <cahbcuj0zdvpyalcr7tbnggb8qrzhh8udje+rsnkoewvmfb8...@mail.gmail.com>:
> > > Hi Ken - Thanks for your response.
> > >
> > > We have seen messages in other cases like:
> > >
> > > corosync [MAIN ] Corosync main process was not scheduled for
> > > 17314.4746 ms (threshold is 8000.0000 ms). Consider token timeout
> > > increase.
> > > corosync [TOTEM ] A processor failed, forming new configuration.
> > >
> > > Is this an indication of a failure due to CPU load issues, and will
> > > it be resolved if I upgrade to the corosync 2.x series?
>
> Yes, most definitely this is a CPU issue. It means corosync isn't
> getting enough CPU cycles to handle the cluster token before the
> timeout is reached.
>
> Upgrading may indeed help, as recent versions ensure that corosync runs
> with real-time priority in the kernel, and thus are more likely to get
> CPU time when something of lower priority is consuming all the CPU.
>
> But of course, there is some underlying problem that should be
> identified and addressed. Figure out what's maxing out the CPU or I/O.
> Ulrich's monitoring suggestion is a good start.
>
> > Hi!
> >
> > I'd strongly recommend starting monitoring on your nodes, at least
> > until you know what's going on. The good old UNIX sa (sysstat
> > package) could be a starting point. I'd monitor CPU idle
> > specifically. Then go for 100% device utilization, then look for
> > network bottlenecks...
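> >
> > For example, something like (sysstat commands; exact options vary by
> > version):
> >
> >     sar -u 5      # CPU utilization, including %idle and %iowait
> >     sar -d 5      # per-device activity and utilization
> >     sar -n DEV 5  # per-interface network statistics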
> >
> > A new corosync release cannot fix those, most likely.
> >
> > Regards,
> > Ulrich
>
> > > In any case, for the current scenario, we did not see any
> > > scheduling-related messages.
> > >
> > > Thanks for your help.
> > > Prasad
> > >
> > > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgail...@redhat.com>
> > > wrote:
> > > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> > > > > Hi:
> > > > >
> > > > > One of these days, I saw a spurious node loss on my 3-node
> > > > > corosync cluster, with the following logged in the corosync.log
> > > > > of one of the nodes:
> > > > >
> > > > > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=1
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: vm728316982d 201331884
> > > > > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node vm728316982d was not seen in the previous transition
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: lost
> > > > > Aug 18 12:40:25 corosync [pcmk ] info: send_member_notification: Sending membership update 32 to 3 children
> > > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info: plugin_handle_membership: Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: plugin_handle_membership: Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: peer_update_callback: vm728316982d is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: warning: match_down_event: No match for shutdown action on vm728316982d
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: peer_update_callback: Stonith/shutdown of vm728316982d not matched
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: crm_update_peer_join: peer_update_callback: Node vm728316982d[201331884] - join-6 phase 4 -> 0
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: info: plugin_handle_membership: Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
> > > > >
> > > > > However, within seconds, the node was able to join back.
> > > > >
> > > > > Aug 18 12:40:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 36: memb=3, new=1, lost=0
> > > > > Aug 18 12:40:34 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: member
> > > > > Aug 18 12:40:34 corosync [pcmk ] info: pcmk_peer_update: NEW: vm728316982d 201331884
> > > > >
> > > > > But this was enough time for the cluster to get into a
> > > > > split-brain kind of situation, with a resource on the node
> > > > > vm728316982d being stopped because of this node-loss detection.
> > > > >
> > > > > Could anyone help me understand whether this could happen due to
> > > > > a transient network distortion or something similar?
> > > > > Are there any configuration settings that can be applied in
> > > > > corosync.conf so that the cluster is more resilient to such
> > > > > temporary distortions?
> > > >
> > > > Your corosync failure-detection settings -- a 10-second token
> > > > timeout and 10 retransmissions -- are already very lenient: likely
> > > > the node was already unresponsive for more than 10 seconds before
> > > > the first message above, so it was more than 18 seconds before it
> > > > rejoined.
> > > >
> > > > It's rarely a good idea to change
> > > > token_retransmits_before_loss_const; changing token is generally
> > > > enough to deal with transient network unreliability. However, 18
> > > > seconds is a really long time to raise the token to, and it's
> > > > uncertain from the information here whether the root cause was
> > > > networking or something on the host.
> > > >
> > > > I notice your configuration is corosync 1 with the pacemaker
> > > > plugin; that is a long-deprecated setup, and corosync 3 is about
> > > > to come out, so you may want to consider upgrading to at least
> > > > corosync 2 and a reasonably recent pacemaker. That would give you
> > > > some reliability improvements, including real-time priority
> > > > scheduling of corosync, which could have been the issue here if
> > > > CPU load rather than networking was the root cause.
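> > > >
> > > > Purely as an untested illustration (not a tuning recommendation;
> > > > adapt addresses and values to your environment), a corosync 2.x
> > > > equivalent of your setup would drop the pacemaker plugin and use a
> > > > nodelist plus votequorum, with token still tunable in the totem
> > > > section:
> > > >
> > > >     totem {
> > > >         version: 2
> > > >         secauth: on
> > > >         transport: udpu
> > > >         token: 10000
> > > >     }
> > > >
> > > >     nodelist {
> > > >         node {
> > > >             ring0_addr: 172.20.0.4
> > > >             nodeid: 1
> > > >         }
> > > >         node {
> > > >             ring0_addr: 172.20.0.9
> > > >             nodeid: 2
> > > >         }
> > > >         node {
> > > >             ring0_addr: 172.20.0.12
> > > >             nodeid: 3
> > > >         }
> > > >     }
> > > >
> > > >     quorum {
> > > >         provider: corosync_votequorum
> > > >     }
> > > >
> > > > Pacemaker would then be started as its own system service rather
> > > > than via the service {} block.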
> > > > >
> > > > > Currently my corosync.conf looks like this:
> > > > >
> > > > > compatibility: whitetank
> > > > >
> > > > > totem {
> > > > >     version: 2
> > > > >     secauth: on
> > > > >     threads: 0
> > > > >     interface {
> > > > >         member {
> > > > >             memberaddr: 172.20.0.4
> > > > >         }
> > > > >         member {
> > > > >             memberaddr: 172.20.0.9
> > > > >         }
> > > > >         member {
> > > > >             memberaddr: 172.20.0.12
> > > > >         }
> > > > >         bindnetaddr: 172.20.0.12
> > > > >         ringnumber: 0
> > > > >         mcastport: 5405
> > > > >         ttl: 1
> > > > >     }
> > > > >     transport: udpu
> > > > >     token: 10000
> > > > >     token_retransmits_before_loss_const: 10
> > > > > }
> > > > >
> > > > > logging {
> > > > >     fileline: off
> > > > >     to_stderr: yes
> > > > >     to_logfile: yes
> > > > >     to_syslog: no
> > > > >     logfile: /var/log/cluster/corosync.log
> > > > >     timestamp: on
> > > > >     logger_subsys {
> > > > >         subsys: AMF
> > > > >         debug: off
> > > > >     }
> > > > > }
> > > > >
> > > > > service {
> > > > >     name: pacemaker
> > > > >     ver: 1
> > > > > }
> > > > >
> > > > > amf {
> > > > >     mode: disabled
> > > > > }
> > > > >
> > > > > Thanks in advance for the help.
> > > > > Prasad
> > > >
> > > > --
> > > > Ken Gaillot <kgail...@redhat.com>
>
> --
> Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org