>>> Prasad Nagaraj <prasad.nagara...@gmail.com> wrote on 21.08.2018 at 11:42 >>>
in message <cahbcuj0zdvpyalcr7tbnggb8qrzhh8udje+rsnkoewvmfb8...@mail.gmail.com>:
> Hi Ken - Thanks for your response.
>
> We have seen messages in other cases like:
>
> corosync [MAIN ] Corosync main process was not scheduled for 17314.4746 ms
> (threshold is 8000.0000 ms). Consider token timeout increase.
> corosync [TOTEM ] A processor failed, forming new configuration.
>
> Is this an indication of a failure due to CPU load issues, and will it be
> resolved if I upgrade to the Corosync 2.x series?

Hi!

I'd strongly recommend setting up monitoring on your nodes, at least until
you know what's going on. The good old UNIX sar (sysstat package) could be
a starting point. I'd monitor CPU idle specifically, then look for 100%
device utilization, then for network bottlenecks... A new corosync release
most likely cannot fix those.

Regards,
Ulrich
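For reference, a few stock sysstat invocations cover exactly the checks
Ulrich lists; the intervals and sample counts below are arbitrary examples:

    sar -u 5 12       # CPU utilization, 12 samples every 5s -- watch %idle
    sar -d 5 12       # block device statistics -- watch %util for saturation
    sar -n DEV 5 12   # per-interface network traffic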
>
> In any case, for the current scenario, we did not see any scheduling
> related messages.
>
> Thanks for your help.
> Prasad
>
> On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>
>> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>> > Hi:
>> >
>> > The other day, I saw a spurious node loss on my 3-node corosync
>> > cluster, with the following logged in the corosync.log of one of
>> > the nodes:
>> >
>> > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=1
>> > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vm02d780875f 67114156
>> > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vmfa2757171f 151000236
>> > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: vm728316982d 201331884
>> > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
>> > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm02d780875f 67114156
>> > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmfa2757171f 151000236
>> > Aug 18 12:40:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node vm728316982d was not seen in the previous transition
>> > Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: lost
>> > Aug 18 12:40:25 corosync [pcmk ] info: send_member_notification: Sending membership update 32 to 3 children
>> > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info: plugin_handle_membership: Membership 32: quorum retained
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
>> > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: plugin_handle_membership: Membership 32: quorum retained
>> > Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
>> > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: peer_update_callback: vm728316982d is now lost (was member)
>> > Aug 18 12:40:25 [4548] vmfa2757171f crmd: warning: match_down_event: No match for shutdown action on vm728316982d
>> > Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: peer_update_callback: Stonith/shutdown of vm728316982d not matched
>> > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: crm_update_peer_join: peer_update_callback: Node vm728316982d[201331884] - join-6 phase 4 -> 0
>> > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
>> > Aug 18 12:40:25 [4543] vmfa2757171f cib: info: plugin_handle_membership: Membership 32: quorum retained
>> > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
>> > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
>> > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
>> >
>> > However, within seconds, the node was able to join back:
>> >
>> > Aug 18 12:40:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 36: memb=3, new=1, lost=0
>> > Aug 18 12:40:34 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: member
>> > Aug 18 12:40:34 corosync [pcmk ] info: pcmk_peer_update: NEW: vm728316982d 201331884
>> >
>> > But this was enough time for the cluster to get into a split-brain-like
>> > situation, with a resource on node vm728316982d being stopped because
>> > of the node-loss detection.
>> >
>> > Could anyone advise whether this could happen due to a transient
>> > network disruption or the like?
>> > Are there any configuration settings that can be applied in
>> > corosync.conf so that the cluster is more resilient to such temporary
>> > disruptions?
>>
>> Your corosync sensitivity of a 10-second token timeout and 10
>> retransmissions is already very lenient -- likely the node was already
>> unresponsive for more than 10 seconds before the first message above,
>> so it was more than 18 seconds before it rejoined.
>>
>> It's rarely a good idea to change token_retransmits_before_loss_const;
>> changing token is generally enough to deal with transient network
>> unreliability. However, 18 seconds is a really long time to raise the
>> token to, and it's uncertain from the information here whether the root
>> cause was networking or something on the host.
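Both knobs live in the totem section of corosync.conf. A minimal sketch,
with values that are purely illustrative (the stated defaults are from
corosync.conf(5)), not a recommendation for this cluster:

    totem {
            # Token loss timeout in milliseconds: roughly how long a silent
            # node goes unnoticed before a new membership is formed
            # (default 1000; this cluster uses 10000).
            token: 10000
            # Retransmit attempts before the token is declared lost;
            # default is 4, and it is rarely worth changing.
            token_retransmits_before_loss_const: 4
    }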
>> I notice your configuration is corosync 1 with the pacemaker plugin;
>> that is a long-deprecated setup, and corosync 3 is about to come out,
>> so you may want to consider upgrading to at least corosync 2 and a
>> reasonably recent pacemaker. That would give you some reliability
>> improvements, including real-time priority scheduling of corosync,
>> which could have been the issue here if CPU load rather than networking
>> was the root cause.
>>
>> > Currently my corosync.conf looks like this:
>> >
>> > compatibility: whitetank
>> >
>> > totem {
>> >         version: 2
>> >         secauth: on
>> >         threads: 0
>> >         interface {
>> >                 member {
>> >                         memberaddr: 172.20.0.4
>> >                 }
>> >                 member {
>> >                         memberaddr: 172.20.0.9
>> >                 }
>> >                 member {
>> >                         memberaddr: 172.20.0.12
>> >                 }
>> >                 bindnetaddr: 172.20.0.12
>> >                 ringnumber: 0
>> >                 mcastport: 5405
>> >                 ttl: 1
>> >         }
>> >         transport: udpu
>> >         token: 10000
>> >         token_retransmits_before_loss_const: 10
>> > }
>> >
>> > logging {
>> >         fileline: off
>> >         to_stderr: yes
>> >         to_logfile: yes
>> >         to_syslog: no
>> >         logfile: /var/log/cluster/corosync.log
>> >         timestamp: on
>> >         logger_subsys {
>> >                 subsys: AMF
>> >                 debug: off
>> >         }
>> > }
>> >
>> > service {
>> >         name: pacemaker
>> >         ver: 1
>> > }
>> >
>> > amf {
>> >         mode: disabled
>> > }
>> >
>> > Thanks in advance for the help.
>> > Prasad
>>
>> --
>> Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
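For anyone following Ken's upgrade advice: under corosync 2.x the pacemaker
plugin (the service { ver: 1 } block) goes away, since pacemaker runs as its
own daemons; quorum is handled by votequorum; and the member list becomes a
nodelist. Below is a rough, untested sketch of what an equivalent 2.x
configuration might look like, with the thread's addresses carried over --
verify everything against corosync.conf(5) for your version:

    totem {
            version: 2
            transport: udpu
            token: 10000
            # crypto_cipher/crypto_hash replace secauth in 2.x
            crypto_cipher: aes256
            crypto_hash: sha1
    }

    nodelist {
            node { ring0_addr: 172.20.0.4 }
            node { ring0_addr: 172.20.0.9 }
            node { ring0_addr: 172.20.0.12 }
    }

    quorum {
            provider: corosync_votequorum
    }

    logging {
            to_logfile: yes
            logfile: /var/log/cluster/corosync.log
            to_syslog: no
            timestamp: on
    }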