Hi Honza,

Here is what I also found in the log related to the freeze at 12:22:26:
"Corosync main process was not scheduled for XXXX..." Could this be the general cause of the issue?

Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161
Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12:
....
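For reference, the log message itself points at a workaround: corosync declares the token lost when its main process is not scheduled for longer than the pause threshold (here 4000 ms, derived from the configured token timeout), which then forces a new membership round. If the scheduling stall cannot be eliminated, the token timeout can be raised in the totem section of corosync.conf. A minimal sketch - the 15000 ms value is only an illustration, not a tested recommendation; it just needs to comfortably exceed the observed 6327 ms pause:

```
totem {
    version: 2
    # Illustrative value only: raise the token timeout so that
    # scheduling stalls shorter than the new threshold no longer
    # cause the token to be declared lost and a new configuration
    # to be formed. Value is in milliseconds.
    token: 15000
}
```

Note the trade-off: a larger token timeout also delays detection of real node failures, so this masks the symptom rather than fixing the underlying scheduling stall.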
Regards,
Attila

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Thursday, March 13, 2014 2:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> > -----Original Message-----
> > From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > Sent: Thursday, March 13, 2014 1:45 PM
> > To: The Pacemaker cluster resource manager; Andrew Beekhof
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > Hello,
> >
> > > -----Original Message-----
> > > From: Jan Friesse [mailto:jfrie...@redhat.com]
> > > Sent: Thursday, March 13, 2014 10:03 AM
> > > To: The Pacemaker cluster resource manager
> > > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> > >
> > > ...
> > >
> > > >>>> Also can you please try to set debug: on in corosync.conf and
> > > >>>> paste the full corosync.log then?
> > > >>>
> > > >>> I set debug to on, and did a few restarts, but could not
> > > >>> reproduce the issue yet - will post the logs as soon as I
> > > >>> manage to reproduce it.
> > > >>
> > > >> Perfect.
> > > >>
> > > >> Another option you can try to set is netmtu (1200 is usually safe).
> > > >
> > > > Finally I was able to reproduce the issue.
> > > > I restarted node ctsip2 at 21:10:14, and the CPU went to 100%
> > > > immediately (not when the node was up again).
> > > >
> > > > The corosync log with debug on is available at:
> > > > http://pastebin.com/kTpDqqtm
> > > >
> > > > To be honest, I had to wait much longer for this reproduction than
> > > > before, even though there was no change in the corosync
> > > > configuration - just potentially some system updates. But anyway,
> > > > the issue is unfortunately still there.
> > > > Previously, when this issue occurred, the CPU was at 100% on all
> > > > nodes - this time only on ctmgr, which was the DC...
> > > >
> > > > I hope you can find some useful details in the log.
> > >
> > > Attila,
> > > what seems to be interesting is
> > >
> > > Configuration ERRORs found during PE processing. Please run
> > > "crm_verify -L" to identify issues.
> > >
> > > I'm unsure how much of a problem this is, but I'm really not a
> > > pacemaker expert.
> >
> > Perhaps Andrew could comment on that. Any idea?
> >
> > > Anyway, I have a theory about what may be happening, and it looks
> > > like it is related to IPC (and probably not related to the network).
> > > But to make sure we will not try fixing an already fixed bug, can
> > > you please build:
> > > - New libqb (0.17.0). There are plenty of fixes in IPC
> > > - Corosync 2.3.3 (already plenty of IPC fixes)
> > > - And maybe also a newer pacemaker
> >
> > I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
> > from the Ubuntu package.
> > I am currently building libqb 0.17.0, will update you on the results.
> >
> > In the meantime we had another freeze, which did not seem to be
> > related to any restarts, but brought all corosync processes to 100%.
> > Please check out the corosync.log, perhaps it is a different cause:
> > http://pastebin.com/WMwzv0Rr
> >
> > In the meantime I will install the new libqb and send logs if we have
> > further issues.
> >
> > Thank you very much for your help!
> >
> > Regards,
> > Attila
>
> One more question:
> If I install libqb 0.17.0 from source, do I need to rebuild corosync as
> well, or will it be fine even though it was built with libqb 0.16.0?
>
> BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so
> I can see if it makes a difference. If I see crashes on the outdated
> ones, but not on the new ones, we are fine. :)
>
> Thanks,
> Attila
>
> > > I know you were not very happy using hand-compiled sources, but
> > > please give them at least a try.
> > >
> > > Thanks,
> > > Honza
> > >
> > > > Thanks,
> > > > Attila
> > > >
> > > >>
> > > >> Regards,
> > > >> Honza
> > > >>
> > > >>>
> > > >>> There are also a few things that might or might not be related:
> > > >>>
> > > >>> 1) Whenever I want to edit the configuration with "crm configure
> > > >>> edit",
> > >
> > > ...
> > >
> > > _______________________________________________
> > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org