Attila Megyeri wrote:
> Hi Honza,
>
> What I also found in the log related to the freeze at 12:22:26:
>
> Corosync main process was not scheduled for XXXX... Can it be the general
> cause of the issue?
>
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161
>
> Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase.
>
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12:
>
> ....
>
> Regards,
> Attila
>
>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>> Sent: Thursday, March 13, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>> -----Original Message-----
>>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>>> Sent: Thursday, March 13, 2014 1:45 PM
>>> To: The Pacemaker cluster resource manager; Andrew Beekhof
>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>
>>> Hello,
>>>
>>>> -----Original Message-----
>>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>>>> Sent: Thursday, March 13, 2014 10:03 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> ...
>>>>
>>>>>>>>
>>>>>>>> Also can you please try to set debug: on in corosync.conf and
>>>>>>>> paste full corosync.log then?
>>>>>>>
>>>>>>> I set debug to on, and did a few restarts but could not
>>>>>>> reproduce the issue yet - will post the logs as soon as I
>>>>>>> manage to reproduce.
>>>>>>>
>>>>>>
>>>>>> Perfect.
>>>>>>
>>>>>> Another option you can try to set is netmtu (1200 is usually safe).
>>>>>
>>>>> Finally I was able to reproduce the issue.
>>>>> I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
>>>>> (not when node was up again).
>>>>>
>>>>> The corosync log with debug on is available at:
>>>>> http://pastebin.com/kTpDqqtm
>>>>>
>>>>>
>>>>> To be honest, I had to wait much longer for this reproduction than
>>>>> before, even though there was no change in the corosync
>>>>> configuration - just potentially some system updates. But anyway,
>>>>> the issue is unfortunately still there.
>>>>> Previously, when this issue came, CPU was at 100% on all nodes -
>>>>> this time only on ctmgr, which was the DC...
>>>>>
>>>>> I hope you can find some useful details in the log.
>>>>>
>>>>
>>>> Attila,
>>>> what seems to be interesting is
>>>>
>>>> Configuration ERRORs found during PE processing. Please run
>>>> "crm_verify -L" to identify issues.
>>>>
>>>> I'm unsure how much of a problem this is, but I'm really not a
>>>> pacemaker expert.
>>>
>>> Perhaps Andrew could comment on that. Any idea?
>>>
>>>
>>>>
>>>> Anyway, I have a theory about what may be happening, and it looks
>>>> related to IPC (and probably not related to the network). But to
>>>> make sure we will not try fixing an already fixed bug, can you
>>>> please build:
>>>> - New libqb (0.17.0). There are plenty of fixes in IPC
>>>> - Corosync 2.3.3 (already plenty of IPC fixes)
>>>> - And maybe also newer pacemaker
>>>>
>>>
>>> I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
>>> from the Ubuntu package.
>>> I am currently building libqb 0.17.0, will update you on the results.
>>>
>>> In the meantime we had another freeze, which did not seem to be
>>> related to any restarts, but brought all corosync processes to 100%.
>>> Please check out the corosync.log, perhaps it is a different cause:
>>> http://pastebin.com/WMwzv0Rr
>>>
>>>
>>> In the meantime I will install the new libqb and send logs if we have
>>> further issues.
>>>
>>> Thank you very much for your help!
>>>
>>> Regards,
>>> Attila
>>>
>>
>> One more question:
>>
>> If I install libqb 0.17.0 from source, do I need to rebuild corosync as
>> well, or will it be fine if it was built with libqb 0.16.0?
>>

Theoretically everything should work (both libqb and corosync keep binary
compatibility). In practice it's always better to recompile.

>> BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can
>> see if it makes a difference. If I see crashes on the outdated ones, but not
>> on the new ones, we are fine. :)
>>
>> Thanks,
>>
>> Attila
>>
>>>
>>>> I know you were not very happy using hand-compiled sources, but
>>>> please give them at least a try.
>>>>
>>>> Thanks,
>>>> Honza
>>>>
>>>>> Thanks,
>>>>> Attila
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Honza
>>>>>>
>>>>>>>
>>>>>>> There are also a few things that might or might not be related:
>>>>>>>
>>>>>>> 1) Whenever I want to edit the configuration with "crm configure
>>>>>>> edit",
>>>>
>>>> ...
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
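
For anyone chasing the same "not scheduled" warning quoted at the top of this thread, here is a minimal sketch in Python for pulling those pause durations out of a syslog excerpt, so they can be compared against the reported threshold. It is not part of corosync. The regular expression is based only on the message wording quoted in this thread, so other corosync versions may need a different pattern, and the names `PAUSE_RE` and `find_scheduling_pauses` are made up for illustration.

```python
import re

# Pattern derived from the log message quoted in this thread (assumption:
# other corosync versions may word the warning differently). \s+ is used
# between words so that a message wrapped across lines by a mail client
# still matches.
PAUSE_RE = re.compile(
    r"Corosync main process was\s+not scheduled for\s+"
    r"(?P<paused>\d+(?:\.\d+)?) ms\s+"
    r"\(threshold is\s+(?P<threshold>\d+(?:\.\d+)?) ms\)"
)


def find_scheduling_pauses(log_text):
    """Return a list of (paused_ms, threshold_ms) for each warning found."""
    return [
        (float(m.group("paused")), float(m.group("threshold")))
        for m in PAUSE_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    # Sample line copied from the log excerpt in this thread.
    sample = (
        "Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN ] Corosync main process "
        "was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). "
        "Consider token timeout increase."
    )
    for paused, threshold in find_scheduling_pauses(sample):
        over = paused - threshold
        print(f"paused {paused:.1f} ms, {over:.1f} ms over the threshold")
```

Run against the sample line it prints `paused 6327.6 ms, 2327.6 ms over the threshold`. Feeding it a whole corosync.log (read into a string) gives a quick view of how often and how badly the main process was stalled, which may help decide whether a larger token timeout is even plausible as a workaround.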