Re: [Pacemaker] Pacemaker/corosync freeze
Hi Attila,

Did you try compiling libqb 0.17.0? Wondering if that solved your issue? I also have the same issue. Please suggest if you already solved it.

Thanks,
Sreenivasa

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker/corosync freeze
Hi Andrew,

> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 18, 2014 11:40 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> On 18 Mar 2014, at 6:03 pm, Attila Megyeri wrote:
>
> > Hello,
> >
> >> -----Original Message-----
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 18, 2014 2:43 AM
> >> To: Attila Megyeri
> >> Cc: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> On 13 Mar 2014, at 11:44 pm, Attila Megyeri wrote:
> >>
> >>> Hello,
> >>>
> >>>> -----Original Message-----
> >>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >>>> Sent: Thursday, March 13, 2014 10:03 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>> ...
> >>>>
> >>>> Attila,
> >>>> what seems to be interesting is
> >>>>
> >>>> Configuration ERRORs found during PE processing. Please run "crm_verify -L"
> >>>> to identify issues.
> >>>>
> >>>> I'm unsure how much is this problem but I'm really not pacemaker expert.
> >>>
> >>> Perhaps Andrew could comment on that. Any idea?
> >>
> >> Did you run the command? What did it say?
> >
> > Yes, all was fine. This is why I found it strange.
>
> If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then
> I should be able to figure out what it was complaining about.
> (You can also run: crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -V )

The file is still there, and the crm_verify check is successful (exit code 0) with no output. The file is full of confidential data, but if you think you can find something useful in it I can send it in a direct mail.

thanks!
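For reference, the two crm_verify invocations discussed in this exchange look like this (the pe-error file path is the one from the thread; substitute your own saved policy-engine input):

```shell
# Check the live CIB of the running cluster for configuration errors
# (-V adds verbosity; repeat it for more detail):
crm_verify -L -V

# Re-check a saved policy-engine input instead of the live CIB;
# crm_verify reads the bzip2-compressed pengine files directly:
crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -V
```

An exit code of 0 with no output, as Attila reports, means crm_verify found nothing wrong with that input.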
Re: [Pacemaker] Pacemaker/corosync freeze
On 18 Mar 2014, at 6:03 pm, Attila Megyeri wrote:

> Hello,
>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, March 18, 2014 2:43 AM
>> To: Attila Megyeri
>> Cc: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> On 13 Mar 2014, at 11:44 pm, Attila Megyeri wrote:
>>
>>> Hello,
>>>
>>>> -----Original Message-----
>>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>>>> Sent: Thursday, March 13, 2014 10:03 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> ...
>>>>
>>>> Attila,
>>>> what seems to be interesting is
>>>>
>>>> Configuration ERRORs found during PE processing. Please run "crm_verify -L"
>>>> to identify issues.
>>>>
>>>> I'm unsure how much is this problem but I'm really not pacemaker expert.
>>>
>>> Perhaps Andrew could comment on that. Any idea?
>>
>> Did you run the command? What did it say?
>
> Yes, all was fine. This is why I found it strange.

If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then I should be able to figure out what it was complaining about.
(You can also run: crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -V )
Re: [Pacemaker] Pacemaker/corosync freeze
Hello,

> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 18, 2014 2:43 AM
> To: Attila Megyeri
> Cc: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> On 13 Mar 2014, at 11:44 pm, Attila Megyeri wrote:
>
> > Hello,
> >
> >> -----Original Message-----
> >> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >> Sent: Thursday, March 13, 2014 10:03 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> ...
> >>
> >>>>>> Also can you please try to set debug: on in corosync.conf and
> >>>>>> paste full corosync.log then?
> >>>>>
> >>>>> I set debug to on, and did a few restarts but could not reproduce
> >>>>> the issue yet - will post the logs as soon as I manage to reproduce.
> >>>>
> >>>> Perfect.
> >>>>
> >>>> Another option you can try to set is netmtu (1200 is usually safe).
> >>>
> >>> Finally I was able to reproduce the issue.
> >>> I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> >>> (not when node was up again).
> >>>
> >>> The corosync log with debug on is available at:
> >>> http://pastebin.com/kTpDqqtm
> >>>
> >>> To be honest, I had to wait much longer for this reproduction as before,
> >>> even though there was no change in the corosync configuration - just
> >>> potentially some system updates. But anyway, the issue is unfortunately still there.
> >>> Previously, when this issue came, cpu was at 100% on all nodes - this time
> >>> only on ctmgr, which was the DC...
> >>>
> >>> I hope you can find some useful details in the log.
> >>
> >> Attila,
> >> what seems to be interesting is
> >>
> >> Configuration ERRORs found during PE processing. Please run "crm_verify -L"
> >> to identify issues.
> >>
> >> I'm unsure how much is this problem but I'm really not pacemaker expert.
> >
> > Perhaps Andrew could comment on that. Any idea?
>
> Did you run the command? What did it say?

Yes, all was fine. This is why I found it strange.
Re: [Pacemaker] Pacemaker/corosync freeze
On 13 Mar 2014, at 11:44 pm, Attila Megyeri wrote:

> Hello,
>
>> -----Original Message-----
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Thursday, March 13, 2014 10:03 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> ...
>>
>>>>>> Also can you please try to set debug: on in corosync.conf and paste
>>>>>> full corosync.log then?
>>>>>
>>>>> I set debug to on, and did a few restarts but could not reproduce
>>>>> the issue yet - will post the logs as soon as I manage to reproduce.
>>>>
>>>> Perfect.
>>>>
>>>> Another option you can try to set is netmtu (1200 is usually safe).
>>>
>>> Finally I was able to reproduce the issue.
>>> I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not
>>> when node was up again).
>>>
>>> The corosync log with debug on is available at:
>>> http://pastebin.com/kTpDqqtm
>>>
>>> To be honest, I had to wait much longer for this reproduction as before,
>>> even though there was no change in the corosync configuration - just
>>> potentially some system updates. But anyway, the issue is unfortunately still there.
>>> Previously, when this issue came, cpu was at 100% on all nodes - this time
>>> only on ctmgr, which was the DC...
>>>
>>> I hope you can find some useful details in the log.
>>
>> Attila,
>> what seems to be interesting is
>>
>> Configuration ERRORs found during PE processing. Please run "crm_verify -L"
>> to identify issues.
>>
>> I'm unsure how much is this problem but I'm really not pacemaker expert.
>
> Perhaps Andrew could comment on that. Any idea?

Did you run the command? What did it say?
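The two corosync.conf settings Jan suggests in this exchange (debug logging and a smaller netmtu) would be set roughly as below. This is a sketch, not the poster's actual configuration; the logfile path and to_logfile setting are illustrative assumptions:

```
totem {
    version: 2
    # Smaller MTU to avoid fragmentation-related trouble on some networks;
    # 1200 is the "usually safe" value suggested above.
    netmtu: 1200
}

logging {
    # Verbose debug logging - very noisy, revert once the issue is captured
    debug: on
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
}
```

After editing corosync.conf, the change takes effect on corosync restart (or, for logging, via `corosync-cmapctl` on a running 2.x daemon).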
Re: [Pacemaker] Pacemaker/corosync freeze
Hi David, Jan,

For the time being corosync 2.3.3 looks stable with libqb 0.17.0, with both built from source.

Thank you very much for the guidance!

Attila

> -----Original Message-----
> From: David Vossel [mailto:dvos...@redhat.com]
> Sent: Thursday, March 13, 2014 9:22 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> ----- Original Message -----
> > From: "Jan Friesse"
> > To: "The Pacemaker cluster resource manager"
> > Sent: Thursday, March 13, 2014 4:03:28 AM
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > ...
> >
> > >>>> Also can you please try to set debug: on in corosync.conf and
> > >>>> paste full corosync.log then?
> > >>>
> > >>> I set debug to on, and did a few restarts but could not reproduce
> > >>> the issue yet - will post the logs as soon as I manage to reproduce.
> > >>
> > >> Perfect.
> > >>
> > >> Another option you can try to set is netmtu (1200 is usually safe).
> > >
> > > Finally I was able to reproduce the issue.
> > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > (not when node was up again).
> > >
> > > The corosync log with debug on is available at:
> > > http://pastebin.com/kTpDqqtm
> > >
> > > To be honest, I had to wait much longer for this reproduction as
> > > before, even though there was no change in the corosync
> > > configuration - just potentially some system updates. But anyway,
> > > the issue is unfortunately still there.
> > > Previously, when this issue came, cpu was at 100% on all nodes -
> > > this time only on ctmgr, which was the DC...
> > >
> > > I hope you can find some useful details in the log.
> >
> > Attila,
> > what seems to be interesting is
> >
> > Configuration ERRORs found during PE processing. Please run
> > "crm_verify -L" to identify issues.
> >
> > I'm unsure how much is this problem but I'm really not pacemaker expert.
> >
> > Anyway, I have theory what may happening and it looks like related
> > with IPC (and probably not related to network). But to make sure we
> > will not try fixing already fixed bug, can you please build:
> > - New libqb (0.17.0). There are plenty of fixes in IPC
> > - Corosync 2.3.3 (already plenty IPC fixes)
>
> yes, there was a libqb/corosync interoperation problem that showed these
> same symptoms last year. Updating to the latest corosync and libqb will likely
> resolve this.
>
> > - And maybe also newer pacemaker
> >
> > I know you were not very happy using hand-compiled sources, but please
> > give them at least a try.
> >
> > Thanks,
> > Honza
> >
> > > Thanks,
> > > Attila
> >
> > >> Regards,
> > >> Honza
> > >>
> > >>> There are also a few things that might or might not be related:
> > >>>
> > >>> 1) Whenever I want to edit the configuration with "crm configure
> > >>> edit",
> >
> > ...
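A minimal sketch of the build-from-source sequence Honza and David are recommending. The tarball names and install prefix are assumptions (fetch the releases from the libqb and corosync project sites for your platform); the important detail is to build and install libqb first, so corosync's configure picks up the new IPC library:

```shell
# 1. Build and install libqb 0.17.0 first (corosync's IPC layer lives here)
tar xf libqb-0.17.0.tar.gz && cd libqb-0.17.0
./configure --prefix=/usr
make && sudo make install
cd ..

# 2. Then build corosync 2.3.3 against the freshly installed libqb
tar xf corosync-2.3.3.tar.gz && cd corosync-2.3.3
./configure --prefix=/usr
make && sudo make install
cd ..

# 3. Refresh the dynamic linker cache so the daemons resolve the new libqb
sudo ldconfig
```

Stop pacemaker and corosync on a node before replacing the binaries, and upgrade one node at a time if the cluster must stay up.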
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a):
> Hi Honza,
>
> What I also found in the log related to the freeze at 12:22:26:
>
> Corosync main process was not scheduled for ... Can it be the general
> cause of the issue?

I don't think it will cause the issue you are hitting, BUT keep in mind that if corosync is not scheduled for a long time, the node will probably be fenced by another node. So increasing the timeout is vital.

Honza

> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161
>
> Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase.
>
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11:
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12:
>
> Regards,
> Attila
>
>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>> Sent: Thursday, March 13, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>> -----Original Message-----
>>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>>> Sent: Thursday, March 13, 2014 1:45 PM
>>> To: The Pacemaker cluster resource manager; Andrew Beekhof
>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>
>>> Hello,
>>>
>>>> -----Original Message-----
>>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>>>> Sent: Thursday, March 13, 2014 10:03 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> ...
>>>>
>>>>>>>> Also can you please try to set debug: on in corosync.conf and
>>>>>>>> paste full corosync.log then?
>>>>>>>
>>>>>>> I set debug to on, and did a few restarts but could not
>>>>>>> reproduce the issue yet - will post the logs as soon as I manage to reproduce.
>>>>>>
>>>>>> Perfect.
>>>>>>
>>>>>> Another option you can try to set is netmtu (1200 is usually safe).
>>>>>
>>>>> Finally I was able to reproduce the issue.
>>>>> I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
>>>>> (not when node was up again).
>>>>>
>>>>> The corosync log with debug on is available at:
>>>>> http://pastebin.com/kTpDqqtm
>>>>>
>>>>> To be honest, I had to wait much longer for this reproduction as before,
>>>> even though there was no change in the corosync configuration - just
>>>> potentially some system updates. But anyway, the issue is
>>>> unfortunately still there.
>>>>> Previously, when this issue
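Honza's advice to increase the token timeout would look roughly like this in corosync.conf. The 10000 ms figure is an illustrative choice sized above the ~6.3 s scheduling gap reported in the log, not a value recommended anywhere in the thread:

```
totem {
    version: 2
    # Token timeout in milliseconds. The "not scheduled for 6327 ms
    # (threshold is 4000 ms)" warning above means the default is too
    # tight for this host; raise it above the observed stall.
    token: 10000
}
```

The trade-off: a larger token timeout tolerates longer scheduling stalls without a false membership change, at the cost of slower detection of genuinely failed nodes.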
Re: [Pacemaker] Pacemaker/corosync freeze
Hello David,

> -----Original Message-----
> From: David Vossel [mailto:dvos...@redhat.com]
> Sent: Thursday, March 13, 2014 9:22 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> ----- Original Message -----
> > From: "Jan Friesse"
> > To: "The Pacemaker cluster resource manager"
> > Sent: Thursday, March 13, 2014 4:03:28 AM
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > ...
> >
> > >>>> Also can you please try to set debug: on in corosync.conf and
> > >>>> paste full corosync.log then?
> > >>>
> > >>> I set debug to on, and did a few restarts but could not reproduce
> > >>> the issue yet - will post the logs as soon as I manage to reproduce.
> > >>
> > >> Perfect.
> > >>
> > >> Another option you can try to set is netmtu (1200 is usually safe).
> > >
> > > Finally I was able to reproduce the issue.
> > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > (not when node was up again).
> > >
> > > The corosync log with debug on is available at:
> > > http://pastebin.com/kTpDqqtm
> > >
> > > To be honest, I had to wait much longer for this reproduction as
> > > before, even though there was no change in the corosync
> > > configuration - just potentially some system updates. But anyway,
> > > the issue is unfortunately still there.
> > > Previously, when this issue came, cpu was at 100% on all nodes -
> > > this time only on ctmgr, which was the DC...
> > >
> > > I hope you can find some useful details in the log.
> >
> > Attila,
> > what seems to be interesting is
> >
> > Configuration ERRORs found during PE processing. Please run
> > "crm_verify -L" to identify issues.
> >
> > I'm unsure how much is this problem but I'm really not pacemaker expert.
> >
> > Anyway, I have theory what may happening and it looks like related
> > with IPC (and probably not related to network). But to make sure we
> > will not try fixing already fixed bug, can you please build:
> > - New libqb (0.17.0). There are plenty of fixes in IPC
> > - Corosync 2.3.3 (already plenty IPC fixes)
>
> yes, there was a libqb/corosync interoperation problem that showed these
> same symptoms last year. Updating to the latest corosync and libqb will likely
> resolve this.

I have upgraded all nodes to these versions and we are testing. So far no issues.

Thank you very much for your help.

Regards,
Attila

> > - And maybe also newer pacemaker
> >
> > I know you were not very happy using hand-compiled sources, but please
> > give them at least a try.
> >
> > Thanks,
> > Honza
> >
> > > Thanks,
> > > Attila
> >
> > >> Regards,
> > >> Honza
> > >>
> > >>> There are also a few things that might or might not be related:
> > >>>
> > >>> 1) Whenever I want to edit the configuration with "crm configure
> > >>> edit",
> >
> > ...
Re: [Pacemaker] Pacemaker/corosync freeze
----- Original Message -----
> From: "Jan Friesse"
> To: "The Pacemaker cluster resource manager"
> Sent: Thursday, March 13, 2014 4:03:28 AM
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> ...
>
> >>>> Also can you please try to set debug: on in corosync.conf and paste
> >>>> full corosync.log then?
> >>>
> >>> I set debug to on, and did a few restarts but could not reproduce the issue
> >>> yet - will post the logs as soon as I manage to reproduce.
> >>
> >> Perfect.
> >>
> >> Another option you can try to set is netmtu (1200 is usually safe).
> >
> > Finally I was able to reproduce the issue.
> > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not
> > when node was up again).
> >
> > The corosync log with debug on is available at:
> > http://pastebin.com/kTpDqqtm
> >
> > To be honest, I had to wait much longer for this reproduction as before,
> > even though there was no change in the corosync configuration - just
> > potentially some system updates. But anyway, the issue is unfortunately
> > still there.
> > Previously, when this issue came, cpu was at 100% on all nodes - this time
> > only on ctmgr, which was the DC...
> >
> > I hope you can find some useful details in the log.
>
> Attila,
> what seems to be interesting is
>
> Configuration ERRORs found during PE processing. Please run "crm_verify
> -L" to identify issues.
>
> I'm unsure how much is this problem but I'm really not pacemaker expert.
>
> Anyway, I have theory what may happening and it looks like related with
> IPC (and probably not related to network). But to make sure we will not
> try fixing already fixed bug, can you please build:
> - New libqb (0.17.0). There are plenty of fixes in IPC
> - Corosync 2.3.3 (already plenty IPC fixes)

yes, there was a libqb/corosync interoperation problem that showed these same symptoms last year. Updating to the latest corosync and libqb will likely resolve this.

> - And maybe also newer pacemaker
>
> I know you were not very happy using hand-compiled sources, but please
> give them at least a try.
>
> Thanks,
> Honza
>
> > Thanks,
> > Attila
>
> >> Regards,
> >> Honza
> >>
> >>> There are also a few things that might or might not be related:
> >>>
> >>> 1) Whenever I want to edit the configuration with "crm configure edit",
>
> ...
Re: [Pacemaker] Pacemaker/corosync freeze
Hi Honza,

What I also found in the log related to the freeze at 12:22:26:

Corosync main process was not scheduled for ... Can it be the general cause of the issue?

Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161

Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase.

Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12:

Regards,
Attila

> -----Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Thursday, March 13, 2014 2:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> > -----Original Message-----
> > From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > Sent: Thursday, March 13, 2014 1:45 PM
> > To: The Pacemaker cluster resource manager; Andrew Beekhof
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > Hello,
> >
> > > -----Original Message-----
> > > From: Jan Friesse [mailto:jfrie...@redhat.com]
> > > Sent: Thursday, March 13, 2014 10:03 AM
> > > To: The Pacemaker cluster resource manager
> > > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> > >
> > > ...
> > >
> > > >>>> Also can you please try to set debug: on in corosync.conf and
> > > >>>> paste full corosync.log then?
> > > >>>
> > > >>> I set debug to on, and did a few restarts but could not
> > > >>> reproduce the issue yet - will post the logs as soon as I manage to reproduce.
> > > >>
> > > >> Perfect.
> > > >>
> > > >> Another option you can try to set is netmtu (1200 is usually safe).
> > > >
> > > > Finally I was able to reproduce the issue.
> > > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > > (not when node was up again).
> > > >
> > > > The corosync log with debug on is available at:
> > > > http://pastebin.com/kTpDqqtm
> > > >
> > > > To be honest, I had to wait much longer for this reproduction as
> > > > before, even though there was no change in the corosync configuration - just
> > > potentially some system updates. But anyway, the issue is
> > > unfortunately still there.
> > > > Previously, when this issue came, cpu was at 100% on all nodes -
> > > > this time only on ctmgr, which was the DC...
> > > >
> > > > I hope you can find some useful details in the log.
> > >
> > > Attila,
> > > what seems to be interesting is
> > >
> > > Configuration ERRORs found during PE processing. Please run
> > > "crm_verify -L" to identify issues.
> > >
> > > I'm unsure how much is this problem but I'm really not pacemaker
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Attila Megyeri [mailto:amegy...@minerva-soft.com] > Sent: Thursday, March 13, 2014 1:45 PM > To: The Pacemaker cluster resource manager; Andrew Beekhof > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Hello, > > > -Original Message- > > From: Jan Friesse [mailto:jfrie...@redhat.com] > > Sent: Thursday, March 13, 2014 10:03 AM > > To: The Pacemaker cluster resource manager > > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > > ... > > > > >>>> > > >>>> Also can you please try to set debug: on in corosync.conf and > > >>>> paste full corosync.log then? > > >>> > > >>> I set debug to on, and did a few restarts but could not reproduce > > >>> the issue > > >> yet - will post the logs as soon as I manage to reproduce. > > >>> > > >> > > >> Perfect. > > >> > > >> Another option you can try to set is netmtu (1200 is usually safe). > > > > > > Finally I was able to reproduce the issue. > > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately > > > (not > > when node was up again). > > > > > > The corosync log with debug on is available at: > > > http://pastebin.com/kTpDqqtm > > > > > > > > > To be honest, I had to wait much longer for this reproduction as > > > before, > > even though there was no change in the corosync configuration - just > > potentially some system updates. But anyway, the issue is > > unfortunately still there. > > > Previously, when this issue came, cpu was at 100% on all nodes - > > > this time > > only on ctmgr, which was the DC... > > > > > > I hope you can find some useful details in the log. > > > > > > > Attila, > > what seems to be interesting is > > > > Configuration ERRORs found during PE processing. Please run "crm_verify - > L" > > to identify issues. > > > > I'm unsure how much is this problem but I'm really not pacemaker expert. > > Perhaps Andrew could comment on that. Any idea? 
> > > > > > Anyway, I have theory what may happening and it looks like related > > with IPC (and probably not related to network). But to make sure we > > will not try fixing already fixed bug, can you please build: > > - New libqb (0.17.0). There are plenty of fixes in IPC > > - Corosync 2.3.3 (already plenty IPC fixes) > > - And maybe also newer pacemaker > > > > I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from > Ubuntu package. > I am currently building libqb 0.17.0, will update you on the results. > > In the meantime we had another freeze, which did not seem to be related to > any restarts, but brought all coroync processes to 100%. > Please check out the corosync.log, perhaps it is a different cause: > http://pastebin.com/WMwzv0Rr > > > In the meantime I will install the new libqb and send logs if we have further > issues. > > Thank you very much for your help! > > Regards, > Attila > One more question: If I install libqb 0.17.0 from source, do I need to rebuild corosync as well, or if it was built with libqb 0.16.0 it will be fine? BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can see if it makes a difference. If I see crashes on the outdated ones, but not on the new ones, we are fine. :) Thanks, Attila > > > > I know you were not very happy using hand-compiled sources, but please > > give them at least a try. > > > > Thanks, > > Honza > > > > > Thanks, > > > Attila > > > > > > > > > > > >> > > >> Regards, > > >> Honza > > >> > > >>> > > >>> There are also a few things that might or might not be related: > > >>> > > >>> 1) Whenever I want to edit the configuration with "crm configure > > >>> edit", > > > > ... 
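On the rebuild question raised above: with libqb installed as a shared library, a corosync binary built against 0.16.0 only needs recompiling if the library's soname changed in 0.17.0. Both releases are believed to ship soname libqb.so.0, but that is worth verifying on the node itself with `ldd "$(command -v corosync)" | grep libqb`. A minimal sketch of that decision rule, using a hypothetical `needs_rebuild` helper (not a tool from corosync or libqb):

```shell
# Hedged sketch: a dynamically linked corosync picks up a new libqb at
# runtime as long as the soname is unchanged; a rebuild is only forced
# by an ABI break (soname bump).
needs_rebuild() {
    old_soname="$1"
    new_soname="$2"
    if [ "$old_soname" = "$new_soname" ]; then
        echo "no"    # same ABI: the dynamic linker resolves the new library
    else
        echo "yes"   # soname changed: relink/rebuild corosync
    fi
}

# Assuming both libqb 0.16.0 and 0.17.0 install libqb.so.0:
needs_rebuild "libqb.so.0" "libqb.so.0"   # prints: no
```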
Re: [Pacemaker] Pacemaker/corosync freeze
Hello, > -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Thursday, March 13, 2014 10:03 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > ... > > >>>> > >>>> Also can you please try to set debug: on in corosync.conf and paste > >>>> full corosync.log then? > >>> > >>> I set debug to on, and did a few restarts but could not reproduce > >>> the issue > >> yet - will post the logs as soon as I manage to reproduce. > >>> > >> > >> Perfect. > >> > >> Another option you can try to set is netmtu (1200 is usually safe). > > > > Finally I was able to reproduce the issue. > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not > when node was up again). > > > > The corosync log with debug on is available at: > > http://pastebin.com/kTpDqqtm > > > > > > To be honest, I had to wait much longer for this reproduction as before, > even though there was no change in the corosync configuration - just > potentially some system updates. But anyway, the issue is unfortunately still > there. > > Previously, when this issue came, cpu was at 100% on all nodes - this time > only on ctmgr, which was the DC... > > > > I hope you can find some useful details in the log. > > > > Attila, > what seems to be interesting is > > Configuration ERRORs found during PE processing. Please run "crm_verify -L" > to identify issues. > > I'm unsure how much is this problem but I'm really not pacemaker expert. Perhaps Andrew could comment on that. Any idea? > > Anyway, I have theory what may happening and it looks like related with IPC > (and probably not related to network). But to make sure we will not try fixing > already fixed bug, can you please build: > - New libqb (0.17.0). There are plenty of fixes in IPC > - Corosync 2.3.3 (already plenty IPC fixes) > - And maybe also newer pacemaker > I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from Ubuntu package. 
I am currently building libqb 0.17.0, will update you on the results. In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all corosync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr In the meantime I will install the new libqb and send logs if we have further issues. Thank you very much for your help! Regards, Attila > I know you were not very happy using hand-compiled sources, but please > give them at least a try. > > Thanks, > Honza > > > Thanks, > > Attila > > > > > > > >> > >> Regards, > >> Honza > >> > >>> > >>> There are also a few things that might or might not be related: > >>> > >>> 1) Whenever I want to edit the configuration with "crm configure > >>> edit", > > ...
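For reference, the two corosync.conf settings suggested in this exchange (debug logging and the netmtu workaround) look roughly like this. This is a sketch, not a complete configuration: the `to_logfile` and `logfile` lines are assumptions needed for the debug output to land in a file, and the path should be adjusted to your layout.

```
logging {
    debug: on            # the debug logging requested above
    to_logfile: yes      # assumed; directs output to a file
    logfile: /var/log/corosync/corosync.log   # assumed path
}

totem {
    netmtu: 1200         # "usually safe" per the suggestion above
}
```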
Re: [Pacemaker] Pacemaker/corosync freeze
... Also can you please try to set debug: on in corosync.conf and paste full corosync.log then? >>> >>> I set debug to on, and did a few restarts but could not reproduce the issue >> yet - will post the logs as soon as I manage to reproduce. >>> >> >> Perfect. >> >> Another option you can try to set is netmtu (1200 is usually safe). > > Finally I was able to reproduce the issue. > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when > node was up again). > > The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm > > > To be honest, I had to wait much longer for this reproduction as before, even > though there was no change in the corosync configuration - just potentially > some system updates. But anyway, the issue is unfortunately still there. > Previously, when this issue came, cpu was at 100% on all nodes - this time > only on ctmgr, which was the DC... > > I hope you can find some useful details in the log. > Attila, what seems to be interesting is Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues. I'm unsure how much is this problem but I'm really not pacemaker expert. Anyway, I have theory what may happening and it looks like related with IPC (and probably not related to network). But to make sure we will not try fixing already fixed bug, can you please build: - New libqb (0.17.0). There are plenty of fixes in IPC - Corosync 2.3.3 (already plenty IPC fixes) - And maybe also newer pacemaker I know you were not very happy using hand-compiled sources, but please give them at least a try. Thanks, Honza > Thanks, > Attila > > > >> >> Regards, >> Honza >> >>> >>> There are also a few things that might or might not be related: >>> >>> 1) Whenever I want to edit the configuration with "crm configure edit", ... 
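Collected in one place, the two crm_verify invocations referenced in this thread, wrapped in a hypothetical `verify_cmds` helper that only prints them (drop the `echo` to execute them on a cluster node; the pe-error path is the one quoted in the thread):

```shell
# Print the verification commands discussed above without running them,
# so the sketch is safe to execute on any machine.
verify_cmds() {
    # Check the live CIB for the configuration errors the PE complained about
    echo "crm_verify -L -V"
    # Re-check a saved policy-engine input file offline
    echo "crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -V"
}
verify_cmds
```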
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Wednesday, March 12, 2014 4:31 PM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Attila Megyeri napsal(a): > >> -Original Message- > >> From: Jan Friesse [mailto:jfrie...@redhat.com] > >> Sent: Wednesday, March 12, 2014 2:27 PM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> Attila Megyeri napsal(a): > >>> Hello Jan, > >>> > >>> Thank you very much for your help so far. > >>> > >>>> -Original Message- > >>>> From: Jan Friesse [mailto:jfrie...@redhat.com] > >>>> Sent: Wednesday, March 12, 2014 9:51 AM > >>>> To: The Pacemaker cluster resource manager > >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>> > >>>> Attila Megyeri napsal(a): > >>>>> > >>>>>> -Original Message- > >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>>>> Sent: Tuesday, March 11, 2014 10:27 PM > >>>>>> To: The Pacemaker cluster resource manager > >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>>>> > >>>>>> > >>>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>>>> > >>>>>>>> -Original Message- > >>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM > >>>>>>>> To: The Pacemaker cluster resource manager > >>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>>>>>> > >>>>>>>> > >>>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri > >>>>>>>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Thanks for the quick response! 
> >>>>>>>>> > >>>>>>>>>> -Original Message- > >>>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM > >>>>>>>>>> To: The Pacemaker cluster resource manager > >>>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri > >>>>>>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hello, > >>>>>>>>>>> > >>>>>>>>>>> We have a strange issue with Corosync/Pacemaker. > >>>>>>>>>>> From time to time, something unexpected happens and > >> suddenly > >>>> the > >>>>>>>>>> crm_mon output remains static. > >>>>>>>>>>> When I check the cpu usage, I see that one of the cores uses > >>>>>>>>>>> 100% cpu, but > >>>>>>>>>> cannot actually match it to either the corosync or one of the > >>>>>>>>>> pacemaker processes. > >>>>>>>>>>> > >>>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes. > >>>>>>>>>>> I have to manually go to each node, stop pacemaker, restart > >>>>>>>>>>> corosync, then > >>>>>>>>>> start pacemeker. Stoping pacemaker and corosync does not > work > >>>>>>>>>> in most of the cases, usually a kill -9 is needed. > >>>>>>>>>>> > >>>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > >>>>>>>>>>> > >>>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode > >>>
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): >> -Original Message- >> From: Jan Friesse [mailto:jfrie...@redhat.com] >> Sent: Wednesday, March 12, 2014 2:27 PM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> Attila Megyeri napsal(a): >>> Hello Jan, >>> >>> Thank you very much for your help so far. >>> >>>> -Original Message- >>>> From: Jan Friesse [mailto:jfrie...@redhat.com] >>>> Sent: Wednesday, March 12, 2014 9:51 AM >>>> To: The Pacemaker cluster resource manager >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>> >>>> Attila Megyeri napsal(a): >>>>> >>>>>> -----Original Message- >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>>>> Sent: Tuesday, March 11, 2014 10:27 PM >>>>>> To: The Pacemaker cluster resource manager >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>>>> >>>>>> >>>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri >>>>>> >>>>>> wrote: >>>>>> >>>>>>>> >>>>>>>> -Original Message- >>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM >>>>>>>> To: The Pacemaker cluster resource manager >>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>>>>>> >>>>>>>> >>>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri >>>>>>>> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thanks for the quick response! >>>>>>>>> >>>>>>>>>> -Original Message- >>>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM >>>>>>>>>> To: The Pacemaker cluster resource manager >>>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri >>>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hello, >>>>>>>>>>> >>>>>>>>>>> We have a strange issue with Corosync/Pacemaker. >>>>>>>>>>> From time to time, something unexpected happens and >> suddenly >>>> the >>>>>>>>>> crm_mon output remains static. 
>>>>>>>>>>> When I check the cpu usage, I see that one of the cores uses >>>>>>>>>>> 100% cpu, but >>>>>>>>>> cannot actually match it to either the corosync or one of the >>>>>>>>>> pacemaker processes. >>>>>>>>>>> >>>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes. >>>>>>>>>>> I have to manually go to each node, stop pacemaker, restart >>>>>>>>>>> corosync, then >>>>>>>>>> start pacemeker. Stoping pacemaker and corosync does not work >>>>>>>>>> in most of the cases, usually a kill -9 is needed. >>>>>>>>>>> >>>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. >>>>>>>>>>> >>>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode >>>> passive. >>>>>>>>>>> >>>>>>>>>>> Logs are usually flooded with CPG related messages, such as: >>>>>>>>>>> >>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >> Sent >>>> 0 >>>>>>>> CPG >>>>>>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >> Sent >>>> 0 >>>>
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Wednesday, March 12, 2014 2:27 PM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Attila Megyeri napsal(a): > > Hello Jan, > > > > Thank you very much for your help so far. > > > >> -Original Message- > >> From: Jan Friesse [mailto:jfrie...@redhat.com] > >> Sent: Wednesday, March 12, 2014 9:51 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> Attila Megyeri napsal(a): > >>> > >>>> -Original Message- > >>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>> Sent: Tuesday, March 11, 2014 10:27 PM > >>>> To: The Pacemaker cluster resource manager > >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>> > >>>> > >>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri > >>>> > >>>> wrote: > >>>> > >>>>>> > >>>>>> -Original Message- > >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>>>> Sent: Tuesday, March 11, 2014 12:48 AM > >>>>>> To: The Pacemaker cluster resource manager > >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>>>> > >>>>>> > >>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>>> Thanks for the quick response! > >>>>>>> > >>>>>>>> -Original Message- > >>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>>>>>> Sent: Friday, March 07, 2014 3:48 AM > >>>>>>>> To: The Pacemaker cluster resource manager > >>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>>>>>> > >>>>>>>> > >>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri > >>>>>>>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hello, > >>>>>>>>> > >>>>>>>>> We have a strange issue with Corosync/Pacemaker. > >>>>>>>>> From time to time, something unexpected happens and > suddenly > >> the > >>>>>>>> crm_mon output remains static. 
> >>>>>>>>> When I check the cpu usage, I see that one of the cores uses > >>>>>>>>> 100% cpu, but > >>>>>>>> cannot actually match it to either the corosync or one of the > >>>>>>>> pacemaker processes. > >>>>>>>>> > >>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes. > >>>>>>>>> I have to manually go to each node, stop pacemaker, restart > >>>>>>>>> corosync, then > >>>>>>>> start pacemeker. Stoping pacemaker and corosync does not work > >>>>>>>> in most of the cases, usually a kill -9 is needed. > >>>>>>>>> > >>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > >>>>>>>>> > >>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode > >> passive. > >>>>>>>>> > >>>>>>>>> Logs are usually flooded with CPG related messages, such as: > >>>>>>>>> > >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent > >> 0 > >>>>>> CPG > >>>>>>>> messages (1 remaining, last=8): Try again (6) > >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent > >> 0 > >>>>>> CPG > >>>>>>>> messages (1 remaining, last=8): Try again (6) > >>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > Sent > >> 0 > >>>>>> CPG > >>>>>&g
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): > Hello Jan, > > Thank you very much for your help so far. > >> -Original Message- >> From: Jan Friesse [mailto:jfrie...@redhat.com] >> Sent: Wednesday, March 12, 2014 9:51 AM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> Attila Megyeri napsal(a): >>> >>>> -Original Message- >>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>> Sent: Tuesday, March 11, 2014 10:27 PM >>>> To: The Pacemaker cluster resource manager >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>> >>>> >>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri >>>> >>>> wrote: >>>> >>>>>> >>>>>> -Original Message- >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>>>> Sent: Tuesday, March 11, 2014 12:48 AM >>>>>> To: The Pacemaker cluster resource manager >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>>>> >>>>>> >>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri >>>>>> >>>>>> wrote: >>>>>> >>>>>>> Thanks for the quick response! >>>>>>> >>>>>>>> -Original Message- >>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>>>>>> Sent: Friday, March 07, 2014 3:48 AM >>>>>>>> To: The Pacemaker cluster resource manager >>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>>>>>> >>>>>>>> >>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri >>>>>>>> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> We have a strange issue with Corosync/Pacemaker. >>>>>>>>> From time to time, something unexpected happens and suddenly >> the >>>>>>>> crm_mon output remains static. >>>>>>>>> When I check the cpu usage, I see that one of the cores uses >>>>>>>>> 100% cpu, but >>>>>>>> cannot actually match it to either the corosync or one of the >>>>>>>> pacemaker processes. >>>>>>>>> >>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes. >>>>>>>>> I have to manually go to each node, stop pacemaker, restart >>>>>>>>> corosync, then >>>>>>>> start pacemeker. 
Stoping pacemaker and corosync does not work in >>>>>>>> most of the cases, usually a kill -9 is needed. >>>>>>>>> >>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. >>>>>>>>> >>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode >> passive. >>>>>>>>> >>>>>>>>> Logs are usually flooded with CPG related messages, such as: >>>>>>>>> >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>>>> Sent >> 0 >>>>>> CPG >>>>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>>>> Sent >> 0 >>>>>> CPG >>>>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>>>> Sent >> 0 >>>>>> CPG >>>>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>>>> Sent >> 0 >>>>>> CPG >>>>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>>>> >>>>>>>>> OR >>>>>>>>> >>>>>>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_c
Re: [Pacemaker] Pacemaker/corosync freeze
Hello Jan, Thank you very much for your help so far. > -Original Message- > From: Jan Friesse [mailto:jfrie...@redhat.com] > Sent: Wednesday, March 12, 2014 9:51 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Attila Megyeri napsal(a): > > > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Tuesday, March 11, 2014 10:27 PM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 12 Mar 2014, at 1:54 am, Attila Megyeri > >> > >> wrote: > >> > >>>> > >>>> -Original Message- > >>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>> Sent: Tuesday, March 11, 2014 12:48 AM > >>>> To: The Pacemaker cluster resource manager > >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>> > >>>> > >>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri > >>>> > >>>> wrote: > >>>> > >>>>> Thanks for the quick response! > >>>>> > >>>>>> -Original Message- > >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>>>> Sent: Friday, March 07, 2014 3:48 AM > >>>>>> To: The Pacemaker cluster resource manager > >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>>>> > >>>>>> > >>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri > >>>>>> > >>>>>> wrote: > >>>>>> > >>>>>>> Hello, > >>>>>>> > >>>>>>> We have a strange issue with Corosync/Pacemaker. > >>>>>>> From time to time, something unexpected happens and suddenly > the > >>>>>> crm_mon output remains static. > >>>>>>> When I check the cpu usage, I see that one of the cores uses > >>>>>>> 100% cpu, but > >>>>>> cannot actually match it to either the corosync or one of the > >>>>>> pacemaker processes. > >>>>>>> > >>>>>>> In such a case, this high CPU usage is happening on all 7 nodes. > >>>>>>> I have to manually go to each node, stop pacemaker, restart > >>>>>>> corosync, then > >>>>>> start pacemeker. 
Stoping pacemaker and corosync does not work in > >>>>>> most of the cases, usually a kill -9 is needed. > >>>>>>> > >>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > >>>>>>> > >>>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode > passive. > >>>>>>> > >>>>>>> Logs are usually flooded with CPG related messages, such as: > >>>>>>> > >>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>>>> Sent > 0 > >>>> CPG > >>>>>> messages (1 remaining, last=8): Try again (6) > >>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>>>> Sent > 0 > >>>> CPG > >>>>>> messages (1 remaining, last=8): Try again (6) > >>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>>>> Sent > 0 > >>>> CPG > >>>>>> messages (1 remaining, last=8): Try again (6) > >>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>>>> Sent > 0 > >>>> CPG > >>>>>> messages (1 remaining, last=8): Try again (6) > >>>>>>> > >>>>>>> OR > >>>>>>> > >>>>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>>>>>> Sent 0 > >> CPG > >>>>>> messages (1 remaining, last=10933): Try again ( > >>>>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>>>>>> Sent 0 > >> CPG > >>>>>&
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): > >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Tuesday, March 11, 2014 10:27 PM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 12 Mar 2014, at 1:54 am, Attila Megyeri >> wrote: >> >>>> >>>> -Original Message- >>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>> Sent: Tuesday, March 11, 2014 12:48 AM >>>> To: The Pacemaker cluster resource manager >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>> >>>> >>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri >>>> wrote: >>>> >>>>> Thanks for the quick response! >>>>> >>>>>> -Original Message- >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>>>> Sent: Friday, March 07, 2014 3:48 AM >>>>>> To: The Pacemaker cluster resource manager >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>>>> >>>>>> >>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri >>>>>> >>>>>> wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> We have a strange issue with Corosync/Pacemaker. >>>>>>> From time to time, something unexpected happens and suddenly the >>>>>> crm_mon output remains static. >>>>>>> When I check the cpu usage, I see that one of the cores uses 100% >>>>>>> cpu, but >>>>>> cannot actually match it to either the corosync or one of the >>>>>> pacemaker processes. >>>>>>> >>>>>>> In such a case, this high CPU usage is happening on all 7 nodes. >>>>>>> I have to manually go to each node, stop pacemaker, restart >>>>>>> corosync, then >>>>>> start pacemeker. Stoping pacemaker and corosync does not work in >>>>>> most of the cases, usually a kill -9 is needed. >>>>>>> >>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. >>>>>>> >>>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
>>>>>>> >>>>>>> Logs are usually flooded with CPG related messages, such as: >>>>>>> >>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>> Sent 0 >>>> CPG >>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>> Sent 0 >>>> CPG >>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>> Sent 0 >>>> CPG >>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>>>> Sent 0 >>>> CPG >>>>>> messages (1 remaining, last=8): Try again (6) >>>>>>> >>>>>>> OR >>>>>>> >>>>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>>>>>> Sent 0 >> CPG >>>>>> messages (1 remaining, last=10933): Try again ( >>>>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>>>>>> Sent 0 >> CPG >>>>>> messages (1 remaining, last=10933): Try again ( >>>>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>>>>>> Sent 0 >> CPG >>>>>> messages (1 remaining, last=10933): Try again ( >>>>>> >>>>>> That is usually a symptom of corosync getting into a horribly >>>>>> confused >>>> state. >>>>>> Version? Distro? Have you checked for an update? >>>>>> Odd that the user of all that CPU isn't showing up though. >>>>>> >>>>>>> >>>>> >>>>> As I wrote I use Ubuntu trusty, the exact package versions are: >>>>
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Tuesday, March 11, 2014 10:27 PM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 12 Mar 2014, at 1:54 am, Attila Megyeri > wrote: > > >> > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Tuesday, March 11, 2014 12:48 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri > >> wrote: > >> > >>> Thanks for the quick response! > >>> > >>>> -Original Message- > >>>> From: Andrew Beekhof [mailto:and...@beekhof.net] > >>>> Sent: Friday, March 07, 2014 3:48 AM > >>>> To: The Pacemaker cluster resource manager > >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >>>> > >>>> > >>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri > >>>> > >>>> wrote: > >>>> > >>>>> Hello, > >>>>> > >>>>> We have a strange issue with Corosync/Pacemaker. > >>>>> From time to time, something unexpected happens and suddenly the > >>>> crm_mon output remains static. > >>>>> When I check the cpu usage, I see that one of the cores uses 100% > >>>>> cpu, but > >>>> cannot actually match it to either the corosync or one of the > >>>> pacemaker processes. > >>>>> > >>>>> In such a case, this high CPU usage is happening on all 7 nodes. > >>>>> I have to manually go to each node, stop pacemaker, restart > >>>>> corosync, then > >>>> start pacemeker. Stoping pacemaker and corosync does not work in > >>>> most of the cases, usually a kill -9 is needed. > >>>>> > >>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > >>>>> > >>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
> >>>>> > >>>>> Logs are usually flooded with CPG related messages, such as: > >>>>> > >>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>> Sent 0 > >> CPG > >>>> messages (1 remaining, last=8): Try again (6) > >>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>> Sent 0 > >> CPG > >>>> messages (1 remaining, last=8): Try again (6) > >>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>> Sent 0 > >> CPG > >>>> messages (1 remaining, last=8): Try again (6) > >>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>>>> Sent 0 > >> CPG > >>>> messages (1 remaining, last=8): Try again (6) > >>>>> > >>>>> OR > >>>>> > >>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>>>> Sent 0 > CPG > >>>> messages (1 remaining, last=10933): Try again ( > >>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>>>> Sent 0 > CPG > >>>> messages (1 remaining, last=10933): Try again ( > >>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>>>> Sent 0 > CPG > >>>> messages (1 remaining, last=10933): Try again ( > >>>> > >>>> That is usually a symptom of corosync getting into a horribly > >>>> confused > >> state. > >>>> Version? Distro? Have you checked for an update? > >>>> Odd that the user of all that CPU isn't showing up though. > >>>> > >>>>> > >>> > >>> As I wrote I use Ubuntu trusty, the exact package versions are: > >>> > >>> corosync 2.3.0-1ubuntu5 > >>> pacemaker 1.1.10+git20130802-1ubuntu2 > >> > >> Ah sorry, I seem to have missed that part. > >> > >>> > >>> There are no updates available. The only option is to install from > >>> sources, > >>
Re: [Pacemaker] Pacemaker/corosync freeze
On 12 Mar 2014, at 1:54 am, Attila Megyeri wrote: >> >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Tuesday, March 11, 2014 12:48 AM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri >> wrote: >> >>> Thanks for the quick response! >>> >>>> -Original Message- >>>> From: Andrew Beekhof [mailto:and...@beekhof.net] >>>> Sent: Friday, March 07, 2014 3:48 AM >>>> To: The Pacemaker cluster resource manager >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >>>> >>>> >>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri >>>> wrote: >>>> >>>>> Hello, >>>>> >>>>> We have a strange issue with Corosync/Pacemaker. >>>>> From time to time, something unexpected happens and suddenly the >>>> crm_mon output remains static. >>>>> When I check the cpu usage, I see that one of the cores uses 100% >>>>> cpu, but >>>> cannot actually match it to either the corosync or one of the >>>> pacemaker processes. >>>>> >>>>> In such a case, this high CPU usage is happening on all 7 nodes. >>>>> I have to manually go to each node, stop pacemaker, restart >>>>> corosync, then >>>> start pacemeker. Stoping pacemaker and corosync does not work in most >>>> of the cases, usually a kill -9 is needed. >>>>> >>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. >>>>> >>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
>>>>> >>>>> Logs are usually flooded with CPG related messages, such as: >>>>> >>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>> Sent 0 >> CPG >>>> messages (1 remaining, last=8): Try again (6) >>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>> Sent 0 >> CPG >>>> messages (1 remaining, last=8): Try again (6) >>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>> Sent 0 >> CPG >>>> messages (1 remaining, last=8): Try again (6) >>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>>>> Sent 0 >> CPG >>>> messages (1 remaining, last=8): Try again (6) >>>>> >>>>> OR >>>>> >>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>>>> Sent 0 CPG >>>> messages (1 remaining, last=10933): Try again ( >>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>>>> Sent 0 CPG >>>> messages (1 remaining, last=10933): Try again ( >>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>>>> Sent 0 CPG >>>> messages (1 remaining, last=10933): Try again ( >>>> >>>> That is usually a symptom of corosync getting into a horribly confused >> state. >>>> Version? Distro? Have you checked for an update? >>>> Odd that the user of all that CPU isn't showing up though. >>>> >>>>> >>> >>> As I wrote I use Ubuntu trusty, the exact package versions are: >>> >>> corosync 2.3.0-1ubuntu5 >>> pacemaker 1.1.10+git20130802-1ubuntu2 >> >> Ah sorry, I seem to have missed that part. >> >>> >>> There are no updates available. The only option is to install from sources, >> but that would be very difficult to maintain and I'm not sure I would get >> rid of >> this issue. >>> >>> What do you recommend? >> >> The same thing as Lars, or switch to a distro that stays current with >> upstream >> (git shows 5 newer releases for that branch since it was released 3 years >> ago). >> If you do build from source, its probably best to go with v1.4.6 > > Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. 
A bit distracted, sorry. > which was released approx. a year ago (you mention 3 years) and you recommend > 1.4.6, which is a rather old version. > Could you please clarify a bit? :) > Lars recommends 2.3.3 git
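The "Try again (6)" in the flood above is corosync's CS_ERR_TRY_AGAIN status: the CPG send buffer is full, so pacemaker's crm_cs_flush keeps its messages queued and retries. This is not pacemaker's actual source, just a minimal Python sketch of that queue-flush semantics; the constant values and the simulated always-refusing transport are illustrative assumptions.

```python
from collections import deque

CS_OK = 1
CS_ERR_TRY_AGAIN = 6  # corosync: "send buffer full, retry later"

def crm_cs_flush(queue, send):
    """Drain queued CPG messages; stop at the first TRY_AGAIN.

    Mirrors the log shape: 'Sent N CPG messages (M remaining, ...): Try again (6)'.
    """
    sent = 0
    while queue:
        if send(queue[0]) == CS_ERR_TRY_AGAIN:
            break
        queue.popleft()
        sent += 1
    return sent, len(queue)

# Simulated transport that always refuses: the queue never drains,
# which is exactly the repeating "Sent 0 ... (1 remaining)" flood above.
queue = deque(["msg-8"])
sent, remaining = crm_cs_flush(queue, lambda m: CS_ERR_TRY_AGAIN)
print(f"Sent {sent} CPG messages ({remaining} remaining): Try again ({CS_ERR_TRY_AGAIN})")
```

So the log lines themselves are only a symptom: corosync is refusing delivery, and pacemaker can do nothing but retry.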
Re: [Pacemaker] Pacemaker/corosync freeze
> -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Tuesday, March 11, 2014 12:48 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 7 Mar 2014, at 5:54 pm, Attila Megyeri > wrote: > > > Thanks for the quick response! > > > >> -Original Message- > >> From: Andrew Beekhof [mailto:and...@beekhof.net] > >> Sent: Friday, March 07, 2014 3:48 AM > >> To: The Pacemaker cluster resource manager > >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze > >> > >> > >> On 7 Mar 2014, at 5:31 am, Attila Megyeri > >> wrote: > >> > >>> Hello, > >>> > >>> We have a strange issue with Corosync/Pacemaker. > >>> From time to time, something unexpected happens and suddenly the > >> crm_mon output remains static. > >>> When I check the cpu usage, I see that one of the cores uses 100% > >>> cpu, but > >> cannot actually match it to either the corosync or one of the > >> pacemaker processes. > >>> > >>> In such a case, this high CPU usage is happening on all 7 nodes. > >>> I have to manually go to each node, stop pacemaker, restart > >>> corosync, then > >> start pacemeker. Stoping pacemaker and corosync does not work in most > >> of the cases, usually a kill -9 is needed. > >>> > >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > >>> > >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
> >>> > >>> Logs are usually flooded with CPG related messages, such as: > >>> > >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > >>> Sent 0 > CPG > >> messages (1 remaining, last=8): Try again (6) > >>> > >>> OR > >>> > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>> Sent 0 CPG > >> messages (1 remaining, last=10933): Try again ( > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>> Sent 0 CPG > >> messages (1 remaining, last=10933): Try again ( > >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > >>> Sent 0 CPG > >> messages (1 remaining, last=10933): Try again ( > >> > >> That is usually a symptom of corosync getting into a horribly confused > state. > >> Version? Distro? Have you checked for an update? > >> Odd that the user of all that CPU isn't showing up though. > >> > >>> > > > > As I wrote I use Ubuntu trusty, the exact package versions are: > > > > corosync 2.3.0-1ubuntu5 > > pacemaker 1.1.10+git20130802-1ubuntu2 > > Ah sorry, I seem to have missed that part. > > > > > There are no updates available. The only option is to install from sources, > but that would be very difficult to maintain and I'm not sure I would get rid > of > this issue. > > > > What do you recommend? > > The same thing as Lars, or switch to a distro that stays current with upstream > (git shows 5 newer releases for that branch since it was released 3 years > ago). > If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, which was released approx. 
a year ago (you mention 3 years) and you recommend 1.4.6, which is a rather old version. Could you please clarify a bit? :) Lars recommends 2.3.3 git tree. I might end up trying both, but just want to make sure I am not misunderstanding something badly. Thank you! > > > > > > >>> > >>> HTOP show something like this (sorted by TIME+ descending): > >>> > >>> > >>> > >>> 1 [100.0%] Tasks: 59, 4 > >> thr; 2 running > >>> 2 [| 0.7%] Load average
Re: [Pacemaker] Pacemaker/corosync freeze
On 7 Mar 2014, at 5:54 pm, Attila Megyeri wrote: > Thanks for the quick response! > >> -Original Message- >> From: Andrew Beekhof [mailto:and...@beekhof.net] >> Sent: Friday, March 07, 2014 3:48 AM >> To: The Pacemaker cluster resource manager >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze >> >> >> On 7 Mar 2014, at 5:31 am, Attila Megyeri >> wrote: >> >>> Hello, >>> >>> We have a strange issue with Corosync/Pacemaker. >>> From time to time, something unexpected happens and suddenly the >> crm_mon output remains static. >>> When I check the cpu usage, I see that one of the cores uses 100% cpu, but >> cannot actually match it to either the corosync or one of the pacemaker >> processes. >>> >>> In such a case, this high CPU usage is happening on all 7 nodes. >>> I have to manually go to each node, stop pacemaker, restart corosync, then >> start pacemeker. Stoping pacemaker and corosync does not work in most of >> the cases, usually a kill -9 is needed. >>> >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. >>> >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
>>> >>> Logs are usually flooded with CPG related messages, such as: >>> >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=8): Try again (6) >>> >>> OR >>> >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=10933): Try again ( >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=10933): Try again ( >>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: >>> Sent 0 CPG >> messages (1 remaining, last=10933): Try again ( >> >> That is usually a symptom of corosync getting into a horribly confused state. >> Version? Distro? Have you checked for an update? >> Odd that the user of all that CPU isn't showing up though. >> >>> > > As I wrote I use Ubuntu trusty, the exact package versions are: > > corosync 2.3.0-1ubuntu5 > pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. > > There are no updates available. The only option is to install from sources, > but that would be very difficult to maintain and I'm not sure I would get rid > of this issue. > > What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). 
If you do build from source, its probably best to go with v1.4.6 > > >>> >>> HTOP show something like this (sorted by TIME+ descending): >>> >>> >>> >>> 1 [100.0%] Tasks: 59, 4 >> thr; 2 running >>> 2 [| 0.7%] Load average: 1.00 >>> 0.99 1.02 >>> Mem[ 165/994MB] Uptime: 1 >> day, 10:22:03 >>> Swp[ 0/509MB] >>> >>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command >>> 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 >>> /usr/sbin/corosync >>> 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 >>> /usr/sbin/snmpd - >> Lsd -Lf /dev/null -u snmp -g snm >>> 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 >> /usr/lib/pacemaker/cib >>> 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 >> /usr/lib/pacemaker/stonithd >>> 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 >>> /usr/sbin/watchdog >>> 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 >> /usr/lib/pacemaker/crmd >>> 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 >> /usr/lib/pacemaker/lrmd >
Re: [Pacemaker] Pacemaker/corosync freeze
On 2014-03-07T09:08:41, Attila Megyeri wrote: > One more thing to add. I did an apt-get upgrade on one of the nodes, and then > restarted the node. It resulted in this state on all other nodes again... 2.3.0 is not the most recent corosync version. 2.3.3 (and possibly the git tree) contain quite a number of important fixes. I'd suggest to ask Ubuntu for an update - or to submit one yourself, community distributions welcome contributors ;-) Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Pacemaker/corosync freeze
One more thing to add. I did an apt-get upgrade on one of the nodes, and then restarted the node. It resulted in this state on all other nodes again... > -Original Message- > From: Attila Megyeri [mailto:amegy...@minerva-soft.com] > Sent: Friday, March 07, 2014 7:54 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > Thanks for the quick response! > > > -Original Message- > > From: Andrew Beekhof [mailto:and...@beekhof.net] > > Sent: Friday, March 07, 2014 3:48 AM > > To: The Pacemaker cluster resource manager > > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > > > > On 7 Mar 2014, at 5:31 am, Attila Megyeri > > wrote: > > > > > Hello, > > > > > > We have a strange issue with Corosync/Pacemaker. > > > From time to time, something unexpected happens and suddenly the > > crm_mon output remains static. > > > When I check the cpu usage, I see that one of the cores uses 100% > > > cpu, but > > cannot actually match it to either the corosync or one of the > > pacemaker processes. > > > > > > In such a case, this high CPU usage is happening on all 7 nodes. > > > I have to manually go to each node, stop pacemaker, restart > > > corosync, then > > start pacemeker. Stoping pacemaker and corosync does not work in most > > of the cases, usually a kill -9 is needed. > > > > > > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > > > > > > Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
> > > > > > Logs are usually flooded with CPG related messages, such as: > > > > > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > > > Sent 0 > CPG > > messages (1 remaining, last=8): Try again (6) > > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > > > Sent 0 > CPG > > messages (1 remaining, last=8): Try again (6) > > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > > > Sent 0 > CPG > > messages (1 remaining, last=8): Try again (6) > > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > > > Sent 0 > CPG > > messages (1 remaining, last=8): Try again (6) > > > > > > OR > > > > > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > > Sent 0 CPG > > messages (1 remaining, last=10933): Try again ( > > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > > Sent 0 CPG > > messages (1 remaining, last=10933): Try again ( > > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > > Sent 0 CPG > > messages (1 remaining, last=10933): Try again ( > > > > That is usually a symptom of corosync getting into a horribly confused > > state. > > Version? Distro? Have you checked for an update? > > Odd that the user of all that CPU isn't showing up though. > > > > > > > As I wrote I use Ubuntu trusty, the exact package versions are: > > corosync 2.3.0-1ubuntu5 > pacemaker 1.1.10+git20130802-1ubuntu2 > > There are no updates available. The only option is to install from sources, > but > that would be very difficult to maintain and I'm not sure I would get rid of > this > issue. > > What do you recommend? 
> > > > > > > > HTOP show something like this (sorted by TIME+ descending): > > > > > > > > > > > > 1 [100.0%] Tasks: 59, 4 > > thr; 2 running > > > 2 [| 0.7%] Load average: > > > 1.00 0.99 1.02 > > > Mem[ 165/994MB] Uptime: 1 > > day, 10:22:03 > > > Swp[ 0/509MB] > > > > > > PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command > > > 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 > /usr/sbin/corosync > > > 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 > > > /usr/sbin/snmpd - > > Lsd -Lf /dev/null -u snmp -g snm > > > 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 > > /usr/lib/pacemaker/cib > > > 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 > > /usr/lib/pacemaker/stonithd > > > 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 &
Re: [Pacemaker] Pacemaker/corosync freeze
Thanks for the quick response! > -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Friday, March 07, 2014 3:48 AM > To: The Pacemaker cluster resource manager > Subject: Re: [Pacemaker] Pacemaker/corosync freeze > > > On 7 Mar 2014, at 5:31 am, Attila Megyeri > wrote: > > > Hello, > > > > We have a strange issue with Corosync/Pacemaker. > > From time to time, something unexpected happens and suddenly the > crm_mon output remains static. > > When I check the cpu usage, I see that one of the cores uses 100% cpu, but > cannot actually match it to either the corosync or one of the pacemaker > processes. > > > > In such a case, this high CPU usage is happening on all 7 nodes. > > I have to manually go to each node, stop pacemaker, restart corosync, then > start pacemeker. Stoping pacemaker and corosync does not work in most of > the cases, usually a kill -9 is needed. > > > > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > > > > Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
> > > > Logs are usually flooded with CPG related messages, such as: > > > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 CPG > messages (1 remaining, last=8): Try again (6) > > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: > > Sent 0 CPG > messages (1 remaining, last=8): Try again (6) > > > > OR > > > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > Sent 0 CPG > messages (1 remaining, last=10933): Try again ( > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > Sent 0 CPG > messages (1 remaining, last=10933): Try again ( > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: > > Sent 0 CPG > messages (1 remaining, last=10933): Try again ( > > That is usually a symptom of corosync getting into a horribly confused state. > Version? Distro? Have you checked for an update? > Odd that the user of all that CPU isn't showing up though. > > > As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? 
> > > > HTOP show something like this (sorted by TIME+ descending): > > > > > > > > 1 [100.0%] Tasks: 59, 4 > thr; 2 running > > 2 [| 0.7%] Load average: > > 1.00 0.99 1.02 > > Mem[ 165/994MB] Uptime: 1 > day, 10:22:03 > > Swp[ 0/509MB] > > > > PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command > > 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 > > /usr/sbin/corosync > > 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 > > /usr/sbin/snmpd - > Lsd -Lf /dev/null -u snmp -g snm > > 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 > /usr/lib/pacemaker/cib > > 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 > /usr/lib/pacemaker/stonithd > > 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 > > /usr/sbin/watchdog > > 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 > /usr/lib/pacemaker/crmd > > 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 > /usr/lib/pacemaker/lrmd > > 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 > /usr/lib/pacemaker/attrd > > 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd > > 1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read > > process > > 1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 > /usr/lib/pacemaker/pengine > > 1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: > > write process > > 1835 ntp20 0 27216 1980 1408 S 0.0 0.2 0:11.80 > > /usr/sbin/ntpd -p > /var/run/ntpd.pid -g -u 105:112 > > 899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 > &
Re: [Pacemaker] Pacemaker/corosync freeze
On 7 Mar 2014, at 5:31 am, Attila Megyeri wrote: > Hello, > > We have a strange issue with Corosync/Pacemaker. > From time to time, something unexpected happens and suddenly the crm_mon > output remains static. > When I check the cpu usage, I see that one of the cores uses 100% cpu, but > cannot actually match it to either the corosync or one of the pacemaker > processes. > > In such a case, this high CPU usage is happening on all 7 nodes. > I have to manually go to each node, stop pacemaker, restart corosync, then > start pacemaker. Stopping pacemaker and corosync does not work in most of the > cases, usually a kill -9 is needed. > > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. > > Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. > > Logs are usually flooded with CPG related messages, such as: > > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent > 0 CPG messages (1 remaining, last=8): Try again (6) > Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent > 0 CPG messages (1 remaining, last=8): Try again (6) > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent > 0 CPG messages (1 remaining, last=8): Try again (6) > Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent > 0 CPG messages (1 remaining, last=8): Try again (6) > > OR > > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent > 0 CPG messages (1 remaining, last=10933): Try again ( > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent > 0 CPG messages (1 remaining, last=10933): Try again ( > Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent > 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. 
> > > HTOP show something like this (sorted by TIME+ descending): > > > > 1 [100.0%] Tasks: 59, 4 thr; 2 > running > 2 [| 0.7%] Load average: 1.00 > 0.99 1.02 > Mem[ 165/994MB] Uptime: 1 day, > 10:22:03 > Swp[ 0/509MB] > > PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command > 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 > /usr/sbin/corosync > 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd > -Lsd -Lf /dev/null -u snmp -g snm > 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 > /usr/lib/pacemaker/cib > 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 > /usr/lib/pacemaker/stonithd > 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 > /usr/sbin/watchdog > 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 > /usr/lib/pacemaker/crmd > 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 > /usr/lib/pacemaker/lrmd > 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 > /usr/lib/pacemaker/attrd > 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd > 1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read > process > 1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 > /usr/lib/pacemaker/pengine > 1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write > process > 1835 ntp20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd > -p /var/run/ntpd.pid -g -u 105:112 > 899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 > /usr/sbin/irqbalance > 1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit > -c /etc/monit/monitrc > 4374 kamailio 20 0 291M 7272 2188 S 0.0 0.7 0:02.77 > /usr/local/sbin/kamailio -f /etc/kamailio/kamaili > 3079 root0 -20 16864 4592 3508 S 0.0 0.5 0:01.51 /usr/bin/atop > -a -w /var/log/atop/atop_20140306 6 > 445 syslog 20 0 249M 6276 976 S 0.0 0.6 0:01.16 rsyslogd > 4373 kamailio 20 0 291M 7492 2396 S 0.0 0.7 0:01.03 > /usr/local/sbin/kamailio -f /etc/kamailio/kamaili > 1 root 20 0 33376 2632 1404 S 0.0 0.3 0:00.63 /sbin/init > 453 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.63 rsyslogd > 451 syslog 20 0 249M 
6276 976 S 0.0 0.6 0:00.53 rsyslogd > 4379 kamailio 20 0 291M 6224 1132 S 0.0 0.6 0:00.38 > /usr/local/sbin/kamailio -f /etc/kamailio/kamaili > 4380 kamailio 20 0 291M 8516 3084 S 0.0 0.8 0:00.38 > /usr/local/sbin/kamailio -f /etc/kamailio/kamaili > 4381 kamailio 20 0 291M 8252 2828 S 0.0 0.8 0:00.37 > /usr/local/sbin/kamailio -f /etc/kamailio/kamaili > 23315 root 20 0 24872 2476 1412 R 0.7 0.2 0:00.37 htop > 4367 kamailio 20 0
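For readers hitting the same symptom: the tell-tale sign in these logs is the same `last=` id repeating with "Sent 0" forever, meaning the CPG queue is truly stalled rather than merely slow. A minimal sketch that triages a log for that pattern; the regex only assumes the `crm_cs_flush: Sent N CPG messages (M remaining, last=X)` shape visible above, and the threshold is an arbitrary illustrative choice.

```python
import re
from collections import Counter

# Matches lines like:
# "... crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)"
FLUSH_RE = re.compile(
    r"crm_cs_flush:\s*Sent (\d+) CPG messages \((\d+) remaining, last=(\d+)\)"
)

def stuck_last_ids(log_lines, threshold=3):
    """Return the last= ids seen at least `threshold` times with 0 sent:
    those indicate a CPG queue that is not draining at all."""
    counts = Counter()
    for line in log_lines:
        m = FLUSH_RE.search(line)
        if m and m.group(1) == "0":
            counts[m.group(3)] += 1
    return {last for last, n in counts.items() if n >= threshold}

logs = [
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
]
print(stuck_last_ids(logs))  # → {'8'}
```

If this reports a stuck id across several minutes of log, the corosync layer (not pacemaker) is the component to restart or upgrade, which matches the advice given earlier in the thread.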