Re: [Pacemaker] Pacemaker/corosync freeze
Hi Attila,

Did you try compiling libqb 0.17.0? Wondering if that solved your issue? I also have the same issue. Please suggest if you already solved it.

Thanks,
Sreenivasa

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker/corosync freeze
Hello,

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Tuesday, March 18, 2014 2:43 AM
To: Attila Megyeri
Cc: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

> On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegy...@minerva-soft.com wrote:
>
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Thursday, March 13, 2014 10:03 AM
>
>>> ... Also, can you please try to set "debug: on" in corosync.conf and paste the full corosync.log then?
>>
>> I set debug to on and did a few restarts, but could not reproduce the issue yet - I will post the logs as soon as I manage to reproduce it.
>>
>>> Perfect. Another option you can try to set is netmtu (1200 is usually safe).
>>
>> Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and the CPU went to 100% immediately (not when the node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm
>>
>> To be honest, I had to wait much longer for this reproduction than before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. Previously, when this issue came, the CPU was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log.
>>
>>> Attila, what seems to be interesting is "Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues." I'm unsure how much of a problem this is, but I'm really no pacemaker expert. Perhaps Andrew could comment on that. Any idea?
>
> Did you run the command? What did it say?

Yes, all was fine. This is why I found it strange.
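For reference, the two corosync.conf settings discussed above ("debug: on" and a netmtu of 1200) would sit in a fragment like the following sketch; the totem and logging field names follow corosync.conf(5), and everything beyond those two settings (version, logfile path) is illustrative context, not taken from the thread:

```
# Illustrative corosync.conf fragment - only "debug: on" and
# "netmtu: 1200" come from this thread; the rest is example context.
totem {
    version: 2
    netmtu: 1200
}

logging {
    debug: on
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
}
```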
Re: [Pacemaker] Pacemaker/corosync freeze
On 18 Mar 2014, at 6:03 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

> Hello,
>
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 18, 2014 2:43 AM
>
> ...
>
>>> Attila, what seems to be interesting is "Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues." I'm unsure how much of a problem this is, but I'm really no pacemaker expert. Perhaps Andrew could comment on that. Any idea?
>>
>> Did you run the command? What did it say?
>
> Yes, all was fine. This is why I found it strange.

If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then I should be able to figure out what it was complaining about. (You can also run: crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -V)
Re: [Pacemaker] Pacemaker/corosync freeze
Hi Andrew,

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Tuesday, March 18, 2014 11:40 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

> ...
>
> If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then I should be able to figure out what it was complaining about. (You can also run: crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -V)

The file is still there, and the crm_verify check is successful (error 0) with no output. The file is full of confidential data, but if you think you can find something useful in it, I can send it in a direct mail.

Thanks!
Re: [Pacemaker] Pacemaker/corosync freeze
Hi David, Jan,

For the time being corosync 2.3.3 looks stable with libqb 0.17.0, both built from source. Thank you very much for the guidance!

Attila

-----Original Message-----
From: David Vossel [mailto:dvos...@redhat.com]
Sent: Thursday, March 13, 2014 9:22 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

> From: Jan Friesse jfrie...@redhat.com
> Sent: Thursday, March 13, 2014 4:03:28 AM
>
> ...
>
> Anyway, I have a theory about what may be happening, and it looks like it is related to IPC (and probably not related to the network). But to make sure we will not try fixing an already-fixed bug, can you please build:
> - New libqb (0.17.0). There are plenty of fixes in IPC.
> - Corosync 2.3.3 (already plenty of IPC fixes).

Yes, there was a libqb/corosync interoperation problem that showed these same symptoms last year. Updating to the latest corosync and libqb will likely resolve this.

> - And maybe also a newer pacemaker.
>
> I know you were not very happy using hand-compiled sources, but please give them at least a try.
>
> Thanks,
> Honza
Re: [Pacemaker] Pacemaker/corosync freeze
On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

> Hello,
>
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Thursday, March 13, 2014 10:03 AM
>
> ...
>
> Attila, what seems to be interesting is "Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues." I'm unsure how much of a problem this is, but I'm really no pacemaker expert. Perhaps Andrew could comment on that. Any idea?

Did you run the command? What did it say?
Re: [Pacemaker] Pacemaker/corosync freeze
Hello David,

-----Original Message-----
From: David Vossel [mailto:dvos...@redhat.com]
Sent: Thursday, March 13, 2014 9:22 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

> ...
>
>> Anyway, I have a theory about what may be happening, and it looks like it is related to IPC (and probably not related to the network). But to make sure we will not try fixing an already-fixed bug, can you please build:
>> - New libqb (0.17.0). There are plenty of fixes in IPC.
>> - Corosync 2.3.3 (already plenty of IPC fixes).
>
> Yes, there was a libqb/corosync interoperation problem that showed these same symptoms last year. Updating to the latest corosync and libqb will likely resolve this.

I have upgraded all nodes to these versions and we are testing. So far no issues. Thank you very much for your help.

Regards,
Attila
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a):

Hi Honza,

What I also found in the log related to the freeze at 12:22:26: "Corosync main process was not scheduled for ..." Can it be the general cause of the issue?

Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161
Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout increase.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2 (The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12:

Regards,
Attila

-----Original Message-----
From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Thursday, March 13, 2014 2:27 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

> ...
>
> I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the Ubuntu package. I am currently building libqb 0.17.0, will update you on the results.
>
> In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all corosync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr
>
> In the meantime I will install the new libqb and send logs if we have further issues. Thank you very much for your help!
>
> Regards,
> Attila

One more question: If I install libqb 0.17.0 from
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a):
> Hi Honza,
>
> What I also found in the log related to the freeze at 12:22:26: "Corosync main process was not scheduled for ..." Can it be the general cause of the issue?

I don't think it will cause the issue you are hitting, BUT keep in mind that if corosync is not scheduled for a long time, it is probably fenced by another node. So increasing the timeout is vital.

Honza

> Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout increase.
> Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state.
>
> ...
>
> I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the Ubuntu package. I am currently building libqb 0.17.0, will update you on the results. In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all corosync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr
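Honza's point is that a 6+ second scheduling pause blows past the 4000 ms threshold, so the other nodes lose the token and may fence this one; raising the token timeout gives corosync more headroom. A minimal corosync.conf sketch of the relevant field follows; the 10000 ms value is an illustrative assumption, not a recommendation from this thread:

```
totem {
    version: 2
    # Total token timeout in milliseconds. Raising it means a
    # multi-second scheduling pause no longer causes token loss
    # and a new membership round. 10000 is only an example value.
    token: 10000
}
```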
Re: [Pacemaker] Pacemaker/corosync freeze
> ... Also, can you please try to set "debug: on" in corosync.conf and paste the full corosync.log then?
>
>> I set debug to on and did a few restarts, but could not reproduce the issue yet - I will post the logs as soon as I manage to reproduce it.
>
> Perfect. Another option you can try to set is netmtu (1200 is usually safe).
>
>> Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and the CPU went to 100% immediately (not when the node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm
>>
>> To be honest, I had to wait much longer for this reproduction than before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. Previously, when this issue came, the CPU was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log.

Attila, what seems to be interesting is "Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues." I'm unsure how much of a problem this is, but I'm really no pacemaker expert.

Anyway, I have a theory about what may be happening, and it looks like it is related to IPC (and probably not related to the network). But to make sure we will not try fixing an already-fixed bug, can you please build:
- New libqb (0.17.0). There are plenty of fixes in IPC.
- Corosync 2.3.3 (already plenty of IPC fixes).
- And maybe also a newer pacemaker.

I know you were not very happy using hand-compiled sources, but please give them at least a try.

Thanks,
Honza
Re: [Pacemaker] Pacemaker/corosync freeze
Hello,

-----Original Message-----
From: Jan Friesse [mailto:jfrie...@redhat.com]
Sent: Thursday, March 13, 2014 10:03 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

> ...
>
> Anyway, I have a theory about what may be happening, and it looks like it is related to IPC (and probably not related to the network). But to make sure we will not try fixing an already-fixed bug, can you please build:
> - New libqb (0.17.0). There are plenty of fixes in IPC.
> - Corosync 2.3.3 (already plenty of IPC fixes).
> - And maybe also a newer pacemaker.

I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the Ubuntu package. I am currently building libqb 0.17.0, will update you on the results.

In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all corosync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr

In the meantime I will install the new libqb and send logs if we have further issues. Thank you very much for your help!

Regards,
Attila

> I know you were not very happy using hand-compiled sources, but please give them at least a try.
>
> Thanks,
> Honza
Re: [Pacemaker] Pacemaker/corosync freeze
-----Original Message-----
From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Thursday, March 13, 2014 1:45 PM
To: The Pacemaker cluster resource manager; Andrew Beekhof
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

> ...
>
> I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the Ubuntu package. I am currently building libqb 0.17.0, will update you on the results.
>
> In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all corosync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr
>
> In the meantime I will install the new libqb and send logs if we have further issues. Thank you very much for your help!
>
> Regards,
> Attila

One more question: If I install libqb 0.17.0 from source, do I need to rebuild corosync as well, or if it was built with libqb 0.16.0 will it be fine?

BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can see if it makes a difference. If I see crashes on the outdated ones, but not on the new ones, we are fine. :)

Thanks,
Attila
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
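For reference, both settings Honza suggests in the message above (debug logging and a conservative netmtu) live in corosync.conf. A minimal sketch only; the logfile path is an assumption, and all values should be adapted to the cluster at hand:

```
totem {
    version: 2
    transport: udpu      # as used by this cluster
    # Honza's suggestion: a conservative MTU to rule out fragmentation issues
    netmtu: 1200
}

logging {
    # Honza's suggestion: full debug output for the next reproduction
    debug: on
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log   # path is an assumption
}
```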
Re: [Pacemaker] Pacemaker/corosync freeze
Hi Honza, What I also found in the log related to the freeze at 12:22:26: Corosync main process was not scheduled for ... Can it be the general cause of the issue? Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161 Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.). Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state. 
Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12: Regards, Attila -Original Message- From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Thursday, March 13, 2014 2:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze -Original Message- From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Thursday, March 13, 2014 1:45 PM To: The Pacemaker cluster resource manager; Andrew Beekhof Subject: Re: [Pacemaker] Pacemaker/corosync freeze Hello, -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Thursday, March 13, 2014 10:03 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze ... Also can you please try to set debug: on in corosync.conf and paste full corosync.log then? I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce. Perfect. Another option you can try to set is netmtu (1200 is usually safe). Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm To be honest, I had to wait much longer for this reproduction as before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. 
Previously, when this issue came, cpu was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log. Attila, what seems to be interesting is Configuration ERRORs found during PE processing. Please run crm_verify - L to identify issues. I'm unsure how much is this problem but I'm really not pacemaker expert. Perhaps Andrew could comment on that. Any idea? Anyway, I have theory what may happening and it looks like related with IPC (and probably not related to network). But to make sure we will not try fixing already fixed bug, can you please build: - New libqb (0.17.0). There are plenty of fixes in IPC - Corosync 2.3.3 (already plenty IPC fixes) - And maybe also newer pacemaker I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from Ubuntu package. I am currently building libqb 0.17.0, will update you on the results. In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all coroync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr In the meantime I will install the new libqb and send logs if we have further issues. Thank you very much
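The "Consider token timeout increase" advice in the log excerpt above maps to the token setting in the totem section of corosync.conf; the scheduling-pause warning fires relative to a threshold derived from the configured token timeout. An illustrative fragment only; the 10000 ms value is a guess for discussion, not a tested recommendation:

```
totem {
    version: 2
    # Raising the token timeout gives corosync more headroom when the
    # scheduler stalls the main process (the log above shows a 6.3 s
    # stall against a 4 s threshold). Value below is illustrative only.
    token: 10000
}
```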
Re: [Pacemaker] Pacemaker/corosync freeze
- Original Message - From: Jan Friesse jfrie...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, March 13, 2014 4:03:28 AM Subject: Re: [Pacemaker] Pacemaker/corosync freeze ... Also can you please try to set debug: on in corosync.conf and paste full corosync.log then? I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce. Perfect. Another option you can try to set is netmtu (1200 is usually safe). Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm To be honest, I had to wait much longer for this reproduction as before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. Previously, when this issue came, cpu was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log. Attila, what seems to be interesting is Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues. I'm unsure how much is this problem but I'm really not pacemaker expert. Anyway, I have theory what may happening and it looks like related with IPC (and probably not related to network). But to make sure we will not try fixing already fixed bug, can you please build: - New libqb (0.17.0). There are plenty of fixes in IPC - Corosync 2.3.3 (already plenty IPC fixes) yes, there was a libqb/corosync interoperation problem that showed these same symptoms last year. Updating to the latest corosync and libqb will likely resolve this. - And maybe also newer pacemaker I know you were not very happy using hand-compiled sources, but please give them at least a try. 
Thanks, Honza Thanks, Attila Regards, Honza There are also a few things that might or might not be related: 1) Whenever I want to edit the configuration with crm configure edit, ... 
Re: [Pacemaker] Pacemaker/corosync freeze
-Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive. 
Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting? Thank you in advance. which was released approx. a year ago (you mention 3
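The "Sent 0 CPG messages ... Try again (6)" flood quoted above corresponds to corosync returning its "try again" error code (6) while pacemaker's crm_cs_flush keeps re-queueing the same messages. A rough Python sketch of that retry pattern, for illustration only; flush_queue and its parameters are made up here and do not mirror pacemaker's actual C implementation:

```python
import time

CS_ERR_TRY_AGAIN = 6  # corosync's "Try again" error code, as seen in the logs


def flush_queue(send_fn, queue, max_wait=1.0):
    """Drain a message queue, retrying while the transport reports
    CS_ERR_TRY_AGAIN -- roughly the behaviour behind the crm_cs_flush
    log lines. Returns the number of messages actually sent."""
    sent = 0
    deadline = time.monotonic() + max_wait
    while queue and time.monotonic() < deadline:
        rc = send_fn(queue[0])
        if rc == CS_ERR_TRY_AGAIN:
            time.sleep(0.01)  # back off briefly, retry the same message
            continue
        queue.pop(0)
        sent += 1
    return sent
```

In the wedged state described in this thread, send_fn effectively never stops returning 6, so the queue depth ("N remaining") never shrinks and the same line floods the log.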
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Attila, Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting? First of all, 1.x branch (flatiron) is maintained so even it looks like a old version, it's quite a new. It contains more or less only
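Since the stuck state always shows the same "N remaining" count repeating, the flooded logs above can be scanned mechanically. A hypothetical helper for that (stuck_queues and FLUSH_RE are invented for illustration, not pacemaker tools) which flags daemons whose CPG queue depth stops shrinking:

```python
import re

# Matches the crm_cs_flush lines flooding the logs above; captures the
# daemon field and the queue depth ("N remaining").
FLUSH_RE = re.compile(
    r"(\w+):\s+info:\s+crm_cs_flush:\s*Sent 0 CPG messages \((\d+) remaining"
)


def stuck_queues(log_lines, min_repeats=3):
    """Return daemons whose CPG queue depth repeats unchanged for
    min_repeats consecutive sightings -- a sign of the wedged state
    described in this thread."""
    last = {}  # daemon -> (depth, consecutive sightings at that depth)
    stuck = set()
    for line in log_lines:
        m = FLUSH_RE.search(line)
        if not m:
            continue
        daemon, depth = m.group(1), int(m.group(2))
        prev_depth, n = last.get(daemon, (None, 0))
        n = n + 1 if depth == prev_depth else 1
        last[daemon] = (depth, n)
        if n >= min_repeats:
            stuck.add(daemon)
    return stuck
```

Run against the excerpt above, the cib queue (stuck at 48 remaining) would be flagged, while a queue that is still draining would not.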
Re: [Pacemaker] Pacemaker/corosync freeze
Hello Jan, Thank you very much for your help so far. -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): Hello Jan, Thank you very much for your help so far. -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages
Re: [Pacemaker] Pacemaker/corosync freeze
-Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 2:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): Hello Jan, Thank you very much for your help so far. -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. 
Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 2:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): Hello Jan, Thank you very much for your help so far. -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. 
Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd
Re: [Pacemaker] Pacemaker/corosync freeze
-Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 4:31 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 2:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): Hello Jan, Thank you very much for your help so far. -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. 
I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. 
We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush
Re: [Pacemaker] Pacemaker/corosync freeze
-Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, which was released approx. a year ago (you mention 3 years) and you recommend 1.4.6, which is a rather old version. Could you please clarify a bit? :) Lars recommends 2.3.3 git tree. I might end up trying both, but just want to make sure I am not misunderstanding something badly. Thank you! 
HTOP shows something like this (sorted by TIME+ descending):

1 [100.0%] Tasks: 59, 4 thr; 2 running
2 [| 0.7%] Load average: 1.00 0.99 1.02
Mem[ 165/994MB] Uptime: 1 day, 10:22:03
Swp[ 0/509MB]

PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync
1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib
1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd
1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog
1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd
1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd
1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd
1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd
1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process
1315 hacluster 20 0 73892 2652
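A side note on reading that htop output: corosync is in state R with a cumulative TIME+ of 3h33:58, yet its CPU% column shows 0.0. The CPU% column is an instantaneous sample between refreshes, so cumulative CPU time is the more reliable signal here. That total can also be read straight from /proc/&lt;pid&gt;/stat; the following is a small illustrative sketch (the sample line is fabricated and shortened, real stat lines have around 50 fields):

```python
def cpu_seconds(stat_line, hz=100):
    """Cumulative CPU seconds (utime + stime) from a /proc/<pid>/stat line.

    Fields 14 and 15 (1-based) are utime and stime in clock ticks. The
    comm field is wrapped in parentheses and may contain spaces, so split
    on the last ')' first. hz is the kernel USER_HZ rate (typically 100).
    """
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state); utime/stime are fields 14/15
    utime, stime = int(after_comm[11]), int(after_comm[12])
    return (utime + stime) / hz

# Shortened, made-up stat line for illustration:
sample = "921 (corosync) R 1 921 921 0 -1 4194560 100 0 5 0 1200 80"
print(cpu_seconds(sample))  # -> 12.8
```

Comparing two readings of this value a second apart gives an accurate per-process CPU% even when a snapshot tool happens to sample between bursts.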
Re: [Pacemaker] Pacemaker/corosync freeze
On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. which was released approx. a year ago (you mention 3 years) and you recommend 1.4.6, which is a rather old version. Could you please clarify a bit? :) Lars recommends 2.3.3 git tree. 
I might end up trying both, but just want to make sure I am not misunderstanding something badly. Thank you! HTOP show something like this (sorted by TIME+ descending): 1 [100.0%] Tasks: 59, 4 thr; 2 running 2 [| 0.7%] Load average: 1.00 0.99 1.02 Mem[ 165/994MB] Uptime: 1 day, 10:22:03 Swp[ 0/509MB] PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd - Lsd -Lf /dev/null -u snmp -g snm 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd 1250 root 20 0 33000 1192 928 S 0.0 0.1
Re: [Pacemaker] Pacemaker/corosync freeze
On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. 
Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 HTOP show something like this (sorted by TIME+ descending): 1 [100.0%] Tasks: 59, 4 thr; 2 running 2 [| 0.7%] Load average: 1.00 0.99 1.02 Mem[ 165/994MB] Uptime: 1 day, 10:22:03 Swp[ 0/509MB] PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd - Lsd -Lf /dev/null -u snmp -g snm 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd 1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process 1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 /usr/lib/pacemaker/pengine 1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write process 1835 ntp20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112 899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 
/usr/sbin/irqbalance 1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc 4374 kamailio 20 0 291M 7272 2188 S 0.0 0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili 3079 root0 -20 16864 4592 3508 S 0.0 0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
Re: [Pacemaker] Pacemaker/corosync freeze
On 2014-03-07T09:08:41, Attila Megyeri amegy...@minerva-soft.com wrote:

One more thing to add. I did an apt-get upgrade on one of the nodes, and then restarted the node. It resulted in this state on all other nodes again...

2.3.0 is not the most recent corosync version. 2.3.3 (and possibly the git tree) contains quite a number of important fixes. I'd suggest asking Ubuntu for an update - or submitting one yourself; community distributions welcome contributors ;-)

Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Pacemaker/corosync freeze
One more thing to add. I did an apt-get upgrade on one of the nodes, and then restarted the node. It resulted in this state on all other nodes again... -Original Message- From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Friday, March 07, 2014 7:54 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. 
Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? 
HTOP show something like this (sorted by TIME+ descending): 1 [100.0%] Tasks: 59, 4 thr; 2 running 2 [| 0.7%] Load average: 1.00 0.99 1.02 Mem[ 165/994MB] Uptime: 1 day, 10:22:03 Swp[ 0/509MB] PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync 1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd - Lsd -Lf /dev/null -u snmp -g snm 1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib 1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd 1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog 1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd 1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd 1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd 1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd 1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process 1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 /usr/lib/pacemaker/pengine 1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write process 1835 ntp20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112 899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 /usr/sbin/irqbalance 1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc 4374 kamailio 20 0
[Pacemaker] Pacemaker/corosync freeze
Hello,

We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the CPU usage, I see that one of the cores is at 100%, but I cannot actually match it to either corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes.

I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most of the cases; usually a kill -9 is needed.

Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.

Logs are usually flooded with CPG-related messages, such as:

Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)

OR

Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (

HTOP shows something like this (sorted by TIME+ descending):

1 [100.0%] Tasks: 59, 4 thr; 2 running
2 [| 0.7%] Load average: 1.00 0.99 1.02
Mem[ 165/994MB] Uptime: 1 day, 10:22:03
Swp[ 0/509MB]

PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
921 root 20 0 188M 49220 33856 R 0.0 4.8 3h33:58 /usr/sbin/corosync
1277 snmp 20 0 45708 4248 1472 S 0.0 0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
1311 hacluster 20 0 109M 16160 9640 S 0.0 1.6 1:12.71 /usr/lib/pacemaker/cib
1312 root 20 0 104M 7484 3780 S 0.0 0.7 0:38.06 /usr/lib/pacemaker/stonithd
1611 root -2 0 4408 2356 2000 S 0.0 0.2 0:24.15 /usr/sbin/watchdog
1316 hacluster 20 0 122M 9756 5924 S 0.0 1.0 0:22.62 /usr/lib/pacemaker/crmd
1313 root 20 0 81784 3800 2876 S 0.0 0.4 0:18.64 /usr/lib/pacemaker/lrmd
1314 hacluster 20 0 96616 4132 2604 S 0.0 0.4 0:16.01 /usr/lib/pacemaker/attrd
1309 root 20 0 104M 4804 2580 S 0.0 0.5 0:15.56 pacemakerd
1250 root 20 0 33000 1192 928 S 0.0 0.1 0:13.59 ha_logd: read process
1315 hacluster 20 0 73892 2652 1952 S 0.0 0.3 0:13.25 /usr/lib/pacemaker/pengine
1252 root 20 0 33000 712 456 S 0.0 0.1 0:13.03 ha_logd: write process
1835 ntp 20 0 27216 1980 1408 S 0.0 0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
899 root 20 0 19168 700 488 S 0.0 0.1 0:09.75 /usr/sbin/irqbalance
1642 root 20 0 30696 1556 912 S 0.0 0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc
4374 kamailio 20 0 291M 7272 2188 S 0.0 0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
3079 root 0 -20 16864 4592 3508 S 0.0 0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
445 syslog 20 0 249M 6276 976 S 0.0 0.6 0:01.16 rsyslogd
4373 kamailio 20 0 291M 7492 2396 S 0.0 0.7 0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
1 root 20 0 33376 2632 1404 S 0.0 0.3 0:00.63 /sbin/init
453 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.63 rsyslogd
451 syslog 20 0 249M 6276 976 S 0.0 0.6 0:00.53 rsyslogd
4379 kamailio 20 0 291M 6224 1132 S 0.0 0.6 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4380 kamailio 20 0 291M 8516 3084 S 0.0 0.8 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4381 kamailio 20 0 291M 8252 2828 S 0.0 0.8 0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
23315 root 20 0 24872 2476 1412 R 0.7 0.2 0:00.37 htop
4367 kamailio 20 0 291M 1 4864 S 0.0 1.0 0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili

My questions:
- Is this a corosync or pacemaker issue?
- What are the CPG messages? Is it possible that we have a firewall issue?

Any hints would be great!

Thanks,
Attila
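For readers trying to reproduce this, the topology described in the post (udpu transport, two rings, rrp_mode passive) corresponds to a corosync.conf roughly along these lines. This is a hypothetical sketch: the networks, node addresses, and the commented-out netmtu value (suggested elsewhere in this thread as "usually safe") are placeholders, not the poster's actual configuration.

```
totem {
    version: 2
    transport: udpu
    rrp_mode: passive
    # netmtu: 1200            # suggested elsewhere in the thread

    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0     # placeholder ring-0 network
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.1.0     # placeholder ring-1 network
        mcastport: 5405
    }
}

nodelist {
    node {
        ring0_addr: 10.0.0.1      # placeholder; one block per node
        ring1_addr: 10.0.1.1
    }
}

quorum {
    provider: corosync_votequorum
}
```

With udpu, every member must be listed in nodelist, and the firewall must allow UDP on the configured port (5405 by default) on both rings - directly relevant to the firewall question asked in the post.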
Re: [Pacemaker] Pacemaker/corosync freeze
On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. 
HTOP shows something like this (sorted by TIME+ descending):

  1  [||100.0%]        Tasks: 59, 4 thr; 2 running
  2  [|    0.7%]       Load average: 1.00 0.99 1.02
  Mem[ 165/994MB]      Uptime: 1 day, 10:22:03
  Swp[   0/509MB]

  PID   USER      PRI NI  VIRT  RES   SHR   S CPU% MEM% TIME+   Command
  921   root      20  0   188M  49220 33856 R 0.0  4.8  3h33:58 /usr/sbin/corosync
  1277  snmp      20  0   45708 4248  1472  S 0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
  1311  hacluster 20  0   109M  16160 9640  S 0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
  1312  root      20  0   104M  7484  3780  S 0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
  1611  root      -2  0   4408  2356  2000  S 0.0  0.2  0:24.15 /usr/sbin/watchdog
  1316  hacluster 20  0   122M  9756  5924  S 0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
  1313  root      20  0   81784 3800  2876  S 0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
  1314  hacluster 20  0   96616 4132  2604  S 0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
  1309  root      20  0   104M  4804  2580  S 0.0  0.5  0:15.56 pacemakerd
  1250  root      20  0   33000 1192  928   S 0.0  0.1  0:13.59 ha_logd: read process
  1315  hacluster 20  0   73892 2652  1952  S 0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
  1252  root      20  0   33000 712   456   S 0.0  0.1  0:13.03 ha_logd: write process
  1835  ntp       20  0   27216 1980  1408  S 0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
  899   root      20  0   19168 700   488   S 0.0  0.1  0:09.75 /usr/sbin/irqbalance
  1642  root      20  0   30696 1556  912   S 0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
  4374  kamailio  20  0   291M  7272  2188  S 0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  3079  root      0   -20 16864 4592  3508  S 0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
  445   syslog    20  0   249M  6276  976   S 0.0  0.6  0:01.16 rsyslogd
  4373  kamailio  20  0   291M  7492  2396  S 0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  1     root      20  0   33376 2632  1404  S 0.0  0.3  0:00.63 /sbin/init
  453   syslog    20  0   249M  6276  976   S 0.0  0.6  0:00.63 rsyslogd
  451   syslog    20  0   249M  6276  976   S 0.0  0.6  0:00.53 rsyslogd
  4379  kamailio  20  0   291M  6224  1132  S 0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  4380  kamailio  20  0   291M  8516  3084  S 0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  4381  kamailio  20  0   291M  8252  2828  S 0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  23315 root      20  0   24872 2476  1412  R 0.7  0.2  0:00.37 htop
  4367  kamailio  20  0   291M  1 4864      S 0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f
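A note on the puzzle above (100% on one core but no process showing the CPU%): htop's default view aggregates CPU per process, so a single spinning thread, or time accounted to a kernel thread, can stay hidden. A generic per-thread listing from procps (not something used in this thread, just a standard diagnostic sketch) would make the consumer visible:

```shell
# Per-thread CPU usage, busiest first; -L expands threads (TID column).
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -n 10

# Processes currently in the run queue (STAT contains 'R'):
ps -eo pid,stat,pcpu,comm --sort=-pcpu | awk '$2 ~ /R/ {print}'
```

In htop itself, enabling "Show custom thread names" and disabling "Hide kernel threads" in Setup serves the same purpose.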
Re: [Pacemaker] Pacemaker/corosync freeze
Thanks for the quick response!

-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Friday, March 07, 2014 3:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote:

...

That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update?
Odd that the user of all that CPU isn't showing up though.

As I wrote, I use Ubuntu trusty; the exact package versions are:

corosync 2.3.0-1ubuntu5
pacemaker 1.1.10+git20130802-1ubuntu2

There are no updates available. The only option is to install from sources, but that would be very difficult to maintain, and I'm not sure it would get rid of this issue. What do you recommend?

HTOP shows something like this (sorted by TIME+ descending):

...
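The per-node recovery procedure described earlier in the thread (stop pacemaker, restart corosync, start pacemaker, falling back to kill -9 when a clean stop hangs) could be sketched roughly as below. This is an illustrative sketch only, not an official procedure: the service names match a sysvinit-style Ubuntu trusty setup, the process list is an assumption, and DRY_RUN (on by default here) only prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch of the manual per-node recovery described in this thread.
# Defaults to a dry run; set DRY_RUN=0 to actually execute on a node.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Try a clean stop: pacemaker first, then corosync.
run service pacemaker stop
run service corosync stop

# 2. Fall back to SIGKILL for anything that survived
#    (as the thread reports is usually necessary).
for proc in pacemakerd crmd cib stonithd lrmd attrd pengine corosync; do
    run pkill -9 -x "$proc"
done

# 3. Bring the stack back up: corosync first, then pacemaker.
run service corosync start
run service pacemaker start
```

With DRY_RUN left at 1 the script only prints the command sequence, which makes it safe to review before pasting onto a wedged node.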