Attila Megyeri wrote:
>> -----Original Message-----
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Wednesday, March 12, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri wrote:
>>> Hello Jan,
>>>
>>> Thank you very much for your help so far.
>>>
>>>> -----Original Message-----
>>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>>>> Sent: Wednesday, March 12, 2014 9:51 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> Attila Megyeri wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>> Sent: Tuesday, March 11, 2014 10:27 PM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>
>>>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the quick response!
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>>>
>>>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>>>>>> From time to time, something unexpected happens and suddenly the crm_mon output remains static.
>>>>>>>>>>> When I check the CPU usage, I see that one of the cores is at 100%, but I cannot actually match it to either corosync or one of the pacemaker processes.
>>>>>>>>>>>
>>>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>>>>>> I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most of the cases; usually a kill -9 is needed.
>>>>>>>>>>>
>>>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>>>>>
>>>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
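The setup just described (udpu transport, two rings, rrp_mode passive) would typically be expressed in corosync.conf along the following lines. This is only an illustrative sketch: the subnets, port and node addresses are placeholders, not the actual configuration, which is linked later in the thread (http://pastebin.com/dMivQJn5).

    totem {
        version: 2
        transport: udpu
        rrp_mode: passive

        interface {
            ringnumber: 0
            bindnetaddr: 10.9.1.0      # placeholder subnet for ring 0
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.9.2.0      # placeholder subnet for ring 1
            mcastport: 5405
        }
    }

    nodelist {
        node {
            ring0_addr: 10.9.1.3       # placeholder addresses
            ring1_addr: 10.9.2.3
        }
        # ... one node { } block per cluster member
    }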
>>>>>>>>>>>
>>>>>>>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>>>>>>>>
>>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>>>
>>>>>>>>>>> OR
>>>>>>>>>>>
>>>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>>>
>>>>>>>>>> That is usually a symptom of corosync getting into a horribly confused state.
>>>>>>>>>> Version? Distro? Have you checked for an update?
>>>>>>>>>> Odd that the user of all that CPU isn't showing up though.
>>>>>>>>>
>>>>>>>>> As I wrote, I use Ubuntu trusty; the exact package versions are:
>>>>>>>>>
>>>>>>>>> corosync 2.3.0-1ubuntu5
>>>>>>>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>>>>>>>
>>>>>>>> Ah sorry, I seem to have missed that part.
>>>>>>>>
>>>>>>>>> There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure it would get rid of this issue.
>>>>>>>>>
>>>>>>>>> What do you recommend?
>>>>>>>>
>>>>>>>> The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago).
>>>>>>>> If you do build from source, it's probably best to go with v1.4.6.
>>>>>>>
>>>>>>> Hm, I am a bit confused here. We are using 2.3.0,
>>>>>>
>>>>>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
>>>>>
>>>>> I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but it is still the same issue - after some time CPU goes to 100%, and the corosync log is flooded with messages like:
>>>>>
>>>>> Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>> Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>>>> Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>> Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>> Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>>
>>>> Attila,
>>>>
>>>>> Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?
>>>>
>>>> First of all, the 1.x branch (flatiron) is maintained, so even though it looks like an old version it is quite new. It contains more or less only bugfixes.
>>>
>>> OK - the next thing I will try will be to downgrade to 1.4.6 if the troubleshooting does not bring us closer.
>>> Actually we have a couple of clusters running 1.4.2, but the stack is "openais", not corosync. Currently we use "corosync".
>>>
>>>> The 2.x branch (needle) contains not only bugfixes but also new features.
>>>>
>>>> Keep in mind that with 1.x you need to use cman as the quorum provider (2.x contains quorum in the base).
>>>>
>>>> There are no big differences in the build.
>>>>
>>>> But back to your original question. Of course troubleshooting is always better.
>>>>
>>>> The "Try again" error (6) happens when corosync is in the sync state. This occurs when a NEW node is discovered or there is a network split/merge, and it usually takes only a few milliseconds. The problem you are hitting is usually caused by some network issue.
>>>
>>> I can confirm this. The 100% CPU issue happens when I restart one of the nodes. It seems to happen when a given node comes back up and a new membership is about to be formed.
>>>
>>>> So first of all take a look at corosync.log (/var/log/cluster/corosync.log). Do you see some warning/error there?
>>>
>>> Not really. I reproduced a case so you can see for yourself.
>>> Initially I had a stable cluster.
>>> At 10:42:39 I did a reboot on the "ctsip1" node. All was fine until the node came back up (around 10:43:00). At this point, the CPU usage went to 100% and corosync stopped working properly.
>>>
>>> Here is the relevant corosync.log: http://pastebin.com/HJENEdZj
>>>
>>
>> Does that log file continue somehow? I mean, the interesting part is:
>> Mar 12 10:43:00 [973] ctdb2 corosync notice [TOTEM ] A new membership (10.9.1.3:1592) was formed. Members joined: 168362281
>
> That was all I sent. From 43:00 the corosync process appears to be unresponsive. One of the cores was at 100% CPU, but from htop, top or similar apps it is impossible to guess which process it is. Of course, killing corosync with -9 lowers the CPU usage.
>
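For context on the flood of "Try again (6)" messages quoted above: 6 is CS_ERR_TRY_AGAIN from the corosync C API, the return code cpg_mcast_joined() gives while corosync is synchronizing (as described above) and cannot accept new messages; Pacemaker's crm_cs_flush keeps the message queued and retries. Below is a minimal standalone sketch of that send-and-retry pattern, for illustration only - it is not Pacemaker's actual implementation, and the group name, back-off interval and retry limit are arbitrary (build with -lcpg):

    /* Illustrative only: a minimal CPG send with retry on CS_ERR_TRY_AGAIN. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/uio.h>
    #include <corosync/cpg.h>

    static int send_with_retry(cpg_handle_t handle, const char *buf, size_t len)
    {
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        cs_error_t rc;
        int attempts = 0;

        do {
            rc = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
            if (rc == CS_ERR_TRY_AGAIN) {
                /* Corosync is in its sync phase (membership change in
                 * progress); normally this clears within milliseconds. */
                fprintf(stderr, "Sent 0 CPG messages (1 remaining): Try again (%d)\n", rc);
                usleep(100000);  /* back off 100 ms before retrying */
            }
        } while (rc == CS_ERR_TRY_AGAIN && ++attempts < 50);

        return (rc == CS_OK) ? 0 : -1;
    }

    int main(void)
    {
        cpg_handle_t handle;
        cpg_callbacks_t callbacks = { 0 };   /* no deliver/confchg callbacks needed here */
        struct cpg_name group;

        if (cpg_initialize(&handle, &callbacks) != CS_OK) {
            fprintf(stderr, "cpg_initialize failed\n");
            return 1;
        }
        strcpy(group.value, "demo_group");   /* arbitrary group name */
        group.length = strlen(group.value);

        if (cpg_join(handle, &group) != CS_OK) {
            fprintf(stderr, "cpg_join failed\n");
            return 1;
        }
        send_with_retry(handle, "hello", 5);
        cpg_finalize(handle);
        return 0;
    }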
That sounds very bad.

>>
>> That means a new membership was formed and sync is now running, but there is no counterpart, which would look like:
>>
>> Mar 12 10:42:40 [973] ctdb2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
>
>>
>> Can you please try to remove rrp and use pure srp?
>
> If I understand correctly, you are asking me to remove the redundant link (the second interface block from the .conf) completely, right?
>

Yes, exactly. I don't think it will help, but it's at least good to try.

>
>>
>> Also, can you please try to set debug: on in corosync.conf and paste the full corosync.log then?
>
> I set debug to on, and did a few restarts, but could not reproduce the issue yet - I will post the logs as soon as I manage to reproduce it.
>

Perfect. Another option you can try to set is netmtu (1200 is usually safe).
(A rough corosync.conf sketch of these changes is shown further below, after the quoted text.)

Regards,
  Honza

>
> There are also a few things that might or might not be related:
>
> 1) Whenever I want to edit the configuration with "crm configure edit", upon save I get a similar error:
> " ERROR: 47: duplicate element cib-bootstrap-options
> Do you want to edit again? "
> But there is no such duplicate element as far as I can tell. This might be a crmsh issue, and not related to corosync at all; I am just mentioning it.
>
> 2)
> "Mar 11 21:31:11 [4797] ctdb2 pengine: error: process_pe_message: Calculated Transition 27: /var/lib/pacemaker/pengine/pe-error-7.bz2
> Mar 11 21:31:11 [4797] ctdb2 pengine: notice: process_pe_message: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues."
>
> But crm_verify -L shows no problems at all...
>
>>
>> Regards,
>> Honza
>>
>>>
>>>> What transport are you using? Multicast (udp) or unicast (udpu)?
>>>>
>>>> Can you please paste your corosync.conf?
>>>
>>> We use udpu, since the servers are in different subnets and multicast did not work as expected. (In our other systems we use multicast.)
>>>
>>> The corosync.conf is at: http://pastebin.com/dMivQJn5
>>>
>>> Thank you in advance,
>>>
>>> Regards,
>>> Attila
>>>
>>>> Regards,
>>>> Honza
>>>>
>>>>> Thank you in advance.
>>>>>
>>>>>>
>>>>>>> which was released approx. a year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old version.
>>>>>>> Could you please clarify a bit? :) Lars recommends the 2.3.3 git tree.
>>>>>>> I might end up trying both, but just want to make sure I am not misunderstanding something badly.
>>>>>>>
>>>>>>> Thank you!
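As a rough illustration of the three suggestions above (drop the second ring, set debug: on, set netmtu), the relevant corosync.conf fragments would look something like the following. This is only a sketch with placeholder addresses, not the poster's real configuration (that one is at http://pastebin.com/dMivQJn5):

    totem {
        version: 2
        transport: udpu
        netmtu: 1200               # suggested test value; the default is 1500
        # rrp_mode: passive        # rrp dropped: keep only ring 0 below

        interface {
            ringnumber: 0
            bindnetaddr: 10.9.1.0  # placeholder subnet
            mcastport: 5405
        }
        # the second interface { ringnumber: 1 ... } block is removed for the test
    }

    logging {
        debug: on                  # turn on debug logging for the test
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
    }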
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> HTOP shows something like this (sorted by TIME+ descending):
>>>>>>>>>>>
>>>>>>>>>>>   1  [||||||||||||||||||||||||||||||||||||||||100.0%]   Tasks: 59, 4 thr; 2 running
>>>>>>>>>>>   2  [|                                          0.7%]   Load average: 1.00 0.99 1.02
>>>>>>>>>>>   Mem[||||||||||||||||||||||||||||||||        165/994MB]   Uptime: 1 day, 10:22:03
>>>>>>>>>>>   Swp[                                          0/509MB]
>>>>>>>>>>>
>>>>>>>>>>>   PID USER      PRI NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>>>>>>>>>   921 root       20  0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
>>>>>>>>>>>  1277 snmp       20  0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>>>>>>>>>  1311 hacluster  20  0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
>>>>>>>>>>>  1312 root       20  0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
>>>>>>>>>>>  1611 root       -2  0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
>>>>>>>>>>>  1316 hacluster  20  0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
>>>>>>>>>>>  1313 root       20  0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
>>>>>>>>>>>  1314 hacluster  20  0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
>>>>>>>>>>>  1309 root       20  0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
>>>>>>>>>>>  1250 root       20  0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
>>>>>>>>>>>  1315 hacluster  20  0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
>>>>>>>>>>>  1252 root       20  0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
>>>>>>>>>>>  1835 ntp        20  0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>>>>>>>>>   899 root       20  0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
>>>>>>>>>>>  1642 root       20  0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>>>>>>>>>  4374 kamailio   20  0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>>  3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>>>>>>>>>   445 syslog     20  0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
>>>>>>>>>>>  4373 kamailio   20  0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>>     1 root       20  0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
>>>>>>>>>>>   453 syslog     20  0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
>>>>>>>>>>>   451 syslog     20  0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
>>>>>>>>>>>  4379 kamailio   20  0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>>  4380 kamailio   20  0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>>  4381 kamailio   20  0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>> 23315 root       20  0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
>>>>>>>>>>>  4367 kamailio   20  0  291M 10000  4864 S  0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>>>
>>>>>>>>>>> My questions:
>>>>>>>>>>> - Is this a corosync or pacemaker issue?
>>>>>>>>>>> - What are the CPG messages? Is it possible that we have a firewall issue?
>>>>>>>>>>>
>>>>>>>>>>> Any hints would be great!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Attila

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org