Attila Megyeri wrote:
> Hello Jan,
>
> Thank you very much for your help so far.
>
>> -----Original Message-----
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Wednesday, March 12, 2014 9:51 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri wrote:
>>>
>>>> -----Original Message-----
>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>> Sent: Tuesday, March 11, 2014 10:27 PM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>>>>>>
>>>>>>> Thanks for the quick response!
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>
>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>>>> From time to time, something unexpected happens and suddenly the crm_mon output remains static.
>>>>>>>>> When I check the CPU usage, I see that one of the cores uses 100% CPU, but I cannot actually match it to either the corosync or one of the pacemaker processes.
>>>>>>>>>
>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>>>> I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most of the cases; usually a kill -9 is needed.
>>>>>>>>>
>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>>>
>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>>>>>>>
>>>>>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>>>>>>
>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>>>>>>>>>
>>>>>>>>> OR
>>>>>>>>>
>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
>>>>>>>>
>>>>>>>> That is usually a symptom of corosync getting into a horribly confused state.
>>>>>>>> Version? Distro? Have you checked for an update?
>>>>>>>> Odd that the user of all that CPU isn't showing up though.
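For context on those log lines: crm_cs_flush is Pacemaker reporting that it tried to flush its queue of CPG messages, corosync answered "Try again" (error 6, CS_ERR_TRY_AGAIN), and the messages stayed queued for a later retry. A much-simplified sketch of that pattern in Python (illustrative only; the names here are hypothetical, and Pacemaker's real implementation is in C):

```python
# Simplified sketch of the retry pattern behind the crm_cs_flush log lines.
# Status values mirror corosync's cs_error_t codes.

CS_OK = 1             # corosync "success" status
CS_ERR_TRY_AGAIN = 6  # the "Try again (6)" seen in the logs

def crm_cs_flush_sketch(queue, send):
    """Try to send queued CPG messages; stop (and keep the rest queued)
    as soon as the send callback reports TRY_AGAIN."""
    sent = 0
    rc = CS_OK
    while queue:
        rc = send(queue[0])
        if rc == CS_ERR_TRY_AGAIN:
            # corosync is in its sync state: leave the message queued;
            # a later timer fires and flushes again, hence the repeated logs
            break
        queue.pop(0)
        sent += 1
    print(f"Sent {sent} CPG messages ({len(queue)} remaining): rc={rc}")
    return sent
```

When corosync never leaves the sync state, every retry sends nothing and the "Sent 0 CPG messages (N remaining)" line repeats forever, which is exactly the flood described above.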
>>>>>>>
>>>>>>> As I wrote, I use Ubuntu trusty; the exact package versions are:
>>>>>>>
>>>>>>> corosync 2.3.0-1ubuntu5
>>>>>>> pacemaker 1.1.10+git20130802-1ubuntu2
>>>>>>
>>>>>> Ah sorry, I seem to have missed that part.
>>>>>>
>>>>>>> There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue.
>>>>>>>
>>>>>>> What do you recommend?
>>>>>>
>>>>>> The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago).
>>>>>> If you do build from source, it's probably best to go with v1.4.6
>>>>>
>>>>> Hm, I am a bit confused here. We are using 2.3.0,
>>>>
>>>> I swapped the 2 for a 1 somehow. A bit distracted, sorry.
>>>
>>> I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but it is still the same issue - after some time the CPU gets to 100%, and the corosync log is flooded with messages like:
>>>
>>> Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>> Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>> Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>> Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>> Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
>>> Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
>>>
>>
>> Attila,
>>
>>> Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?
>>
>> First of all, the 1.x branch (flatiron) is maintained, so even though it looks like an old version, it's quite new. It contains more or less only bugfixes.
>
> OK - the next thing I will try will be to downgrade to 1.4.6, if the troubleshooting does not bring us closer.
> Actually, we have a couple of clusters running 1.4.2, but the stack is "openais", not corosync. Currently we use "corosync".
>
>> The 2.x branch (needle) contains not only bugfixes but also new features.
>>
>> Keep in mind that with 1.x you need to use cman as the quorum provider (2.x contains quorum in the base).
>>
>> There are no big differences in the build.
>>
>> But back to your original question. Of course troubleshooting is always better.
>>
>> The "Try again" error (6) happens when corosync is in the sync state. This occurs when a NEW node is discovered or there is a network split/merge, and it usually takes only a few milliseconds. Usually the problem you are hitting is caused by some network issue.
>
> I can confirm this. The 100% CPU issue happens when I restart one of the nodes. It seems that it is happening when a given node comes back up and a new membership is about to be formed.
>
>> So first of all take a look at corosync.log (/var/log/cluster/corosync.log). Do you see some warning/error there?
>
> Not really. I reproduced a case so you can see for yourself.
> Initially I had a stable cluster.
> At 10:42:39 I did a reboot on the "ctsip1" node. All was fine until the node came back up (around 10:43:00). At this point, the CPU usage went to 100% and corosync stopped working properly.
>
> Here is the relevant corosync.log: http://pastebin.com/HJENEdZj
Is that log file somehow continued? I mean, the interesting part is:

Mar 12 10:43:00 [973] ctdb2 corosync notice [TOTEM ] A new membership (10.9.1.3:1592) was formed. Members joined: 168362281

That means a new membership was formed and sync is now running, but there is no counterpart (which would look like:

Mar 12 10:42:40 [973] ctdb2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.

).

Can you please try to remove rrp and use pure srp? Also, can you please try to set "debug: on" in corosync.conf and then paste the full corosync.log?

Regards,
  Honza

>
>> What transport are you using? Multicast (udp) or unicast (udpu)?
>>
>> Can you please paste your corosync.conf?
>
> We use udpu, since the servers are in different subnets and multicast did not work as expected. (In our other systems we use multicast.)
>
> The corosync.conf is at: http://pastebin.com/dMivQJn5
>
> Thank you in advance,
>
> Regards,
> Attila
>
>> Regards,
>> Honza
>>
>>> Thank you in advance.
>>>
>>>>> which was released approx. a year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old version.
>>>>> Could you please clarify a bit? :)
>>>>> Lars recommends the 2.3.3 git tree.
>>>>>
>>>>> I might end up trying both, but just want to make sure I am not misunderstanding something badly.
>>>>>
>>>>> Thank you!
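For reference, a single-ring (srp) udpu configuration with debugging enabled could look roughly like the following. The addresses and node list here are placeholders for illustration only, not the actual configuration behind the pastebin link:

```
totem {
    version: 2
    transport: udpu
    # single ring only: no rrp_mode and no second interface block
    interface {
        ringnumber: 0
        bindnetaddr: 10.9.1.0        # placeholder network address
    }
}

nodelist {
    node {
        ring0_addr: 10.9.1.3         # placeholder member addresses
    }
    node {
        ring0_addr: 10.9.1.4
    }
}

logging {
    debug: on                        # verbose logging, as requested above
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
}
```

Dropping rrp_mode and the second interface block is what "pure srp" means here: one ring, one network path, which rules the rrp code out as a variable while debugging.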
>>>>>
>>>>>>>
>>>>>>>>> HTOP shows something like this (sorted by TIME+ descending):
>>>>>>>>>
>>>>>>>>> 1 [|||||||||||||||||||||||||||||||||||||||| 100.0%] Tasks: 59, 4 thr; 2 running
>>>>>>>>> 2 [|                                          0.7%] Load average: 1.00 0.99 1.02
>>>>>>>>> Mem[||||||||||||||||||||||||||||||||     165/994MB] Uptime: 1 day, 10:22:03
>>>>>>>>> Swp[                                        0/509MB]
>>>>>>>>>
>>>>>>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+ Command
>>>>>>>>>   921 root       20   0  188M 49220 33856 R  0.0  4.8 3h33:58 /usr/sbin/corosync
>>>>>>>>>  1277 snmp       20   0 45708  4248  1472 S  0.0  0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>>>>>>>  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6 1:12.71 /usr/lib/pacemaker/cib
>>>>>>>>>  1312 root       20   0  104M  7484  3780 S  0.0  0.7 0:38.06 /usr/lib/pacemaker/stonithd
>>>>>>>>>  1611 root       -2   0  4408  2356  2000 S  0.0  0.2 0:24.15 /usr/sbin/watchdog
>>>>>>>>>  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0 0:22.62 /usr/lib/pacemaker/crmd
>>>>>>>>>  1313 root       20   0 81784  3800  2876 S  0.0  0.4 0:18.64 /usr/lib/pacemaker/lrmd
>>>>>>>>>  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4 0:16.01 /usr/lib/pacemaker/attrd
>>>>>>>>>  1309 root       20   0  104M  4804  2580 S  0.0  0.5 0:15.56 pacemakerd
>>>>>>>>>  1250 root       20   0 33000  1192   928 S  0.0  0.1 0:13.59 ha_logd: read process
>>>>>>>>>  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3 0:13.25 /usr/lib/pacemaker/pengine
>>>>>>>>>  1252 root       20   0 33000   712   456 S  0.0  0.1 0:13.03 ha_logd: write process
>>>>>>>>>  1835 ntp        20   0 27216  1980  1408 S  0.0  0.2 0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>>>>>>>>>   899 root       20   0 19168   700   488 S  0.0  0.1 0:09.75 /usr/sbin/irqbalance
>>>>>>>>>  1642 root       20   0 30696  1556   912 S  0.0  0.2 0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>>>>>>>>>  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7 0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>  3079 root        0 -20 16864  4592  3508 S  0.0  0.5 0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>>>>>>>>>   445 syslog     20   0  249M  6276   976 S  0.0  0.6 0:01.16 rsyslogd
>>>>>>>>>  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7 0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>     1 root       20   0 33376  2632  1404 S  0.0  0.3 0:00.63 /sbin/init
>>>>>>>>>   453 syslog     20   0  249M  6276   976 S  0.0  0.6 0:00.63 rsyslogd
>>>>>>>>>   451 syslog     20   0  249M  6276   976 S  0.0  0.6 0:00.53 rsyslogd
>>>>>>>>>  4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>  4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8 0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>  4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8 0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>> 23315 root       20   0 24872  2476  1412 R  0.7  0.2 0:00.37 htop
>>>>>>>>>  4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0 0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>>>>>>>>>
>>>>>>>>> My questions:
>>>>>>>>> - Is this a corosync or pacemaker issue?
>>>>>>>>> - What are the CPG messages? Is it possible that we have a firewall issue?
>>>>>>>>>
>>>>>>>>> Any hints would be great!
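On the firewall question: corosync's totem traffic is plain UDP, by default on port 5405 (the mcastport setting), and corosync commonly uses mcastport - 1 (5404) as well; with udpu the same ports are used between every pair of members. A firewall between the subnets that drops or rate-limits this traffic could plausibly leave corosync stuck in sync. A sketch of the corresponding iptables rules, assuming default ports (adjust if corosync.conf sets a different mcastport):

```
# allow corosync totem traffic (default UDP ports 5404-5405)
iptables -A INPUT  -p udp --dport 5404:5405 -j ACCEPT
iptables -A OUTPUT -p udp --dport 5404:5405 -j ACCEPT
```

With two rings, the rules need to cover the interfaces of both rings.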
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Attila
>>>>>>>>> _______________________________________________
>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>
>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>> Bugs: http://bugs.clusterlabs.org