> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 11, 2014 10:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
>
> On 12 Mar 2014, at 1:54 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:
>
> >>
> >> -----Original Message-----
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 11, 2014 12:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri <amegy...@minerva-soft.com> wrote:
> >>
> >>> Thanks for the quick response!
> >>>
> >>>> -----Original Message-----
> >>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>>
> >>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>> From time to time, something unexpected happens and suddenly the crm_mon output remains static.
> >>>>> When I check the CPU usage, I see that one of the cores uses 100% CPU, but I cannot actually match it to either corosync or one of the pacemaker processes.
> >>>>>
> >>>>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>>>> I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most of the cases; usually a kill -9 is needed.
> >>>>>
> >>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>
> >>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
> >>>>>
> >>>>> Logs are usually flooded with CPG related messages, such as:
> >>>>>
> >>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> >>>>>
> >>>>> OR
> >>>>>
> >>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
> >>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
> >>>>> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
> >>>>
> >>>> That is usually a symptom of corosync getting into a horribly confused state.
> >>>> Version? Distro? Have you checked for an update?
> >>>> Odd that the user of all that CPU isn't showing up though.
> >>>>
> >>>
> >>> As I wrote, I use Ubuntu trusty; the exact package versions are:
> >>>
> >>> corosync 2.3.0-1ubuntu5
> >>> pacemaker 1.1.10+git20130802-1ubuntu2
> >>
> >> Ah sorry, I seem to have missed that part.
> >>
> >>>
> >>> There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue.
> >>>
> >>> What do you recommend?
> >>
> >> The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago).
> >> If you do build from source, it's probably best to go with v1.4.6
> >
> > Hm, I am a bit confused here. We are using 2.3.0,
>
> I swapped the 2 for a 1 somehow. A bit distracted, sorry.

I upgraded all nodes to 2.3.3. At first it seemed a bit better, but the same issue is still there: after some time CPU usage reaches 100%, and the corosync log is flooded with messages like:

Mar 12 07:36:55 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)
Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2 cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6)

Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting?

Thank you in advance.

>
> > which was released approx. a year ago (you mention 3 years) and you recommend 1.4.6, which is a rather old version.
> > Could you please clarify a bit? :)
> > Lars recommends 2.3.3 git tree.
> >
> > I might end up trying both, but just want to make sure I am not misunderstanding something badly.
> >
> > Thank you!
> >
> >>>>>
> >>>>> HTOP shows something like this (sorted by TIME+ descending):
> >>>>>
> >>>>>   1 [||||||||||||||||||||||||||||||||||||||||100.0%]   Tasks: 59, 4 thr; 2 running
> >>>>>   2 [|                                         0.7%]   Load average: 1.00 0.99 1.02
> >>>>>   Mem[||||||||||||||||||||||||||||||||    165/994MB]   Uptime: 1 day, 10:22:03
> >>>>>   Swp[                                      0/509MB]
> >>>>>
> >>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
> >>>>>   921 root       20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
> >>>>>  1277 snmp       20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
> >>>>>  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
> >>>>>  1312 root       20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
> >>>>>  1611 root       -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
> >>>>>  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
> >>>>>  1313 root       20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
> >>>>>  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
> >>>>>  1309 root       20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
> >>>>>  1250 root       20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
> >>>>>  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
> >>>>>  1252 root       20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
> >>>>>  1835 ntp        20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
> >>>>>   899 root       20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
> >>>>>  1642 root       20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
> >>>>>  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>  3079 root        0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
> >>>>>   445 syslog     20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
> >>>>>  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>     1 root       20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
> >>>>>   453 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
> >>>>>   451 syslog     20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
> >>>>>  4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>  4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>  4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>> 23315 root       20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
> >>>>>  4367 kamailio   20   0  291M 10000  4864 S  0.0  1.0  0:00.36 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> >>>>>
> >>>>>
> >>>>> My questions:
> >>>>> - Is this a corosync or pacemaker issue?
> >>>>> - What are the CPG messages? Is it possible that we have a firewall issue?
> >>>>>
> >>>>>
> >>>>> Any hints would be great!
> >>>>>
> >>>>> Thanks,
> >>>>> Attila
> >>>>> _______________________________________________
> >>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> >>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>
> >>>>> Project Home: http://www.clusterlabs.org
> >>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>> Bugs: http://bugs.clusterlabs.org