Re: [Pacemaker] Pacemaker/corosync freeze
One more thing to add. I did an apt-get upgrade on one of the nodes, and then restarted the node. It resulted in this state on all other nodes again...

-----Original Message-----
From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Friday, March 07, 2014 7:54 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

Thanks for the quick response!

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Friday, March 07, 2014 3:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 7 Mar 2014, at 5:31 am, Attila Megyeri <amegy...@minerva-soft.com> wrote:

> Hello,
>
> We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the CPU usage, I see that one of the cores is at 100%, but I cannot actually match it to either corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes.
>
> I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most cases; usually a kill -9 is needed.
>
> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>
> Logs are usually flooded with CPG-related messages, such as:
>
> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
> Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
>
> OR
>
> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
> Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (

That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update?
Odd that the user of all that CPU isn't showing up though.

As I wrote, I use Ubuntu trusty; the exact package versions are:

corosync 2.3.0-1ubuntu5
pacemaker 1.1.10+git20130802-1ubuntu2

There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure it would get rid of this issue.
What do you recommend?
htop shows something like this (sorted by TIME+, descending):

  1  [100.0%]         Tasks: 59, 4 thr; 2 running
  2  [|  0.7%]        Load average: 1.00 0.99 1.02
  Mem[  165/994MB]    Uptime: 1 day, 10:22:03
  Swp[    0/509MB]

   PID USER       PRI NI  VIRT   RES    SHR   S CPU% MEM%  TIME+    Command
   921 root        20  0  188M   49220  33856 R  0.0  4.8  3h33:58  /usr/sbin/corosync
  1277 snmp        20  0  45708  4248   1472  S  0.0  0.4  1:33.07  /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
  1311 hacluster   20  0  109M   16160  9640  S  0.0  1.6  1:12.71  /usr/lib/pacemaker/cib
  1312 root        20  0  104M   7484   3780  S  0.0  0.7  0:38.06  /usr/lib/pacemaker/stonithd
  1611 root        -2  0  4408   2356   2000  S  0.0  0.2  0:24.15  /usr/sbin/watchdog
  1316 hacluster   20  0  122M   9756   5924  S  0.0  1.0  0:22.62  /usr/lib/pacemaker/crmd
  1313 root        20  0  81784  3800   2876  S  0.0  0.4  0:18.64  /usr/lib/pacemaker/lrmd
  1314 hacluster   20  0  96616  4132   2604  S  0.0  0.4  0:16.01  /usr/lib/pacemaker/attrd
  1309 root        20  0  104M   4804   2580  S  0.0  0.5  0:15.56  pacemakerd
  1250 root        20  0  33000  1192   928   S  0.0  0.1  0:13.59  ha_logd: read process
  1315 hacluster   20  0  73892  2652   1952  S  0.0  0.3  0:13.25  /usr/lib/pacemaker/pengine
  1252 root        20  0  33000  712    456   S  0.0  0.1  0:13.03  ha_logd: write process
  1835 ntp         20  0  27216  1980   1408  S  0.0  0.2  0:11.80  /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
   899 root        20  0  19168  700    488   S  0.0  0.1  0:09.75  /usr/sbin/irqbalance
  1642 root        20  0  30696  1556   912   S  0.0  0.2  0:06.49  /usr/bin/monit -c /etc/monit/monitrc
  4374 kamailio    20  0
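For reference, the transport setup described above (udpu, two rings on Gigabit Ethernet, rrp_mode passive) would normally be expressed in a corosync 2.x configuration roughly like the sketch below; the networks, addresses and node list are placeholders, not the poster's actual configuration:

    totem {
        version: 2
        transport: udpu            # unicast UDP, as described above
        rrp_mode: passive          # redundant ring protocol, passive mode

        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0  # placeholder network for ring 0
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.0.1.0  # placeholder network for ring 1
        }
    }

    nodelist {
        node {
            ring0_addr: 10.0.0.1   # placeholder addresses; one node {} block
            ring1_addr: 10.0.1.1   # per cluster node (7 in this cluster)
        }
    }

    quorum {
        provider: corosync_votequorum
    }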
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On Fri, 07 Mar 2014 10:30:13 +0300 Vladislav Bogdanov <bub...@hoster-ok.com> wrote:

> > > Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff
> > > calculates an incorrect diff digest. If I replace the digest in the diff by hand
> > > with what the cib calculates, it applies correctly as expected. Otherwise: -206.
> >
> > More details?
>
> Hmmm... seems to be crmsh-specific; cannot reproduce with pure-XML editing.
>
> Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

No, that commit fixes an issue when importing the CIB into crmsh; the diff calculation happens when going the other way.

It seems strange that crmsh should be causing such a problem: all it does is call crm_diff to generate the actual diff, so any problem with an incorrect digest should be coming from crm_diff. This is not an issue known to me, and it doesn't sound like the same problem I have been investigating.

Could you file a bug at https://savannah.nongnu.org/bugs/?group=crmsh with some more details?

Thank you,

--
// Kristoffer Grönlund
// kgronl...@suse.com
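The round trip under discussion — generating a patch with crm_diff and applying it, at which point the digest is verified — can be reproduced outside crmsh with something like the following sketch; the file names are placeholders:

    # dump the current CIB and make a modified copy
    cibadmin --query > cib-orig.xml
    cp cib-orig.xml cib-new.xml
    # ... edit cib-new.xml ...

    # generate an XML patch (including the digest) between the two versions
    crm_diff --original cib-orig.xml --new cib-new.xml > patch.xml

    # apply the patch; a digest mismatch shows up as the kind of failure
    # reported above as -206
    cibadmin --patch --xml-file patch.xml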
Re: [Pacemaker] pacemaker with cman and drbd when the primary node panics or is powered off
So I fixed the problem regarding the hostname in drbd.conf and the node name from the cluster's point of view. I also configured and verified the fence_vmware agent and enabled stonith.

Changed the drbd resource configuration to:

resource ovirt {
        disk {
                disk-flushes no;
                md-flushes no;
                fencing resource-and-stonith;
        }
        device minor 0;
        disk /dev/sdb;
        syncer {
                rate 30M;
                verify-alg md5;
        }
        handlers {
                fence-peer /usr/lib/drbd/crm-fence-peer.sh;
                after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
        }
}

Put in cluster.conf:

<cman expected_votes="1" two_node="1"/>

and restarted pacemaker and cman on the nodes. The service was active on ovirteng01.

I provoked a power-off of ovirteng01. The fencing agent worked OK on ovirteng02 and rebooted it. I stopped the boot of ovirteng01 at the grub prompt to simulate a problem during boot (for example, the system dropping to console mode due to a filesystem problem).

In the meantime ovirteng02 becomes master of the drbd resource, but doesn't start the group.

This is in messages:

Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: PingAck did not arrive in time.
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: asender terminated
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: Terminating drbd_a_ovirt
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: Connection closed
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: conn( NetworkFailure -> Unconnected )
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: receiver terminated
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: Restarting receiver thread
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: receiver (re)started
Mar 8 01:08:00 ovirteng02 kernel: drbd ovirt: conn( Unconnected -> WFConnection )
Mar 8 01:08:02 ovirteng02 corosync[12908]: [TOTEM ] A processor failed, forming new configuration.
Mar 8 01:08:04 ovirteng02 corosync[12908]: [QUORUM] Members[1]: 2
Mar 8 01:08:04 ovirteng02 corosync[12908]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 8 01:08:04 ovirteng02 corosync[12908]: [CPG ] chosen downlist: sender r(0) ip(192.168.33.46) ; members(old:2 left:1)
Mar 8 01:08:04 ovirteng02 corosync[12908]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 8 01:08:04 ovirteng02 kernel: dlm: closing connection to node 1
Mar 8 01:08:04 ovirteng02 crmd[13168]: notice: crm_update_peer_state: cman_event_callback: Node ovirteng01.localdomain.local[1] - state is now lost (was member)
Mar 8 01:08:04 ovirteng02 crmd[13168]: warning: reap_dead_nodes: Our DC node (ovirteng01.localdomain.local) left the cluster
Mar 8 01:08:04 ovirteng02 crmd[13168]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Mar 8 01:08:04 ovirteng02 crmd[13168]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 8 01:08:04 ovirteng02 fenced[12962]: fencing node ovirteng01.localdomain.local
Mar 8 01:08:04 ovirteng02 attrd[13166]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 01:08:04 ovirteng02 attrd[13166]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-OvirtData (1)
Mar 8 01:08:04 ovirteng02 attrd[13166]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Mar 8 01:08:04 ovirteng02 fence_pcmk[13733]: Requesting Pacemaker fence ovirteng01.localdomain.local (reset)
Mar 8 01:08:04 ovirteng02 stonith_admin[13734]: notice: crm_log_args: Invoked: stonith_admin --reboot ovirteng01.localdomain.local --tolerance 5s --tag cman
Mar 8 01:08:04 ovirteng02 stonith-ng[13164]: notice: handle_request: Client stonith_admin.cman.13734.5528351f wants to fence (reboot) 'ovirteng01.localdomain.local' with device '(any)'
Mar 8 01:08:04 ovirteng02 stonith-ng[13164]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for ovirteng01.localdomain.local: 1e70a341-efbf-470a-bcaa-886a8acfa9d1 (0)
Mar 8 01:08:04 ovirteng02 stonith-ng[13164]: notice: can_fence_host_with_device: Fencing can fence ovirteng01.localdomain.local (aka. 'ovirteng01'): static-list
Mar 8 01:08:04 ovirteng02 stonith-ng[13164]: notice: can_fence_host_with_device: Fencing can fence ovirteng01.localdomain.local (aka. 'ovirteng01'): static-list
Mar 8 01:08:05 ovirteng02 pengine[13167]: notice: unpack_config: On loss of CCM Quorum: Ignore
Mar 8 01:08:05 ovirteng02 pengine[13167]: warning: pe_fence_node: Node ovirteng01.localdomain.local will be fenced because the node is no longer part of the cluster
Mar 8 01:08:05 ovirteng02 pengine[13167]: warning: determine_online_status: Node ovirteng01.localdomain.local is unclean
Mar 8 01:08:05 ovirteng02 pengine[13167]: warning: custom_action: Action OvirtData:0_demote_0 on ovirteng01.localdomain.local is unrunnable (offline)
Mar 8 01:08:05 ovirteng02
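For completeness: the fence_pcmk/stonith_admin path in the log above only succeeds if Pacemaker has a working stonith device for the node being fenced. A rough sketch of such a device in crm shell syntax follows, using fence_vmware_soap as the concrete agent (the poster mentions the fence_vmware agent, which takes similar options); the vCenter address, credentials and VM name are placeholders:

    # stonith device able to reset ovirteng01 via vCenter (placeholder values)
    primitive fence-eng01 stonith:fence_vmware_soap \
            params ipaddr="vcenter.example.local" login="fenceuser" passwd="secret" \
                   ssl="1" port="ovirteng01" \
                   pcmk_host_list="ovirteng01.localdomain.local" \
            op monitor interval="60s"
    # keep the device off the node it is meant to fence
    location l-fence-eng01 fence-eng01 -inf: ovirteng01.localdomain.local

The pcmk_host_list setting matches the "static-list" host check visible in the stonith-ng messages above.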