Re: [Pacemaker] pre_notify_demote is issued twice
Hi,

2014-02-24 10:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:
On 21 Feb 2014, at 2:19 pm, Andrew Beekhof and...@beekhof.net wrote:
On 18 Feb 2014, at 1:23 pm, Andrew Beekhof and...@beekhof.net wrote:
On 6 Feb 2014, at 7:45 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:

Hi,

I observed that pre_notify_demote is issued twice when a master resource is migrating. I'm wondering whether this is the correct behavior.

Steps to reproduce:
- Start up a 2-node cluster configured for PostgreSQL streaming replication, using the pgsql RA as a master/slave resource.
- Kill the postgresql process on the master node to induce a fail-over.
- The fail-over succeeds as expected, but pre_notify_demote is executed twice on each node before the demote of the master resource.

This is 100% reproducible on my cluster.

Pacemaker version: 1.1.11-rc4 (built from the source repository)
OS: RHEL 6.4

I have never seen this on a Pacemaker 1.0.* cluster with the same configuration. The relevant logs and pe-inputs are attached.

Diagnostics:

(1) The first transition, caused by the process failure (pe-input-160), initiates pre_notify_demote on both nodes and cancels the slave monitor on the slave node.

{{{
Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 9: cancel prmPostgresql_cancel_1 on rhel64-2
Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)
Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
}}}

(2) When the cancellation of the slave monitor completes, the transition is aborted by "Resource op removal".
{{{
Jan 30 16:08:59 rhel64-1 crmd[8143]: info: match_graph_event: Action prmPostgresql_monitor_1 (9) confirmed on rhel64-2 (rc=0)
Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, version=0.37.9)
Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, tag=lrm_rsc_op, id=prmPostgresql_monitor_1, magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : Resource op removal
}}}

(3) The second transition, calculated because of the abort (pe-input-161), initiates pre_notify_demote again.

If the demote didn't complete (or wasn't even attempted), then unfortunately we must send the pre_notify_demote again. The real bug may well be that the transition shouldn't have been aborted.

It looks legitimate:

Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, tag=lrm_rsc_op, id=prmPostgresql_monitor_1, magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : Resource op removal

It looks like get_cancel_action() was not functioning correctly: https://github.com/beekhof/pacemaker/commit/9d77c99

Thanks for looking into it. I have confirmed that the issue is now resolved with the recent revision in your repo, at: https://github.com/beekhof/pacemaker/commit/04ff1bd2d144e7defd6f1f67f6bde6fa95c428e1

Thanks!
-- Keisuke MORI

Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, version=0.37.9)

It looks like part of the node status entry for rhel64-2 is being removed.
Possibly as a result of:

Jan 30 16:07:54 rhel64-2 crmd[25070]: info: erase_status_tag: Deleting xpath: //node_state[@uname='rhel64-2']/transient_attributes

The new cib code, being much faster, might help here too :)

{{{
Jan 30 16:09:01 rhel64-1 pengine[8142]: notice: process_pe_message: Calculated Transition 15: /var/lib/pacemaker/pengine/pe-input-161.bz2
Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)
Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
}}}

I think that the transition abort at (2) should not happen.

Regards,
-- Keisuke MORI

logs-pre-notify-20140206.tar.bz2

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
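As a quick sanity check, the duplicated notifications can be counted straight from the crmd log. Below is a minimal shell sketch; the sample file is built from the log lines quoted above, and the /tmp path is an illustrative assumption (on a real node you would grep the actual system log instead):

```shell
# Build a sample from the crmd log lines quoted in this thread; on a real
# cluster, point the grep at /var/log/messages (or wherever crmd logs).
cat > /tmp/crmd-sample.log <<'EOF'
Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)
Jan 30 16:08:59 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)
Jan 30 16:09:01 rhel64-1 crmd[8143]: notice: te_rsc_command: Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
EOF

# One pre_notify_demote per node is expected per demote; a count of 2 per
# node shows the duplication discussed in this thread.
grep -o 'pre_notify_demote_0 on [^ ]*' /tmp/crmd-sample.log | sort | uniq -c
```

With a healthy transition the count would be 1 per node; here both rhel64-1 and rhel64-2 show 2.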
Re: [Pacemaker] Pacemaker/corosync freeze
On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

Thanks for the quick response!

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Friday, March 07, 2014 3:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote:

Hello,

We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the CPU usage, I see that one of the cores is at 100%, but I cannot actually match it to either corosync or one of the pacemaker processes. In such cases, this high CPU usage occurs on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most cases; usually a kill -9 is needed.

Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit Ethernet, rrp_mode passive.

Logs are usually flooded with CPG-related messages, such as:

Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)

or:

Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (

That is usually a symptom of corosync getting into a horribly confused state.
Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though.

As I wrote, I use Ubuntu trusty; the exact package versions are:

corosync 2.3.0-1ubuntu5
pacemaker 1.1.10+git20130802-1ubuntu2

Ah sorry, I seem to have missed that part.

There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure it would get rid of this issue. What do you recommend?

The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, it's probably best to go with v1.4.6.

htop shows something like this (sorted by TIME+ descending):

  1  [100.0%]        Tasks: 59, 4 thr; 2 running
  2  [  0.7%]        Load average: 1.00 0.99 1.02
  Mem[ 165/994MB]    Uptime: 1 day, 10:22:03
  Swp[   0/509MB]

  PID USER      PRI NI  VIRT   RES   SHR  S CPU% MEM% TIME+    Command
  921 root       20  0  188M  49220 33856 R 0.0  4.8  3h33:58  /usr/sbin/corosync
 1277 snmp       20  0  45708  4248  1472 S 0.0  0.4  1:33.07  /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
 1311 hacluster  20  0  109M  16160  9640 S 0.0  1.6  1:12.71  /usr/lib/pacemaker/cib
 1312 root       20  0  104M   7484  3780 S 0.0  0.7  0:38.06  /usr/lib/pacemaker/stonithd
 1611 root       -2  0  4408   2356  2000 S 0.0  0.2  0:24.15  /usr/sbin/watchdog
 1316 hacluster  20  0  122M   9756  5924 S 0.0  1.0  0:22.62  /usr/lib/pacemaker/crmd
 1313 root       20  0  81784  3800  2876 S 0.0  0.4  0:18.64  /usr/lib/pacemaker/lrmd
 1314 hacluster  20  0  96616  4132  2604 S 0.0  0.4  0:16.01  /usr/lib/pacemaker/attrd
 1309 root       20  0  104M   4804  2580 S 0.0  0.5  0:15.56  pacemakerd
 1250 root       20  0  33000  1192   928 S 0.0  0.1  0:13.59  ha_logd: read process
 1315 hacluster  20  0  73892  2652  1952 S 0.0  0.3  0:13.25  /usr/lib/pacemaker/pengine
 1252 root       20  0  33000   712   456 S 0.0  0.1  0:13.03  ha_logd: write process
 1835 ntp        20  0  27216  1980  1408 S 0.0  0.2  0:11.80  /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
  899 root       20  0  19168   700   488 S 0.0  0.1  0:09.75  /usr/sbin/irqbalance
 1642 root       20  0  30696  1556   912 S 0.0  0.2  0:06.49  /usr/bin/monit -c /etc/monit/monitrc
 4374 kamailio   20  0  291M   7272  2188 S 0.0  0.7  0:02.77  /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 3079 root        0 -20  16864  4592  3508 S 0.0  0.5  0:01.51  /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
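When the logs are flooded like this, a per-daemon count of the crm_cs_flush retry messages can show which process's CPG connection to corosync is backed up. A minimal sketch over sample lines built from the flood quoted above (the /tmp path and log location are assumptions; on Ubuntu the messages would typically be in /var/log/syslog):

```shell
# Sample of the crm_cs_flush flood; on a real node, point the awk at the
# actual corosync/pacemaker log file instead of this hand-made sample.
cat > /tmp/cpg-sample.log <<'EOF'
Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (6)
EOF

# Count retries per daemon (field 6 is the daemon name in this format): a
# large, growing count for one process suggests its CPG connection is stuck.
awk '/crm_cs_flush: Sent 0 CPG messages/ { print $6 }' /tmp/cpg-sample.log | sort | uniq -c
```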
Re: [Pacemaker] ordering cloned resources
On 9 Mar 2014, at 10:36 pm, Alexandre alxg...@gmail.com wrote:

So..., it appears the problem doesn't come from the primitive but from the cloned resource. If I use the primitive instead of the clone in the order constraint (thus deleting the clone and the group), the second resource of the constraint starts up as expected. Any idea why?

Not without logs.

Should I upgrade this pretty old version of pacemaker?

Yes :)

2014-03-08 10:36 GMT+01:00 Alexandre alxg...@gmail.com:

Hi Andrew,

I have tried to stop and start the first resource of the ordering constraint (cln_san), hoping it would trigger a start attempt of the second resource of the constraint (cln_mailstore). I tailed the syslog on the node where I was expecting the second resource to start, but really nothing appeared in those logs (I grepped for 'pengine' as per your suggestion).

I have done another test where I replaced the first resource of the ordering constraint with a very simple primitive (an lsb resource), and in that case it worked. I am wondering if the issue doesn't come from the rather complicated first resource: it is a cloned group which contains a primitive with conditional instance attributes... Are you aware of any specific issue in pacemaker 1.1.7 with this kind of resource?

I will try to simplify the resources by getting rid of the conditional instance attributes and try again. In the meantime I'd be delighted to hear what you guys think about that.

Regards,
Alex.

2014-03-07 4:21 GMT+01:00 Andrew Beekhof and...@beekhof.net:
On 3 Mar 2014, at 3:56 am, Alexandre alxg...@gmail.com wrote:

Hi,

I am setting up a cluster on debian wheezy. I have installed pacemaker using the debian-provided packages (so I am running 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff). I have roughly 10 nodes, among which some act as SANs (exporting block devices using the AoE protocol) and others act as initiators (they are actually mail servers, storing emails on the exported devices).
Below are the defined resources for those nodes:

xml <primitive class="ocf" id="pri_aoe1" provider="heartbeat" type="AoEtarget"> \
      <instance_attributes id="pri_aoe1.1-instance_attributes"> \
        <rule id="node-sanaoe01" score="1"> \
          <expression attribute="#uname" id="expr-node-sanaoe01" operation="eq" value="sanaoe01"/> \
        </rule> \
        <nvpair id="pri_aoe1.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
        <nvpair id="pri_aoe1.1-instance_attributes-nic" name="nic" value="eth0"/> \
        <nvpair id="pri_aoe1.1-instance_attributes-shelf" name="shelf" value="1"/> \
        <nvpair id="pri_aoe1.1-instance_attributes-slot" name="slot" value="1"/> \
      </instance_attributes> \
      <instance_attributes id="pri_aoe2.1-instance_attributes"> \
        <rule id="node-sanaoe02" score="2"> \
          <expression attribute="#uname" id="expr-node-sanaoe2" operation="eq" value="sanaoe02"/> \
        </rule> \
        <nvpair id="pri_aoe2.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
        <nvpair id="pri_aoe2.1-instance_attributes-nic" name="nic" value="eth1"/> \
        <nvpair id="pri_aoe2.1-instance_attributes-shelf" name="shelf" value="2"/> \
        <nvpair id="pri_aoe2.1-instance_attributes-slot" name="slot" value="1"/> \
      </instance_attributes> \
    </primitive>
primitive pri_dovecot lsb:dovecot \
    op start interval=0 timeout=20 \
    op stop interval=0 timeout=30 \
    op monitor interval=5 timeout=10
primitive pri_spamassassin lsb:spamassassin \
    op start interval=0 timeout=50 \
    op stop interval=0 timeout=60 \
    op monitor interval=5 timeout=20
group grp_aoe pri_aoe1
group grp_mailstore pri_dlm pri_clvmd pri_spamassassin pri_dovecot
clone cln_mailstore grp_mailstore \
    meta ordered=false interleave=true clone-max=2
clone cln_san grp_aoe \
    meta ordered=true interleave=true clone-max=2

As I am in opt-in cluster mode (symmetric-cluster=false), I have the location constraints below for those hosts:

location LOC_AOE_ETHERD_1 cln_san inf: sanaoe01
location LOC_AOE_ETHERD_2 cln_san inf: sanaoe02
location LOC_MAIL_STORE_1 cln_mailstore inf: ms01
location LOC_MAIL_STORE_2 cln_mailstore inf: ms02

So far so good.
I want to make sure the initiators won't try to search for exported devices before the targets have actually exported them. To do so, I thought I could use the following ordering constraint:

order ORD_SAN_MAILSTORE inf: cln_san cln_mailstore

Unfortunately, if I add this constraint, the clone set cln_mailstore never starts (and even stops if it was already started when I add the constraint). Is there something wrong with this ordering rule? Where can I find information on what's going on?

No errors in the logs? If you grep for 'pengine', does it want to start them or just leave them stopped?
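To answer the grep-for-'pengine' question concretely: the scheduler logs one decision per resource per transition on the DC node. Below is a small sketch over hand-made sample lines; both the exact message wording (which varies between pacemaker versions) and the log path are assumptions:

```shell
# Sample of 1.1-era pengine "LogActions" decisions; on the DC you would run
# something like: grep -e 'pengine.*LogActions' /var/log/syslog
cat > /tmp/pengine-sample.log <<'EOF'
Mar  8 10:36:01 ms01 pengine: [2345]: notice: LogActions: Start   pri_dovecot:0 (ms01)
Mar  8 10:36:01 ms01 pengine: [2345]: notice: LogActions: Leave   pri_spamassassin:0 (Stopped)
EOF

# "Start" means the pengine scheduled the resource; "Leave ... (Stopped)"
# means it deliberately left it stopped, which points at a constraint issue.
grep -E 'LogActions: (Start|Leave)' /tmp/pengine-sample.log
```

No output at all for a resource usually means the pengine never considered it runnable on that node in the first place.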
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 7 Mar 2014, at 5:35 pm, Yusuke Iida yusk.i...@gmail.com wrote:

Hi, Andrew

2014-03-07 11:43 GMT+09:00 Andrew Beekhof and...@beekhof.net:

I don't understand... crm_mon doesn't look for changes to resources or constraints, and it should already be using the new, faster diff format.

[/me reads attachment]

Ah, but perhaps I do understand after all :-)

This is repeated over and over:

notice: crm_diff_update: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
notice: xml_patch_version_check: Current num_updates is too high (885 67)

That would certainly drive up CPU usage and cause crm_mon to get left behind. Happily, the fix for that should be: https://github.com/beekhof/pacemaker/commit/6c33820

I have confirmed that the cib refresh is no longer repeated when the versions differ. Thank you for taking care of it.

Now I see another problem. If "crm configure load update" is performed while crm_mon is running, information is no longer displayed. The information reappears once crm_mon is restarted.

I executed the following command and captured the log of crm_mon:

# crm_mon --disable-ncurses -VV > crm_mon.log 2>&1

Observing the cib inside crm_mon after the load was performed, two configuration sections exist. It seems this comes from the following operation, and the old section remains because deletion of the configuration section failed:

trace: cib_native_dispatch_internal: <cib-reply ...> <change operation="delete" path="/configuration"/>

The following is a debugging log acquired with an old pacemaker. The delete is not found because it tries to look up path=/configuration from the top of the document tree. Shouldn't the path essentially be path=/cib/configuration?

Yes. Could you send me the cib as well as the update you're trying to load?
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: (null)
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <cib epoch="2" num_updates="6" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Tue Mar 4 11:32:36 2014" update-origin="rhel64rpmbuild" update-client="crmd" have-quorum="1" dc-uuid="3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <configuration>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <crm_config>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <cluster_property_set id="cib-bootstrap-options">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-2dbaf19"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </cluster_property_set>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </crm_config>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nodes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <node id="3232261524" uname="rhel64rpmbuild"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </nodes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <resources/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <constraints/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </configuration>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <status>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <node_state id="3232261524" uname="rhel64rpmbuild" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <lrm id="3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <lrm_resources/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </lrm>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <transient_attributes id="3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <instance_attributes id="status-3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="status-3232261524-shutdown" name="shutdown" value="0"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="status-3232261524-probe_complete" name="probe_complete" value="true"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </instance_attributes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </transient_attributes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </node_state>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </status>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </cib>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /(null)

Is this an already-known problem? I attach a report taken at the time this occurred, together with the log of crm_mon.

- crm_report
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

I attach the CLI file which was loaded. Although the loaded xml does not exist as a file, I think from the log that it had the following form. This log is extracted from the following report:
https://drive.google.com/file/d/0BwMFJItoO-fVWEw4Qnp0aHIzSm8/edit?usp=sharing

Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1365 ) info: cib_perform_op: Diff: +++ 0.3.0 (null)
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1438 ) info: cib_perform_op: -- /configuration
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1431 ) info: cib_perform_op: + /cib: @epoch=3, @num_updates=0
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1387 ) info: cib_perform_op: ++ /cib: <configuration/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++   <crm_config>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++     <cluster_property_set id="cib-bootstrap-options">
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <nvpair name="startup-fencing" value="false" id="cib-bootstrap-options-startup-fencing"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <nvpair name="stonith-timeout" value="60s" id="cib-bootstrap-options-stonith-timeout"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <nvpair name="crmd-transition-delay" value="2s" id="cib-bootstrap-options-crmd-transition-delay"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++     </cluster_property_set>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++   </crm_config>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++   <nodes>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++     <node id="3232261508" uname="vm02"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++     <node id="3232261507" uname="vm01"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++   </nodes>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++   <resources>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++     <primitive id="prmDummy" class="ocf" provider="heartbeat" type="Dummy">
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <!--#primitive prmDummy1 ocf:heartbeat:Dummy-->
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <!--#location rsc_location-group1-1 group1 # rule 200: #uname eq vm01 # rule 100: #uname eq vm02-->
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <instance_attributes id="prmDummy-instance_attributes">
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++         <nvpair name="pgctl" value="/usr/bin/pg_ctl" id="prmDummy-instance_attributes-pgctl"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++         <nvpair name="start_opt" value="-p 5432 -h 192.168.xxx.xxx" id="prmDummy-instance_attributes-start_opt"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++         <nvpair name="psql" value="/usr/bin/psql" id="prmDummy-instance_attributes-psql"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++         <nvpair name="pgdata" value="/var/lib/pgsql/data" id="prmDummy-instance_attributes-pgdata"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++         <nvpair name="pgdba" value="postgres" id="prmDummy-instance_attributes-pgdba"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++         <nvpair name="pgport" value="5432" id="prmDummy-instance_attributes-pgport"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++         <nvpair name="pgdb" value="template1" id="prmDummy-instance_attributes-pgdb"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       </instance_attributes>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++       <operations>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info:
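For reference, the cluster properties visible in the diff above would correspond to a crm shell fragment roughly like the following (reconstructed from the log, so treat it as an approximation of the loaded CLI file rather than the file itself):

```
property $id="cib-bootstrap-options" \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    startup-fencing="false" \
    stonith-timeout="60s" \
    crmd-transition-delay="2s"
```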
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
I tried replacing pe-input-2.bz2 with pe-input-3.bz2 and saw:

# cp start.xml 1.xml; tools/cibadmin --replace --xml-file replace.xml -V
( cib_file.c:268 )   info: cib_file_perform_op_delegate: cib_replace on (null)
( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
( cib_ops.c:258 )    info: cib_process_replace: Replaced 0.2.14 with 0.5.7 from (null)
( cib_utils.c:408 ) trace: cib_perform_op: Inferring changes after cib_replace op
( xml.c:3957 )       info: __xml_diff_object: transient_attributes.3232261508 moved from 1 to 0 - 15
( xml.c:3957 )       info: __xml_diff_object: lrm.3232261508 moved from 0 to 1 - 7
...
( xml.c:1363 )       info: cib_perform_op: Diff: --- 0.2.14 2
( xml.c:1365 )       info: cib_perform_op: Diff: +++ 0.6.0 e89b8f8986ecf2dfd516fd48f1711fbf
( xml.c:1431 )       info: cib_perform_op: + /cib: @epoch=6, @num_updates=0, @cib-last-written=Fri Mar 7 13:24:14 2014
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="startup-fencing" value="false" id="cib-bootstrap-options-startup-fencing"/>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="stonith-timeout" value="60s" id="cib-bootstrap-options-stonith-timeout"/>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="crmd-transition-delay" value="2s" id="cib-bootstrap-options-crmd-transition-delay"/>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 )       info: cib_perform_op: ++   <!--#primitive prmDummy1 ocf:heartbeat:Dummy-->
( xml.c:1394 )       info: cib_perform_op: ++   <!--#location rsc_location-group1-1 group1 # rule 200: #uname eq vm01 # rule 100: #uname eq vm02-->
( xml.c:1394 )       info: cib_perform_op: ++   <instance_attributes id="prmDummy-instance_attributes">
...
( xml.c:1394 )       info: cib_perform_op: ++   </primitive>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy2" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 )       info: cib_perform_op: ++   <instance_attributes id="prmDummy2-instance_attributes">
...
( xml.c:1394 )       info: cib_perform_op: ++   </primitive>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy3" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 )       info: cib_perform_op: ++   <instance_attributes id="prmDummy3-instance_attributes">
...
( xml.c:1394 )       info: cib_perform_op: ++   </primitive>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy4" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 )       info: cib_perform_op: ++   <instance_attributes id="prmDummy4-instance_attributes">
...
( xml.c:1394 )       info: cib_perform_op: ++   </primitive>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/configuration: <rsc_defaults/>
( xml.c:1394 )       info: cib_perform_op: ++   <meta_attributes id="rsc-options">
( xml.c:1394 )       info: cib_perform_op: ++     <nvpair name="resource-stickiness" value="INFINITY" id="rsc-options-resource-stickiness"/>
( xml.c:1394 )       info: cib_perform_op: ++     <nvpair name="migration-threshold" value="1" id="rsc-options-migration-threshold"/>
( xml.c:1394 )       info: cib_perform_op: ++   </meta_attributes>
( xml.c:1394 )       info: cib_perform_op: ++ </rsc_defaults>
( xml.c:1387 )       info: cib_perform_op: ++ /cib/status/node_state[@id='3232261508']/transient_attributes[@id='3232261508']/instance_attributes[@id='status-3232261508']: <nvpair id="status-3232261508-shutdown" name="shutdown" value="0"/>
( xml.c:1399 )       info: cib_perform_op: +~
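Incidentally, the versions in the "Replaced 0.2.14 with 0.5.7" line are admin_epoch.epoch.num_updates triples. When comparing runs it can be handy to pull them out of a trace; a small sketch over that sample line (the sed pattern assumes the exact message format shown above):

```shell
# Sample cib_process_replace line from the trace above.
cat > /tmp/replace-sample.log <<'EOF'
( cib_ops.c:258 ) info: cib_process_replace: Replaced 0.2.14 with 0.5.7 from (null)
EOF

# Extract the before/after CIB versions (admin_epoch.epoch.num_updates).
sed -n 's/.*Replaced \([0-9.]*\) with \([0-9.]*\).*/\1 -> \2/p' /tmp/replace-sample.log
```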
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:

[snip]

If I do this however:

# cp start.xml 1.xml; tools/cibadmin --replace -o configuration --xml-file replace.some -V

I start to see what you see:

( xml.c:4985 )       info: validate_with_relaxng: Creating RNG parser context
( cib_file.c:268 )   info: cib_file_perform_op_delegate: cib_replace on configuration
( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
( xml.c:1487 )      trace: cib_perform_op: -- /configuration
( xml.c:1490 )      trace: cib_perform_op: + <cib epoch="2" num_updates="14" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Fri Mar 7 13:24:07 2014" update-origin="vm01" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="3232261507"/>
( xml.c:1490 )      trace: cib_perform_op: ++ <configuration>
( xml.c:1490 )      trace: cib_perform_op: ++   <crm_config>

Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b , and now with improved change detection: https://github.com/beekhof/pacemaker/commit/6f364db

But it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?