Re: [Pacemaker] pre_notify_demote is issued twice

2014-03-10 Thread Keisuke MORI
Hi,

2014-02-24 10:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:

 On 21 Feb 2014, at 2:19 pm, Andrew Beekhof and...@beekhof.net wrote:


 On 18 Feb 2014, at 1:23 pm, Andrew Beekhof and...@beekhof.net wrote:


 On 6 Feb 2014, at 7:45 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:

 Hi,

 I observed that pre_notify_demote is issued twice when a master
 resource is migrating.
 I'm wondering if this is the correct behavior.

 Steps to reproduce:

 - Start up a 2-node cluster configured for PostgreSQL streaming
 replication, using the pgsql RA as a master/slave resource (a
 configuration sketch follows below).
 - Kill the PostgreSQL process on the master node to induce a fail-over.
 - The fail-over succeeds as expected, but pre_notify_demote is
 executed twice on each node before the master resource is demoted.

 100% reproducible on my cluster.

 Pacemaker version: 1.1.11-rc4 (source build from the repo)
 OS: RHEL6.4

 I have never seen this on a Pacemaker 1.0.* cluster with the same
 configuration.

 The relevant logs and pe-inputs are attached.
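
 A minimal crm-shell sketch of such a pgsql master/slave resource, with
 notifications enabled, might look like the following. This is not the
 poster's actual configuration; the pgsql parameter values, node names
 and master_ip are assumptions for illustration only.

 {{{
 crm configure primitive prmPostgresql ocf:heartbeat:pgsql \
     params pgctl="/usr/bin/pg_ctl" psql="/usr/bin/psql" \
            pgdata="/var/lib/pgsql/data" rep_mode="sync" \
            node_list="rhel64-1 rhel64-2" master_ip="192.168.0.100" \
     op monitor interval="10s" timeout="60s" \
     op monitor interval="9s" role="Master" timeout="60s"

 crm configure ms msPostgresql prmPostgresql \
     meta master-max="1" master-node-max="1" clone-max="2" \
          clone-node-max="1" notify="true"
 }}}

 The notify="true" meta attribute is what makes Pacemaker send
 pre_notify_demote (and the other pre/post notifications) to every
 instance around each demote and promote.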


 Diagnostics:

 (1) The first transition caused by the process failure (pe-input-160)
 initiates pre_notify_demote on both nodes and cancels the slave
 monitor on the slave node.
 {{{
 171 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command:
 Initiating action 9: cancel prmPostgresql_cancel_1 on rhel64-2
 172 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command:
 Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on
 rhel64-1 (local)

 175 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command:
 Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on
 rhel64-2
 }}}

 (2) When the slave monitor cancellation completes, the transition is aborted
 by "Resource op removal".
 {{{
 176 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: match_graph_event:
 Action prmPostgresql_monitor_1 (9) confirmed on rhel64-2 (rc=0)
 177 Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request:
 Completed cib_delete operation for section status: OK (rc=0,
 origin=rhel64-2/crmd/21, version=0.37.9)
 178 Jan 30 16:08:59 rhel64-1 crmd[8143]: info:
 abort_transition_graph: te_update_diff:258 - Triggered transition
 abort (complete=0, node=rhel64-2, tag=lrm_rsc_op,
 id=prmPostgresql_monitor_1,
 magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) :
 Resource op removal
 }}}

 (3) The second transition is calculated after the abort (pe-input-161),
 which results in initiating pre_notify_demote again.

 If the demote didn't complete (or wasn't even attempted), then we must send 
 the pre_notify_demote again unfortunately.
 The real bug may well be that the transition shouldn't have been aborted.

 It looks legitimate:

 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: 
 te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, 
 tag=lrm_rsc_op, id=prmPostgresql_monitor_1, 
 magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : 
 Resource op removal

 It looks like get_cancel_action() was not functioning correctly:

https://github.com/beekhof/pacemaker/commit/9d77c99
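
As an aside for agent authors: since a transition abort can legitimately
cause pre_notify_demote to be delivered more than once before the demote
itself runs, the notify action of a master/slave agent should be safe to
repeat. A rough sketch of such a notify handler (not the pgsql RA's
actual code) could look like this:

{{{
#!/bin/sh
# Sketch only: the CRM_meta_notify_* variables are the ones Pacemaker
# exports to clone instances when notify=true; the handler bodies are
# placeholders.
notify() {
    ntype="$OCF_RESKEY_CRM_meta_notify_type"        # "pre" or "post"
    nop="$OCF_RESKEY_CRM_meta_notify_operation"     # "demote", "promote", ...

    case "$ntype-$nop" in
        pre-demote)
            # Only repeatable preparation here; the cluster may send
            # this twice if the transition is aborted before the demote.
            ;;
        post-demote)
            # Runs once the demote has actually completed.
            ;;
    esac
    return 0   # OCF_SUCCESS
}
}}}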


Thanks for looking into it.

I have confirmed that the issue is now resolved with the recent
revision in your repo at:
https://github.com/beekhof/pacemaker/commit/04ff1bd2d144e7defd6f1f67f6bde6fa95c428e1

Thanks!

-- 
Keisuke MORI




 Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed 
 cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, 
 version=0.37.9)

 It looks like part of the node status entry is being removed for rhel64-2,
 possibly as a result of:

 Jan 30 16:07:54 rhel64-2 crmd[25070]: info: erase_status_tag: Deleting 
 xpath: //node_state[@uname='rhel64-2']/transient_attributes

 The new cib code, being much faster, might help here too :)
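
 For anyone following along, the section being removed can also be
 inspected by hand. Assuming cibadmin's XPath support in these 1.1.x
 builds, something like:

 {{{
 # Show the transient attributes the erase_status_tag line refers to:
 cibadmin --query --xpath "//node_state[@uname='rhel64-2']/transient_attributes"

 # Or dump the whole status section that the cib_delete operated on:
 cibadmin --query -o status
 }}}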


 {{{
 227 Jan 30 16:09:01 rhel64-1 pengine[8142]:   notice:
 process_pe_message: Calculated Transition 15:
 /var/lib/pacemaker/pengine/pe-input-161.bz2
 229 Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command:
 Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on
 rhel64-1 (local)
 232 Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command:
 Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on
 rhel64-2
 }}}

 I think that the transition abort at (2) should not happen.

 Regards,
 --
 Keisuke MORI
 logs-pre-notify-20140206.tar.bz2
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-10 Thread Andrew Beekhof

On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

 Thanks for the quick response!
 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 Hello,
 
 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly the
 crm_mon output remains static.
 When I check the CPU usage, I see that one of the cores is at 100%, but I
 cannot actually match it to either corosync or one of the pacemaker
 processes.
 
 In such a case, this high CPU usage happens on all 7 nodes.
 I have to go to each node manually, stop pacemaker, restart corosync, then
 start pacemaker. Stopping pacemaker and corosync does not work in most
 cases; usually a kill -9 is needed.
 
 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
 Using udpu as transport, two rings on Gigabit Ethernet, rrp_mode passive.
 
 Logs are usually flooded with CPG related messages, such as:
 
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
 
 OR
 
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0 CPG
 messages  (1 remaining, last=10933): Try again (
 
 That is usually a symptom of corosync getting into a horribly confused state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.
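
 A hedged note for narrowing this down: the commands below are stock
 corosync 2.x / Linux tools, so they should be available on trusty; they
 don't fix anything, but they usually show whether a ring has gone
 faulty and which thread is burning the CPU that doesn't show up in the
 per-process view.

 {{{
 # Ring status of both rings (look for "FAULTY"):
 corosync-cfgtool -s

 # Quorum and membership as corosync sees it:
 corosync-quorumtool -s
 corosync-cmapctl | grep -i member

 # Per-thread CPU usage of the corosync process:
 top -H -p "$(pidof corosync)"
 }}}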
 
 
 
 As I wrote, I use Ubuntu trusty; the exact package versions are:
 
 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

Ah sorry, I seem to have missed that part.

 
 There are no updates available. The only option is to install from source, 
 but that would be very difficult to maintain and I'm not sure it would get rid 
 of this issue.
 
 What do you recommend?

The same thing as Lars, or switch to a distro that stays current with upstream 
(git shows 5 newer releases for that branch since it was released 3 years ago).
If you do build from source, it's probably best to go with v1.4.6.

 
 
 
 HTOP shows something like this (sorted by TIME+ descending):
 
  1  [100.0%]        Tasks: 59, 4 thr; 2 running
  2  [|  0.7%]       Load average: 1.00 0.99 1.02
  Mem[ 165/994MB]    Uptime: 1 day, 10:22:03
  Swp[   0/509MB]
 
  PID USER       PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  921 root        20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
 1277 snmp        20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
 1311 hacluster   20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
 1312 root        20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
 1611 root        -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
 1316 hacluster   20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
 1313 root        20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
 1314 hacluster   20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
 1309 root        20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
 1250 root        20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
 1315 hacluster   20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
 1252 root        20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
 1835 ntp         20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
  899 root        20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
 1642 root        20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
 4374 kamailio    20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 3079 root         0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6

Re: [Pacemaker] ordering cloned resources

2014-03-10 Thread Andrew Beekhof

On 9 Mar 2014, at 10:36 pm, Alexandre alxg...@gmail.com wrote:

 So...,
 
 It appears the problem doesn't come from the primitive but from the
 cloned resource. If I use the primitive instead of the clone in the
 order constraint (thus deleting the clone and the group), the second
 resource of the constraint starts up as expected.
 
 Any idea why?

Not without logs

 
 Should I upgrade this pretty old version of pacemaker?

Yes :)

 
 2014-03-08 10:36 GMT+01:00 Alexandre alxg...@gmail.com:
 Hi Andrew,
 
 I have tried stopping and starting the first resource of the ordering
 constraint (cln_san), hoping it would trigger a start attempt of the
 second resource of the ordering constraint (cln_mailstore).
 I tailed the syslog on the node where I was expecting the second
 resource to start, but nothing at all appeared in those logs (I grepped
 for 'pengine' as per your suggestion).
 
 I have done another test where I replaced the first resource of the
 ordering constraint with a very simple primitive (an lsb resource), and
 it worked in that case.
 
 I am wondering whether the issue comes from the rather complicated
 first resource. It is a cloned group which contains a primitive with
 conditional instance attributes...
 Are you aware of any specific issue in pacemaker 1.1.7 with this kind
 of resource?
 
 I will try to simplify the resources by getting rid of the conditional
 instance attributes and try again. In the meantime I'd be delighted to
 hear what you guys think about it.
 
 Regards, Alex.
 
 2014-03-07 4:21 GMT+01:00 Andrew Beekhof and...@beekhof.net:
 
 On 3 Mar 2014, at 3:56 am, Alexandre alxg...@gmail.com wrote:
 
 Hi,
 
 I am setting up a cluster on Debian wheezy.
 I have installed pacemaker using the Debian-provided packages (so I am
 running 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff).
 
 I have roughly 10 nodes, among which some act as SANs
 (exporting block devices using the AoE protocol) and the others act as
 initiators (they are actually mail servers, storing emails on the
 exported devices).
 Below are the resources defined for those nodes:
 
 xml <primitive class="ocf" id="pri_aoe1" provider="heartbeat" type="AoEtarget"> \
       <instance_attributes id="pri_aoe1.1-instance_attributes"> \
         <rule id="node-sanaoe01" score="1"> \
           <expression attribute="#uname" id="expr-node-sanaoe01" operation="eq" value="sanaoe01"/> \
         </rule> \
         <nvpair id="pri_aoe1.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
         <nvpair id="pri_aoe1.1-instance_attributes-nic" name="nic" value="eth0"/> \
         <nvpair id="pri_aoe1.1-instance_attributes-shelf" name="shelf" value="1"/> \
         <nvpair id="pri_aoe1.1-instance_attributes-slot" name="slot" value="1"/> \
       </instance_attributes> \
       <instance_attributes id="pri_aoe2.1-instance_attributes"> \
         <rule id="node-sanaoe02" score="2"> \
           <expression attribute="#uname" id="expr-node-sanaoe2" operation="eq" value="sanaoe02"/> \
         </rule> \
         <nvpair id="pri_aoe2.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
         <nvpair id="pri_aoe2.1-instance_attributes-nic" name="nic" value="eth1"/> \
         <nvpair id="pri_aoe2.1-instance_attributes-shelf" name="shelf" value="2"/> \
         <nvpair id="pri_aoe2.1-instance_attributes-slot" name="slot" value="1"/> \
       </instance_attributes> \
     </primitive>
 primitive pri_dovecot lsb:dovecot \
   op start interval=0 timeout=20 \
   op stop interval=0 timeout=30 \
   op monitor interval=5 timeout=10
 primitive pri_spamassassin lsb:spamassassin \
   op start interval=0 timeout=50 \
   op stop interval=0 timeout=60 \
   op monitor interval=5 timeout=20
 group grp_aoe pri_aoe1
 group grp_mailstore pri_dlm pri_clvmd pri_spamassassin pri_dovecot
 clone cln_mailstore grp_mailstore \
   meta ordered=false interleave=true clone-max=2
 clone cln_san grp_aoe \
   meta ordered=true interleave=true clone-max=2
 
 As I am in opt-in cluster mode (symmetric-cluster=false), I
 have the location constraints below for those hosts:
 
 location LOC_AOE_ETHERD_1 cln_san inf: sanaoe01
 location LOC_AOE_ETHERD_2 cln_san inf: sanaoe02
 location LOC_MAIL_STORE_1 cln_mailstore inf: ms01
 location LOC_MAIL_STORE_2 cln_mailstore inf: ms02
 
 So far so good. I want to make sure the initiators won't try to look
 for the exported devices before the targets have actually exported them.
 To do so, I thought I could use the following ordering constraint:
 
 order ORD_SAN_MAILSTORE inf: cln_san cln_mailstore
 
 Unfortunately, if I add this constraint the clone set cln_mailstore
 never starts (and even stops, if already started, when I add the constraint).
 
 Is there something wrong with this ordering rule?
 Where can I find information on what's going on?
 
 No errors in the logs?
 If you grep for 'pengine' does it want to start them or just leave them 
 stopped?
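
 A hedged sketch of how to check that, assuming the Debian syslog path
 and the crm_simulate tool shipped with 1.1.x:

 {{{
 # What did the policy engine decide for the two clones?
 grep -E 'pengine.*(cln_san|cln_mailstore)' /var/log/syslog

 # Ask the policy engine directly, against the live CIB, with scores:
 crm_simulate -L -s
 }}}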
 
 
 

Re: [Pacemaker] What is the reason that the node on which failure has not occurred is lost?

2014-03-10 Thread Andrew Beekhof

On 7 Mar 2014, at 5:35 pm, Yusuke Iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 2014-03-07 11:43 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 I don't understand... crm_mon doesn't look for changes to resources or 
 constraints and it should already be using the new faster diff format.
 
 [/me reads attachment]
 
 Ah, but perhaps I do understand after all :-)
 
 This is repeated over and over:
 
  notice: crm_diff_update:  [cib_diff_notify] Patch aborted: Application 
 of an update diff failed (-206)
 notice: xml_patch_version_check:  Current num_updates is too high (885 > 67)
 
 That would certainly drive up CPU usage and cause crm_mon to get left behind.
 Happily the fix for that should be: 
 https://github.com/beekhof/pacemaker/commit/6c33820
 
 I think the cib refresh is no longer repeated when the versions
 differ.
 Thank you for taking care of it.
 
 Now, I see another problem.
 
 If "crm configure load update" is performed while crm_mon is running,
 the information is no longer displayed.
 The information is displayed again if crm_mon is restarted.
 
 I executed the following command and captured the crm_mon log:
 # crm_mon --disable-ncurses -VV > crm_mon.log 2>&1
 
 I inspected the cib information inside crm_mon after the load was performed.
 
 Two configuration sections exist in the cib after the load.
 
 It seems this comes from the following operation, and the old section
 remains because the deletion of the configuration section failed:
   trace: cib_native_dispatch_internal: cib-reply
 change operation=delete path=/configuration/
 
 Below is a debug log acquired with an older Pacemaker.
 The element is not found because (null) tries to look up
 path=/configuration from the top of the document tree.
 Shouldn't the path essentially be path=/cib/configuration?

Yes.  Could you send me the cib as well as the update you're trying to load?
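
For reference, a hedged way to capture what is being asked for here
(the file names below are just examples):

{{{
# The full CIB at the time of the problem:
cibadmin --query > cib.xml

# The configuration as crmsh renders it:
crm configure show xml > config-from-crmsh.xml

# Plus the CLI file that was passed to "crm configure load update".
}}}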

 
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   (null)
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: cib
 epoch=2 num_updates=6 admin_epoch=0
 validate-with=pacemaker-1.2 crm_feature_set=3.0.9
 cib-last-written=Tue Mar  4 11:32:36 2014
 update-origin=rhel64rpmbuild update-client=crmd have-quorum=1
 dc-uuid=3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   configuration
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: crm_config
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 cluster_property_set id=cib-bootstrap-options
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=cib-bootstrap-options-dc-version name=dc-version
 value=1.1.10-2dbaf19/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=cib-bootstrap-options-cluster-infrastructure
 name=cluster-infrastructure value=corosync/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 /cluster_property_set
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /crm_config
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: nodes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 node id=3232261524 uname=rhel64rpmbuild/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /nodes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: resources/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
 constraints/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
 /configuration
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   status
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 node_state id=3232261524 uname=rhel64rpmbuild in_ccm=true
 crmd=online crm-debug-origin=do_state_transition join=member
 expected=member
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 lrm id=3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 lrm_resources/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   /lrm
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 transient_attributes id=3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 instance_attributes id=status-3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=status-3232261524-shutdown name=shutdown value=0/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=status-3232261524-probe_complete name=probe_complete
 value=true/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 /instance_attributes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 /transient_attributes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /node_state
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   /status
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /cib
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   /(null)
 
 
 Is this an already-known problem?
 
 I attach the report from when this occurred, and the crm_mon log.
 
 - crm_report
 

Re: [Pacemaker] What is the reason that the node on which failure has not occurred is lost?

2014-03-10 Thread Yusuke Iida
Hi, Andrew

I attach the CLI file that was loaded.
The loaded XML does not exist as a file, but judging from the log I
think it had the following form.
This log is extracted from the following report:
https://drive.google.com/file/d/0BwMFJItoO-fVWEw4Qnp0aHIzSm8/edit?usp=sharing

Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1365  )info:
cib_perform_op: Diff: +++ 0.3.0 (null)
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1438  )info:
cib_perform_op: -- /configuration
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1431  )info:
cib_perform_op: +  /cib:  @epoch=3, @num_updates=0
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1387  )info:
cib_perform_op: ++ /cib:  configuration/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  crm_config
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++cluster_property_set
id=cib-bootstrap-options
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=no-quorum-policy
value=ignore id=cib-bootstrap-options-no-quorum-policy/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=stonith-enabled
value=false id=cib-bootstrap-options-stonith-enabled/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=startup-fencing
value=false id=cib-bootstrap-options-startup-fencing/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=stonith-timeout
value=60s id=cib-bootstrap-options-stonith-timeout/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=crmd-transition-delay
value=2s id=cib-bootstrap-options-crmd-transition-delay/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++/cluster_property_set
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  /crm_config
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nodes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++node id=3232261508 uname=vm02/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++node id=3232261507 uname=vm01/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  /nodes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  resources
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++primitive id=prmDummy class=ocf
provider=heartbeat type=Dummy
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  !--#primitive prmDummy1
ocf:heartbeat:Dummy--/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  !--#location rsc_location-group1-1
group1 # rule 200: #uname eq vm01 # rule 100: #uname eq vm02--/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  instance_attributes
id=prmDummy-instance_attributes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgctl
value=/usr/bin/pg_ctl id=prmDummy-instance_attributes-pgctl/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=start_opt value=-p
5432 -h 192.168.xxx.xxx id=prmDummy-instance_attributes-start_opt/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=psql
value=/usr/bin/psql id=prmDummy-instance_attributes-psql/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgdata
value=/var/lib/pgsql/data id=prmDummy-instance_attributes-pgdata/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgdba
value=postgres id=prmDummy-instance_attributes-pgdba/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgport value=5432
id=prmDummy-instance_attributes-pgport/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgdb
value=template1 id=prmDummy-instance_attributes-pgdb/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  /instance_attributes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  operations
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:

Re: [Pacemaker] What is the reason that the node on which failure has not occurred is lost?

2014-03-10 Thread Andrew Beekhof
I tried replacing pe-input-2.bz2 with pe-input-3.bz2 and saw:

# cp start.xml 1.xml;  tools/cibadmin --replace --xml-file replace.xml -V
(  cib_file.c:268   )info: cib_file_perform_op_delegate:cib_replace on 
(null)
( cib_utils.c:338   )   trace: cib_perform_op:  Begin cib_replace op
(   cib_ops.c:258   )info: cib_process_replace: Replaced 0.2.14 with 
0.5.7 from (null)
( cib_utils.c:408   )   trace: cib_perform_op:  Inferring changes after 
cib_replace op
(   xml.c:3957  )info: __xml_diff_object:   
transient_attributes.3232261508 moved from 1 to 0 - 15
(   xml.c:3957  )info: __xml_diff_object:   lrm.3232261508 moved 
from 0 to 1 - 7
...
(   xml.c:1363  )info: cib_perform_op:  Diff: --- 0.2.14 2
(   xml.c:1365  )info: cib_perform_op:  Diff: +++ 0.6.0 
e89b8f8986ecf2dfd516fd48f1711fbf
(   xml.c:1431  )info: cib_perform_op:  +  /cib:  @epoch=6, 
@num_updates=0, @cib-last-written=Fri Mar  7 13:24:14 2014
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=no-quorum-policy value=ignore 
id=cib-bootstrap-options-no-quorum-policy/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=stonith-enabled value=false 
id=cib-bootstrap-options-stonith-enabled/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=startup-fencing value=false 
id=cib-bootstrap-options-startup-fencing/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=stonith-timeout value=60s 
id=cib-bootstrap-options-stonith-timeout/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=crmd-transition-delay value=2s 
id=cib-bootstrap-options-crmd-transition-delay/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
!--#primitive prmDummy1 ocf:heartbeat:Dummy--/
(   xml.c:1394  )info: cib_perform_op:  ++  
!--#location rsc_location-group1-1 group1 #rule 200: #uname eq 
vm01 #  rule 100: #uname eq vm02--/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy2 class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy2-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy3 class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy3-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy4 class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy4-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ /cib/configuration:  
rsc_defaults/
(   xml.c:1394  )info: cib_perform_op:  ++
meta_attributes id=rsc-options
(   xml.c:1394  )info: cib_perform_op:  ++  
nvpair name=resource-stickiness value=INFINITY 
id=rsc-options-resource-stickiness/
(   xml.c:1394  )info: cib_perform_op:  ++  
nvpair name=migration-threshold value=1 
id=rsc-options-migration-threshold/
(   xml.c:1394  )info: cib_perform_op:  ++
/meta_attributes
(   xml.c:1394  )info: cib_perform_op:  ++  
/rsc_defaults
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/status/node_state[@id='3232261508']/transient_attributes[@id='3232261508']/instance_attributes[@id='status-3232261508']:
  nvpair id=status-3232261508-shutdown name=shutdown value=0/
(   xml.c:1399  )info: cib_perform_op:  +~ 

Re: [Pacemaker] What is the reason that the node on which failure has not occurred is lost?

2014-03-10 Thread Andrew Beekhof

On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:

[snip]

 If I do this however:
 
 # cp start.xml 1.xml;  tools/cibadmin --replace -o configuration --xml-file 
 replace.some -V
 
 I start to see what you see:
 
 (   xml.c:4985  )info: validate_with_relaxng: Creating RNG 
 parser context
 (  cib_file.c:268   )info: cib_file_perform_op_delegate:  cib_replace on 
 configuration
 ( cib_utils.c:338   )   trace: cib_perform_op:Begin cib_replace op
 (   xml.c:1487  )   trace: cib_perform_op:-- /configuration
 (   xml.c:1490  )   trace: cib_perform_op:+  cib epoch=2 
 num_updates=14 admin_epoch=0 validate-with=pacemaker-1.2 
 crm_feature_set=3.0.9 cib-last-written=Fri Mar  7 13:24:07 2014 
 update-origin=vm01 update-client=crmd update-user=hacluster 
 have-quorum=1 dc-uuid=3232261507/
 (   xml.c:1490  )   trace: cib_perform_op:++   configuration
 (   xml.c:1490  )   trace: cib_perform_op:++ crm_config
 
 Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b ,

And now with improved change detection: 
https://github.com/beekhof/pacemaker/commit/6f364db

 but it looks like crmsh is doing something funny with its updates... does 
 anyone know what command it is running?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org