Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
07.03.2014 10:30, Vladislav Bogdanov wrote:
07.03.2014 05:43, Andrew Beekhof wrote:
On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
18.02.2014 03:49, Andrew Beekhof wrote:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all

I measured the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I stopped the vm01 node forcibly; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were logged in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise it fails with -206.

More details?

Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing. Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests.
Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml) relates to the full CIB operation (with the status section included); the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that section removed, as crmsh does. The resulting diffs differ only by digest, and that seems to be the exact issue.

  <cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
          <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
          <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="1" uname="booter-0"/>
        <node id="2" uname="booter-1"/>
      </nodes>
      <resources/>
      <constraints/>
    </configuration>
    <status>
      <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
        <lrm id="1">
          <lrm_resources/>
        </lrm>
        <transient_attributes id="1">
          <instance_attributes id="status-1">
            <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
            <nvpair id="status-1-probe_complete" name="probe_complete" value="true"/>
          </instance_attributes>
        </transient_attributes>
      </node_state>
    </status>
  </cib>

  <cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
          <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
          <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="1" uname="booter-0"/>
        <node id="2" uname="booter-1"/>
      </nodes>
      <resources/>
      <constraints/>
    </configuration>
  </cib>

  <cib epoch="3" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
          <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="1" uname="booter-0"/>
        <node id="2" uname="booter-1"/>
      </nodes>
      <resources/>
      <constraints/>
    </configuration>
    <status>
      <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" [attachment truncated]
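The digest mismatch described above comes down to hashing different inputs: crmsh fed crm_diff CIBs with the status section stripped, while the cib daemon validated the digest against the full CIB. A toy Python sketch of the effect — this is NOT pacemaker's real digest code (which hashes its own canonical XML serialization); the hash and serialization here are illustrative only:

```python
import hashlib
import xml.etree.ElementTree as ET

def toy_digest(cib_xml: str) -> str:
    """Hash a crudely canonicalized dump: sorted attributes, no whitespace.
    Stands in for pacemaker's digest of a canonical CIB serialization."""
    root = ET.fromstring(cib_xml)

    def dump(elem):
        attrs = "".join(f' {k}="{v}"' for k, v in sorted(elem.attrib.items()))
        return f"<{elem.tag}{attrs}>" + "".join(dump(c) for c in elem) + f"</{elem.tag}>"

    return hashlib.md5(dump(root).encode()).hexdigest()

# Same configuration, but one copy has <status> stripped, as crmsh did:
full = '<cib epoch="4"><configuration/><status><node_state id="1"/></status></cib>'
edited = '<cib epoch="4"><configuration/></cib>'

print(toy_digest(full) == toy_digest(edited))  # → False: the digests diverge
```

Any digest computed over the status-less copy cannot match one computed over the full CIB, which is why hand-replacing the digest made the patch apply.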
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-03-11 14:21 GMT+09:00 Andrew Beekhof and...@beekhof.net:
On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:
[snip]

If I do this however:

  # cp start.xml 1.xml; tools/cibadmin --replace -o configuration --xml-file replace.some -V

I start to see what you see:

  ( xml.c:4985 )      info: validate_with_relaxng: Creating RNG parser context
  ( cib_file.c:268 )  info: cib_file_perform_op_delegate: cib_replace on configuration
  ( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
  ( xml.c:1487 )      trace: cib_perform_op: -- /configuration
  ( xml.c:1490 )      trace: cib_perform_op: +  <cib epoch="2" num_updates="14" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Fri Mar 7 13:24:07 2014" update-origin="vm01" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="3232261507"/>
  ( xml.c:1490 )      trace: cib_perform_op: ++ <configuration>
  ( xml.c:1490 )      trace: cib_perform_op: ++ <crm_config>

Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b , and now with improved change detection: https://github.com/beekhof/pacemaker/commit/6f364db

I confirmed that the problem where crm_mon did not display updates has been solved.

BTW, the following log message has started appearing recently. Operation seems unaffected, but does it indicate a problem?

  Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?

The execution result of the following command remained in /var/log/messages:

  Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I am using crmsh-1.2.6-rc3.
Thanks, Yusuke

-- 
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker/corosync freeze
-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Tuesday, March 11, 2014 12:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

Thanks for the quick response!

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Friday, March 07, 2014 3:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote:

Hello,

We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the CPU usage, I see that one of the cores uses 100% CPU, but I cannot actually match it to either corosync or one of the pacemaker processes. In such a case, this high CPU usage happens on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most cases; usually a kill -9 is needed.

Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
Logs are usually flooded with CPG-related messages, such as:

  Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)

OR

  Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (

That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though.

As I wrote, I use Ubuntu trusty; the exact package versions are: corosync 2.3.0-1ubuntu5, pacemaker 1.1.10+git20130802-1ubuntu2.

Ah sorry, I seem to have missed that part.

There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend?

The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, it's probably best to go with v1.4.6.

Hm, I am a bit confused here. We are using 2.3.0, which was released approx. a year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old version. Could you please clarify a bit? :) Lars recommends the 2.3.3 git tree. I might end up trying both, but just want to make sure I am not misunderstanding something badly. Thank you!
HTOP shows something like this (sorted by TIME+ descending):

  1 [100.0%]        Tasks: 59, 4 thr; 2 running
  2 [|   0.7%]      Load average: 1.00 0.99 1.02
  Mem[ 165/994MB]   Uptime: 1 day, 10:22:03
  Swp[   0/509MB]

   PID USER      PRI NI  VIRT   RES   SHR S CPU% MEM%   TIME+ Command
   921 root       20  0  188M 49220 33856 R  0.0  4.8 3h33:58 /usr/sbin/corosync
  1277 snmp       20  0 45708  4248  1472 S  0.0  0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
  1311 hacluster  20  0  109M 16160  9640 S  0.0  1.6 1:12.71 /usr/lib/pacemaker/cib
  1312 root       20  0  104M  7484  3780 S  0.0  0.7 0:38.06 /usr/lib/pacemaker/stonithd
  1611 root       -2  0  4408  2356  2000 S  0.0  0.2 0:24.15 /usr/sbin/watchdog
  1316 hacluster  20  0  122M  9756  5924 S  0.0  1.0 0:22.62 /usr/lib/pacemaker/crmd
  1313 root       20  0 81784  3800  2876 S  0.0  0.4 0:18.64 /usr/lib/pacemaker/lrmd
  1314 hacluster  20  0 96616  4132  2604 S  0.0  0.4 0:16.01 /usr/lib/pacemaker/attrd
  1309 root       20  0  104M  4804  2580 S  0.0  0.5 0:15.56 pacemakerd
  1250 root       20  0 33000  1192   928 S  0.0  0.1 0:13.59 ha_logd: read process
  1315 hacluster  20  0 73892  2652
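The crm_cs_flush "Try again" flood quoted above is mechanical enough to detect from logs. A hedged sketch (a hypothetical helper, not part of pacemaker or corosync) that measures the longest run of consecutive failed CPG flush attempts — a long run suggests the stuck-CPG condition described in this thread:

```python
import re

# Matches pacemaker's crm_cs_flush retry messages, e.g.
# "crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)"
RETRY = re.compile(r"crm_cs_flush:\s+Sent 0 CPG messages \(\d+ remaining.*Try again")

def max_retry_run(lines):
    """Return the longest run of consecutive 'Sent 0 ... Try again' lines."""
    longest = run = 0
    for line in lines:
        if RETRY.search(line):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

log = [
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
]
print(max_retry_run(log))  # → 3
```

A handful of retries is normal under load; hundreds in a row, as in the logs above, points at corosync no longer accepting messages.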
Re: [Pacemaker] Pacemaker/corosync freeze
On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote:

[snip - full quote of the earlier exchange]

Hm, I am a bit confused here. We are using 2.3.0,

I swapped the 2 for a 1 somehow. A bit distracted, sorry.

which was released approx. a year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old version. Could you please clarify a bit? :) Lars recommends the 2.3.3 git tree. I might end up trying both, but just want to make sure I am not misunderstanding something badly. Thank you!

[snip - htop output quoted earlier]
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 6:51 pm, Yusuke Iida yusk.i...@gmail.com wrote:

[snip - quote of the earlier message]

BTW, the following log message has started appearing recently. Operation seems unaffected, but does it indicate a problem?

  Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

That's interesting... is that with the fixes mentioned above?

but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?

The execution result of the following command remained in /var/log/messages:

  Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I'm somewhat confused at this point: if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations only used for the load operation?

I am using crmsh-1.2.6-rc3.

Thanks, Yusuke
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:

[snip - quote of the earlier thread]

The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests.
[snip - description of the attached XML files; the resulting diffs differ only by digest]

This should help: as long as crmsh isn't passing -c to crm_diff, the digest will no longer be present.

https://github.com/beekhof/pacemaker/commit/c8d443d
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 12 Mar 2014, at 8:40 am, Andrew Beekhof and...@beekhof.net wrote:

[snip - quote of the earlier thread]

This should help: as long as crmsh isn't passing -c to crm_diff, the digest will no longer be present. https://github.com/beekhof/pacemaker/commit/c8d443d

Github seems to be doing something weird at the moment... here's the raw patch:

commit c8d443d8d1604dde2727cf716951231ed05926e4
Author: Andrew Beekhof and...@beekhof.net
Date:   Wed Mar 12 08:38:58 2014 +1100

    Fix: crm_diff: Allow the generation of xml patchsets without digests

diff --git a/tools/xml_diff.c b/tools/xml_diff.c
index c8673b9..b98859e 100644
--- a/tools/xml_diff.c
+++ b/tools/xml_diff.c
@@ -199,7 +199,7 @@ main(int argc, char **argv)
     xml_calculate_changes(object_1, object_2);
     crm_log_xml_debug(object_2, xml_file_2?xml_file_2:"target");
 
-    output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, TRUE);
+    output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, as_cib);
 
     if(as_cib && output) {
         int add[] = { 0, 0, 0 };
Re: [Pacemaker] hangs pending
Sorry for the delay, sometimes it takes a while to rebuild the necessary context.

On 5 Mar 2014, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote:
05.03.2014, 04:04, Andrew Beekhof and...@beekhof.net:
On 25 Feb 2014, at 8:30 pm, Andrey Groshev gre...@yandex.ru wrote:
21.02.2014, 12:04, Andrey Groshev gre...@yandex.ru:
21.02.2014, 05:53, Andrew Beekhof and...@beekhof.net:
On 19 Feb 2014, at 7:53 pm, Andrey Groshev gre...@yandex.ru wrote:
19.02.2014, 09:49, Andrew Beekhof and...@beekhof.net:
On 19 Feb 2014, at 4:18 pm, Andrey Groshev gre...@yandex.ru wrote:
19.02.2014, 09:08, Andrew Beekhof and...@beekhof.net:
On 19 Feb 2014, at 4:00 pm, Andrey Groshev gre...@yandex.ru wrote:
19.02.2014, 06:48, Andrew Beekhof and...@beekhof.net:
On 18 Feb 2014, at 11:05 pm, Andrey Groshev gre...@yandex.ru wrote:

Hi, ALL and Andrew!

Today is a good day - I killed a lot, and a lot of shooting at me. In general, I am happy (almost like an elephant) :)

Apart from the resources, eight processes on the node are important to me: corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd. I killed them with different signals (4, 6, 11 and even 9). The behavior does not depend on the signal number - that's good. If STONITH sends a reboot to the node, it is rebooted and rejoins the cluster - that's good too. But the behavior differs depending on which daemon is killed. There turned out to be four groups:

1. corosync, cib - STONITH works 100%. Killed via any signal - STONITH is called and the node reboots.
2. lrmd, crmd - strange STONITH behavior. Sometimes STONITH is called, with the corresponding reaction. Sometimes the daemon restarts and resources restart with a large delay on MS:pgsql. One time after a crmd restart, pgsql did not restart.
3. stonithd, attrd, pengine - no STONITH needed. These daemons simply restart; resources stay running.
4. pacemakerd - nothing happens. And then I can kill any process of the third group, and they do not restart.

Generally: don't touch corosync, cib and maybe lrmd, crmd. What do you think about this? The main question of this topic we have settled, but this varied behavior is another big problem.

Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2

Which of the various conditions above do the logs cover?

All of the variations, in one day.

Are you trying to torture me? Can you give me a rough idea of what happened when?

No - there are 8 processes by 4 signals, and repeats of the experiments with unknown outcome :) It is easier to conduct new experiments with individual new logs. Which variant is more interesting?

The long delay in restarting pgsql. Everything else seems correct.

It didn't even try to start pgsql. In the logs are three tests of "kill -s4" on the lrmd pid: 1. STONITH 2. STONITH 3. hangs

It's waiting on a value for default_ping_set. It seems we're calling monitor for pingCheck but for some reason it's not performing an update:

# grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_1:2816 - exited with rc=0
Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_1:2816:stderr [ -- empty -- ]
Feb 19
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On 8 Mar 2014, at 11:31 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote:

I provoke a power off of ovirteng01. The fencing agent works OK on ovirteng02 and reboots it. I stop the boot of ovirteng01 at the grub prompt to simulate a problem in boot (for example, the system dropped to console mode due to a filesystem problem). In the meantime ovirteng02 becomes master of the drbd resource, but doesn't start the group.

Can you attach the following file from ovirteng02: /var/lib/pacemaker/pengine/pe-input-1082.bz2

That will hold the answer.
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On Tue, Mar 11, 2014 at 11:52 PM, Andrew Beekhof and...@beekhof.net wrote: On 8 Mar 2014, at 11:31 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote: I provoke power off of ovirteng01. Fencing agent works ok on ovirteng02 and reboots it. I stop boot ofovirteng01 at grub prompt to simulate problem in boot (for example system put in console mode due to filesystem problem) In the mean time ovirteng02 becomes master of drbd resource, but doesn't start the group Can you attach the following file from ovirteng02: /var/lib/pacemaker/pengine/pe-input-1082.bz2 That will hold the answer Thanks for your time Andrew. Here it is: https://drive.google.com/file/d/0BwoPbcrMv8mvNXI0M0dYenlRUFU/edit?usp=sharing I note this inside the file: constraints rsc_colocation id=colocation-ovirt-ms_OvirtData-INFINITY rsc=ovirt rsc-role=Started score=INFINITY with-rsc=ms_OvirtData with-rsc-role=Master/ rsc_order first=ms_OvirtData first-action=promote id=order-ms_OvirtData-ovirt-mandatory then=ovirt then-action=start/ rsc_location id=cli-ban-ovirt-on-ovirteng02.localdomain.local rsc=ovirt role=Started node=ovirteng02.localdomain.local score=-INFINITY/ rsc_location rsc=ms_OvirtData id=drbd-fence-by-handler-ovirt-ms_OvirtData rule role=Master score=-INFINITY id=drbd-fence-by-handler-ovirt-rule-ms_OvirtData expression attribute=#uname operation=ne value=ovirteng02.localdomain.local id=drbd-fence-by-handler-ovirt-expr-ms_OvirtData/ /rule /rsc_location /constraints does this mean that a constraint remained for some reason after a previous test, so that ovirteng02 is unable to run ovirt group? Can I check previous pe-input files to debug when constraint was put? By the way I just checked again both nodes with power off when primary and it works for both as expected. If I reproduce what above didn't work (so poweroff of ovirteng01 while master and with group running) the group correctly starts now on ovirteng02. 
While keeping ovirteng01 (rebooted by the fencing agent) at the grub prompt, the command pcs cluster edit gives this on ovirteng02:

<constraints>
  <rsc_colocation id="colocation-ovirt-ms_OvirtData-INFINITY" rsc="ovirt" rsc-role="Started" score="INFINITY" with-rsc="ms_OvirtData" with-rsc-role="Master"/>
  <rsc_order first="ms_OvirtData" first-action="promote" id="order-ms_OvirtData-ovirt-mandatory" then="ovirt" then-action="start"/>
  <rsc_location rsc="ms_OvirtData" id="drbd-fence-by-handler-ovirt-ms_OvirtData">
    <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-ovirt-rule-ms_OvirtData">
      <expression attribute="#uname" operation="ne" value="ovirteng02.localdomain.local" id="drbd-fence-by-handler-ovirt-expr-ms_OvirtData"/>
    </rule>
  </rsc_location>
</constraints>

So the problem seems to be the line

<rsc_location id="cli-ban-ovirt-on-ovirteng02.localdomain.local" rsc="ovirt" role="Started" node="ovirteng02.localdomain.local" score="-INFINITY"/>

correct? Could it be the effect of a pcs resource move ovirt without a pcs resource clear ovirt?

Gianluca
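[Editor's note: for reference, a move without a clear does leave exactly this kind of "cli-ban" location constraint behind. A minimal sketch of the sequence, under the assumption of a pcs version from that era; the grep pattern is purely illustrative:]

```shell
# "move" works by adding a -INFINITY "cli-ban" location constraint
# pinning the resource away from its current node...
pcs resource move ovirt

# ...and that constraint stays in the CIB until explicitly removed:
pcs constraint location show | grep cli-ban

# Remove the leftover constraint (can be run from any node,
# not only the one where the move was issued):
pcs resource clear ovirt
```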
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On Wed, Mar 12, 2014 at 12:37 AM, Andrew Beekhof and...@beekhof.net wrote:

It was put in when drbd called: fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; When and why it called that is not my area of expertise though.

The constraint put by crm-fence-peer.sh was

<rsc_location rsc="ms_OvirtData" id="drbd-fence-by-handler-ovirt-ms_OvirtData">
  <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-ovirt-rule-ms_OvirtData">
    <expression attribute="#uname" operation="ne" value="ovirteng02.localdomain.local" id="drbd-fence-by-handler-ovirt-expr-ms_OvirtData"/>
  </rule>
</rsc_location>

and I think it was good, in the sense that from then on only ovirteng02 could run the drbd resource as master, as ovirteng01 was fenced. But the problem actually was the other constraint

<rsc_location id="cli-ban-ovirt-on-ovirteng02.localdomain.local" rsc="ovirt" role="Started" node="ovirteng02.localdomain.local" score="-INFINITY"/>

preventing ovirteng02 from running the ovirt group. Going backward in the logs, I see that the constraint was put in two days before, during my previous tests (I find it in pe-input-1066.bz2). And if I reproduce a pcs resource move ovirt now, I see that the same constraint is put in, and it is removed when I run pcs resource clear ovirt (I can run it on any node, not necessarily the one where I ran the move operation).

Gianluca
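[Editor's note: this answers the earlier question about checking previous pe-input files. The policy-engine inputs saved under /var/lib/pacemaker/pengine can be replayed offline with crm_simulate, which is one way to pinpoint when a constraint first appeared. A sketch, using the file name mentioned in the thread:]

```shell
# Replay a saved policy-engine input: show the cluster state it
# encoded and the actions the policy engine scheduled from it.
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-1066.bz2

# The raw XML (including the constraints section) can also be
# inspected directly:
bzcat /var/lib/pacemaker/pengine/pe-input-1066.bz2 | grep cli-ban
```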
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On 12 Mar 2014, at 10:32 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote:

[...]

does this mean that a constraint remained for some reason after a previous test, so that ovirteng02 is unable to run the ovirt group? Can I check previous pe-input files to debug when the constraint was put?

It was put in when drbd called: fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; When and why it called that is not my area of expertise though.

[...]
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:

Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

Thats interesting... is that with the fixes mentioned above?

I'm sorry, the above log is not output by the newest Pacemaker. The newest version produces the following logs instead:

Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c397a0, rsc_defaults
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c397a0, rsc_defaults

I checked the code of te_update_diff. Shouldn't the following check be changed, so that a change to fencing-topology or rsc_defaults is processed as a change under the configuration section? (String quotes restored below; the list archive stripped them.)

diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
index dd57660..f97bab5 100644
--- a/crmd/te_callbacks.c
+++ b/crmd/te_callbacks.c
@@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
 
         if(xpath == NULL) {
             /* Version field, ignore */
-        } else if(strstr(xpath, "/cib/configuration/")) {
+        } else if(strstr(xpath, "/cib/configuration")) {
             abort_transition(INFINITY, tg_restart, "Non-status change", change);
         } else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") || safe_str_eq(name, XML_CIB_TAG_TICKETS)) {

What do you think of such a change?

I attach the report from this time; the trace log of te_update_diff is also included.
https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing

Regards,
Yusuke

but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?
The execution result of the following command remained in /var/log/messages:

Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I'm somewhat confused at this point: if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations used only for the load operation?

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
12.03.2014 00:37, Andrew Beekhof wrote:

... I'm somewhat confused at this point: if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations used only for the load operation?

It uses one of two methods, depending on the pacemaker version.
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On 12 Mar 2014, at 10:56 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote:

[...]

And if I reproduce a pcs resource move ovirt now

Ah, yes. This would explain that part.

[...] it is removed when I run pcs resource clear ovirt (I can run it on any node, not necessarily the one where I ran the move operation)

Gianluca