Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-31 Thread Yusuke Iida
Hi, Andrew

crm_mon still has code that refreshes its copy of the cib to the newest
version whenever pcmk_err_old_data is received.

Since this processing should now be unnecessary, just like the equivalent
code that was already changed in stonithd, I have corrected it.

Please merge the following pull request if it looks acceptable.
https://github.com/ClusterLabs/pacemaker/pull/477
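
For illustration, the general pattern being removed looks like this (a
standalone sketch with stand-in names, not the actual crm_mon code or the
contents of the pull request):

#include <stdio.h>

/* Stand-in for pacemaker's pcmk_err_old_data error code (assumption:
 * the real constant lives in the pacemaker headers). */
#define PCMK_ERR_OLD_DATA 205

/* When applying a received diff fails because our copy of the CIB is
 * already newer, the stale diff can simply be ignored; the old behaviour
 * was to re-query the entire CIB at this point. */
static void handle_diff_result(int rc)
{
    if (rc == -PCMK_ERR_OLD_DATA) {
        printf("ignoring stale diff (rc=%d), no refresh needed\n", rc);
        return;
    }
    /* otherwise the diff was applied (or a real error needs handling) */
}

int main(void)
{
    handle_diff_result(-PCMK_ERR_OLD_DATA);
    return 0;
}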

Regards,
Yusuke

2014-03-18 9:56 GMT+09:00 Andrew Beekhof and...@beekhof.net:

 On 12 Mar 2014, at 1:45 pm, Yusuke Iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
 te_update_diff: Ingoring create operation for /cib 0xf91c10,
 configuration

 Thats interesting... is that with the fixes mentioned above?
 I'm sorry, the log above is not output by the newest Pacemaker.
 The following logs appear with the latest code:

 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
 te_update_diff:  Handling create operation for /cib/configuration
 0x1c37c60, fencing-topology
 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
 te_update_diff:  Ingoring create operation for /cib/configuration
 0x1c37c60, fencing-topology
 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
 te_update_diff:  Handling create operation for /cib/configuration
 0x1c397a0, rsc_defaults
 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
 te_update_diff:  Ingoring create operation for /cib/configuration
 0x1c397a0, rsc_defaults

 I checked the code of te_update_diff.
 Shouldn't the following condition be changed so that changes to fencing-topology
 or rsc_defaults are handled as changes under the configuration section?

 Perfect!

   https://github.com/beekhof/pacemaker/commit/1c285ac

 Thanks to everyone for giving the new CIB a pounding, we should be in very 
 good shape for a release soon :-)


 diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
 index dd57660..f97bab5 100644
 --- a/crmd/te_callbacks.c
 +++ b/crmd/te_callbacks.c
 @@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
 if(xpath == NULL) {
     /* Version field, ignore */

 -} else if(strstr(xpath, "/cib/configuration/")) {
 +} else if(strstr(xpath, "/cib/configuration")) {
     abort_transition(INFINITY, tg_restart, "Non-status change", change);

 } else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") ||
           safe_str_eq(name, XML_CIB_TAG_TICKETS)) {

 What do you think of a change like this?

 I attach the report from this time.
 The trace log of te_update_diff is also included.
 https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing

 Regards,
 Yusuke



 but it looks like crmsh is doing something funny with its updates... 
 does anyone know what command it is running?

 The following command invocation was recorded in /var/log/messages:

 Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
 cibadmin -p -R --force

 I'm somewhat confused at this point: if crmsh is using --replace, then
 why is it doing diff calculations?
 Or are replace operations only used for the load operation?




 --
 
 METRO SYSTEMS CO., LTD

 Yusuke Iida
 Mail: yusk.i...@gmail.com
 

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-17 Thread Andrew Beekhof

On 12 Mar 2014, at 1:45 pm, Yusuke Iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
 te_update_diff: Ingoring create operation for /cib 0xf91c10,
 configuration
 
 Thats interesting... is that with the fixes mentioned above?
 I'm sorry, the log above is not output by the newest Pacemaker.
 The following logs appear with the latest code:
 
 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
 te_update_diff:  Handling create operation for /cib/configuration
 0x1c37c60, fencing-topology
 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
 te_update_diff:  Ingoring create operation for /cib/configuration
 0x1c37c60, fencing-topology
 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
 te_update_diff:  Handling create operation for /cib/configuration
 0x1c397a0, rsc_defaults
 Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
 te_update_diff:  Ingoring create operation for /cib/configuration
 0x1c397a0, rsc_defaults
 
 I checked the code of te_update_diff.
 Shouldn't the following condition be changed so that changes to fencing-topology
 or rsc_defaults are handled as changes under the configuration section?

Perfect!

  https://github.com/beekhof/pacemaker/commit/1c285ac

Thanks to everyone for giving the new CIB a pounding, we should be in very good 
shape for a release soon :-)

 
 diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
 index dd57660..f97bab5 100644
 --- a/crmd/te_callbacks.c
 +++ b/crmd/te_callbacks.c
 @@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
 if(xpath == NULL) {
     /* Version field, ignore */

 -} else if(strstr(xpath, "/cib/configuration/")) {
 +} else if(strstr(xpath, "/cib/configuration")) {
     abort_transition(INFINITY, tg_restart, "Non-status change", change);

 } else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") ||
           safe_str_eq(name, XML_CIB_TAG_TICKETS)) {
 
 What do you think of a change like this?
 
 I attach the report from this time.
 The trace log of te_update_diff is also included.
 https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing
 
 Regards,
 Yusuke
 
 
 
 but it looks like crmsh is doing something funny with its updates... does 
 anyone know what command it is running?
 
 The following command invocation was recorded in /var/log/messages:
 
 Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
 cibadmin -p -R --force
 
 I'm somewhat confused at this point: if crmsh is using --replace, then
 why is it doing diff calculations?
 Or are replace operations only used for the load operation?
 
 
 
 
 -- 
 
 METRO SYSTEMS CO., LTD
 
 Yusuke Iida
 Mail: yusk.i...@gmail.com
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-12 Thread Vladislav Bogdanov
12.03.2014 00:40, Andrew Beekhof wrote:
 
 On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 07.03.2014 10:30, Vladislav Bogdanov wrote:
 07.03.2014 05:43, Andrew Beekhof wrote:

 On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com 
 wrote:

 18.02.2014 03:49, Andrew Beekhof wrote:

 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, all

 I am measuring the performance of Pacemaker with the following combination:
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2

 All nodes are KVM virtual machines.

 After starting 14 nodes, I forcibly stopped the vm01 node from the inside;
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes were
 separated from the cluster.

 Corosync then outputs the "Retransmit List:" log message in large quantities.

 Probably best to poke the corosync guys about this.

 However, <= .11 is known to cause significant CPU usage with that many
 nodes.
 I can easily imagine this starving corosync of resources and causing
 breakage.

 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.

 Andrew, the current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the digest
 in the diff by hand with what the cib daemon calculates as expected, it
 applies correctly. Otherwise it fails with -206.

 More details?

 Hmmm...
 seems to be crmsh-specific,
 Cannot reproduce with pure-XML editing.
 Kristoffer, does 
 http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

 The problem seems to be caused by the fact that crmsh does not provide
 the status section in either the orig or the new XML given to crm_diff,
 and digest generation seems to rely on that, so crm_diff and the cib
 daemon produce different digests.

 Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml)
 relates to the full CIB operation (with the status section included),
 the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that
 section removed, as crmsh does.

 The resulting diffs differ only by digest, and that seems to be the exact issue.
 
 This should help.  As long as crmsh isn't passing -c to crm_diff, then the 
 digest will no longer be present.
 
   https://github.com/beekhof/pacemaker/commit/c8d443d

Yep, that helped.
Thank you!


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-11 Thread Vladislav Bogdanov
07.03.2014 10:30, Vladislav Bogdanov wrote:
 07.03.2014 05:43, Andrew Beekhof wrote:

 On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 18.02.2014 03:49, Andrew Beekhof wrote:

 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, all

 I am measuring the performance of Pacemaker with the following combination:
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2

 All nodes are KVM virtual machines.

 After starting 14 nodes, I forcibly stopped the vm01 node from the inside;
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes were
 separated from the cluster.

 Corosync then outputs the "Retransmit List:" log message in large quantities.

 Probably best to poke the corosync guys about this.

 However, <= .11 is known to cause significant CPU usage with that many
 nodes.
 I can easily imagine this starving corosync of resources and causing
 breakage.

 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.

 Andrew, the current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the digest
 in the diff by hand with what the cib daemon calculates as expected, it
 applies correctly. Otherwise it fails with -206.

 More details?
 
 Hmmm...
 seems to be crmsh-specific,
 Cannot reproduce with pure-XML editing.
 Kristoffer, does 
 http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

The problem seems to be caused by the fact that crmsh does not provide
the status section in either the orig or the new XML given to crm_diff,
and digest generation seems to rely on that, so crm_diff and the cib
daemon produce different digests.

Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml)
relates to the full CIB operation (with the status section included),
the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that
section removed, as crmsh does.

The resulting diffs differ only by digest, and that seems to be the exact issue.
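
For reference, the comparison can be reproduced with something like the
following (file names as in the attachments; crm_diff takes -o/--original
and -n/--new, and -c/--cib makes the output CIB-style):

  crm_diff -o orig.xml -n new.xml > patch.xml
  crm_diff -o orig-edited.xml -n new-edited.xml > patch-edited.xml
  diff patch.xml patch-edited.xml   # only the digest attribute should differ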


<cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="booter-0"/>
      <node id="2" uname="booter-1"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
  <status>
    <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="1">
        <lrm_resources/>
      </lrm>
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
          <nvpair id="status-1-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>
<cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="booter-0"/>
      <node id="2" uname="booter-1"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
</cib>
<cib epoch="3" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="booter-0"/>
      <node id="2" uname="booter-1"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
  <status>
    <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member"

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-11 Thread Yusuke Iida
Hi, Andrew

2014-03-11 14:21 GMT+09:00 Andrew Beekhof and...@beekhof.net:

 On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:

 [snip]

 If I do this however:

 # cp start.xml 1.xml;  tools/cibadmin --replace -o configuration --xml-file 
 replace.some -V

 I start to see what you see:

 (   xml.c:4985  )info: validate_with_relaxng: Creating RNG 
 parser context
 (  cib_file.c:268   )info: cib_file_perform_op_delegate:  cib_replace on 
 configuration
 ( cib_utils.c:338   )   trace: cib_perform_op:Begin cib_replace op
 (   xml.c:1487  )   trace: cib_perform_op:-- /configuration
 (   xml.c:1490  )   trace: cib_perform_op:+  cib epoch=2 
 num_updates=14 admin_epoch=0 validate-with=pacemaker-1.2 
 crm_feature_set=3.0.9 cib-last-written=Fri Mar  7 13:24:07 2014 
 update-origin=vm01 update-client=crmd update-user=hacluster 
 have-quorum=1 dc-uuid=3232261507/
 (   xml.c:1490  )   trace: cib_perform_op:++   configuration
 (   xml.c:1490  )   trace: cib_perform_op:++ crm_config

 Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b ,

 And now with improved change detection: 
 https://github.com/beekhof/pacemaker/commit/6f364db

I confirmed that the problem where crm_mon does not display updates
has been solved.

BTW,
the following logs have started to appear recently.
Operation seems unaffected, but is there any problem when these logs appear?

Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
te_update_diff: Ingoring create operation for /cib 0xf91c10,
configuration


 but it looks like crmsh is doing something funny with its updates... does 
 anyone know what command it is running?

The following command invocation was recorded in /var/log/messages:

Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
cibadmin -p -R --force

I am using crmsh-1.2.6-rc3.
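
For reference, as I read cibadmin's options, that invocation takes a
complete CIB on standard input and replaces the current contents with it:
-p/--xml-pipe reads the XML from stdin, -R/--replace performs the
replacement, and --force makes it proceed even if the operation would
normally be refused. Roughly:

  cibadmin -p -R --force < full-cib.xml   # file name is only illustrative; crmsh feeds the XML on stdin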

Thanks,
Yusuke

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-11 Thread Andrew Beekhof

On 11 Mar 2014, at 6:51 pm, Yusuke Iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 
 2014-03-11 14:21 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 
 On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:
 
 [snip]
 
 If I do this however:
 
 # cp start.xml 1.xml;  tools/cibadmin --replace -o configuration --xml-file 
 replace.some -V
 
 I start to see what you see:
 
 (   xml.c:4985  )info: validate_with_relaxng: Creating RNG 
 parser context
 (  cib_file.c:268   )info: cib_file_perform_op_delegate:  cib_replace 
 on configuration
 ( cib_utils.c:338   )   trace: cib_perform_op:Begin cib_replace op
 (   xml.c:1487  )   trace: cib_perform_op:-- /configuration
 (   xml.c:1490  )   trace: cib_perform_op:+  cib epoch=2 
 num_updates=14 admin_epoch=0 validate-with=pacemaker-1.2 
 crm_feature_set=3.0.9 cib-last-written=Fri Mar  7 13:24:07 2014 
 update-origin=vm01 update-client=crmd update-user=hacluster 
 have-quorum=1 dc-uuid=3232261507/
 (   xml.c:1490  )   trace: cib_perform_op:++   configuration
 (   xml.c:1490  )   trace: cib_perform_op:++ crm_config
 
 Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b ,
 
 And now with improved change detection: 
 https://github.com/beekhof/pacemaker/commit/6f364db
 
 I confirmed that the problem where crm_mon does not display updates
 has been solved.

 BTW,
 the following logs have started to appear recently.
 Operation seems unaffected, but is there any problem when these logs appear?
 
 Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
 te_update_diff: Ingoring create operation for /cib 0xf91c10,
 configuration

Thats interesting... is that with the fixes mentioned above?

 
 
 but it looks like crmsh is doing something funny with its updates... does 
 anyone know what command it is running?
 
 The following command invocation was recorded in /var/log/messages:
 
 Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
 cibadmin -p -R --force

I'm somewhat confused at this point: if crmsh is using --replace, then why
is it doing diff calculations?
Or are replace operations only used for the load operation?

 
 I am using crmsh-1.2.6-rc3.
 
 Thanks,
 Yusuke
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 
 -- 
 
 METRO SYSTEMS CO., LTD
 
 Yusuke Iida
 Mail: yusk.i...@gmail.com
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-11 Thread Andrew Beekhof

On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 07.03.2014 10:30, Vladislav Bogdanov wrote:
 07.03.2014 05:43, Andrew Beekhof wrote:
 
 On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 18.02.2014 03:49, Andrew Beekhof wrote:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all
 
 I am measuring the performance of Pacemaker with the following combination:
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2
 
 All nodes are KVM virtual machines.
 
 After starting 14 nodes, I forcibly stopped the vm01 node from the inside;
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes were
 separated from the cluster.

 Corosync then outputs the "Retransmit List:" log message in large quantities.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many
 nodes.
 I can easily imagine this starving corosync of resources and causing
 breakage.
 
 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.
 
 Andrew, the current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the digest
 in the diff by hand with what the cib daemon calculates as expected, it
 applies correctly. Otherwise it fails with -206.
 
 More details?
 
 Hmmm...
 seems to be crmsh-specific,
 Cannot reproduce with pure-XML editing.
 Kristoffer, does 
 http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?
 
 The problem seems to be caused by the fact that crmsh does not provide
 the status section in either the orig or the new XML given to crm_diff,
 and digest generation seems to rely on that, so crm_diff and the cib
 daemon produce different digests.

 Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml)
 relates to the full CIB operation (with the status section included),
 the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that
 section removed, as crmsh does.

 The resulting diffs differ only by digest, and that seems to be the exact issue.

This should help.  As long as crmsh isn't passing -c to crm_diff, then the 
digest will no longer be present.

  https://github.com/beekhof/pacemaker/commit/c8d443d


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-11 Thread Andrew Beekhof

On 12 Mar 2014, at 8:40 am, Andrew Beekhof and...@beekhof.net wrote:

 
 On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 07.03.2014 10:30, Vladislav Bogdanov wrote:
 07.03.2014 05:43, Andrew Beekhof wrote:
 
 On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com 
 wrote:
 
 18.02.2014 03:49, Andrew Beekhof wrote:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all
 
 I am measuring the performance of Pacemaker with the following combination:
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2
 
 All nodes are KVM virtual machines.
 
 After starting 14 nodes, I forcibly stopped the vm01 node from the inside;
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes were
 separated from the cluster.

 Corosync then outputs the "Retransmit List:" log message in large quantities.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many
 nodes.
 I can easily imagine this starving corosync of resources and causing
 breakage.
 
 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.
 
 Andrew, the current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the digest
 in the diff by hand with what the cib daemon calculates as expected, it
 applies correctly. Otherwise it fails with -206.
 
 More details?
 
 Hmmm...
 seems to be crmsh-specific,
 Cannot reproduce with pure-XML editing.
 Kristoffer, does 
 http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?
 
 The problem seems to be caused by the fact that crmsh does not provide
 the status section in either the orig or the new XML given to crm_diff,
 and digest generation seems to rely on that, so crm_diff and the cib
 daemon produce different digests.

 Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml)
 relates to the full CIB operation (with the status section included),
 the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that
 section removed, as crmsh does.

 The resulting diffs differ only by digest, and that seems to be the exact issue.
 
 This should help.  As long as crmsh isn't passing -c to crm_diff, then the 
 digest will no longer be present.
 
  https://github.com/beekhof/pacemaker/commit/c8d443d

Github seems to be doing something weird at the moment... here's the raw patch:

commit c8d443d8d1604dde2727cf716951231ed05926e4
Author: Andrew Beekhof and...@beekhof.net
Date:   Wed Mar 12 08:38:58 2014 +1100

Fix: crm_diff: Allow the generation of xml patchsets without digests

diff --git a/tools/xml_diff.c b/tools/xml_diff.c
index c8673b9..b98859e 100644
--- a/tools/xml_diff.c
+++ b/tools/xml_diff.c
@@ -199,7 +199,7 @@ main(int argc, char **argv)
     xml_calculate_changes(object_1, object_2);
     crm_log_xml_debug(object_2, xml_file_2?xml_file_2:target);
 
-    output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, TRUE);
+    output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, as_cib);
 
     if(as_cib && output) {
         int add[] = { 0, 0, 0 };
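
With that change, the digest is only added when -c/--cib is requested, so
for example (illustrative):

  crm_diff --original orig.xml --new new.xml       # plain patchset, no digest
  crm_diff -c --original orig.xml --new new.xml    # CIB-style patchset, digest included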



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-11 Thread Yusuke Iida
Hi, Andrew
2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
 te_update_diff: Ingoring create operation for /cib 0xf91c10,
 configuration

 Thats interesting... is that with the fixes mentioned above?
I'm sorry, the log above is not output by the newest Pacemaker.
The following logs appear with the latest code:

Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
te_update_diff:  Handling create operation for /cib/configuration
0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
te_update_diff:  Ingoring create operation for /cib/configuration
0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
te_update_diff:  Handling create operation for /cib/configuration
0x1c397a0, rsc_defaults
Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
te_update_diff:  Ingoring create operation for /cib/configuration
0x1c397a0, rsc_defaults

I checked the code of te_update_diff.
Shouldn't the following condition be changed so that changes to fencing-topology
or rsc_defaults are handled as changes under the configuration section?

diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
index dd57660..f97bab5 100644
--- a/crmd/te_callbacks.c
+++ b/crmd/te_callbacks.c
@@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
if(xpath == NULL) {
    /* Version field, ignore */

-} else if(strstr(xpath, "/cib/configuration/")) {
+} else if(strstr(xpath, "/cib/configuration")) {
    abort_transition(INFINITY, tg_restart, "Non-status change", change);

} else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") ||
          safe_str_eq(name, XML_CIB_TAG_TICKETS)) {

What do you think of a change like this?
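
For what it's worth, a tiny standalone check of that condition (plain C,
not pacemaker code): a create notification for fencing-topology or
rsc_defaults arrives with the parent path "/cib/configuration" (no
trailing slash), so matching on "/cib/configuration/" misses it.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *xpath = "/cib/configuration";   /* path seen in the trace above */

    printf("with trailing slash:    %s\n",
           strstr(xpath, "/cib/configuration/") ? "matched" : "missed");
    printf("without trailing slash: %s\n",
           strstr(xpath, "/cib/configuration") ? "matched" : "missed");
    return 0;
}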

I attach the report from this time.
The trace log of te_update_diff is also included.
https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing

Regards,
Yusuke



 but it looks like crmsh is doing something funny with its updates... does 
 anyone know what command it is running?

 The following command invocation was recorded in /var/log/messages:

 Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
 cibadmin -p -R --force

 I'm somewhat confused at this point: if crmsh is using --replace, then why
 is it doing diff calculations?
 Or are replace operations only used for the load operation?




-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-11 Thread Vladislav Bogdanov
12.03.2014 00:37, Andrew Beekhof wrote:
...
 I'm somewhat confused at this point: if crmsh is using --replace, then why
 is it doing diff calculations?
 Or are replace operations only used for the load operation?

It uses one of two methods, depending on the pacemaker version.
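
Roughly, the two approaches are: replace the configuration outright, or
compute a diff with crm_diff and apply it as a patch (illustrative
commands only, not necessarily the exact invocations crmsh uses):

  # method 1: replace the configuration section
  cibadmin -R -o configuration -p < new-configuration.xml

  # method 2: generate and apply a patch
  crm_diff -o current.xml -n new.xml > patch.xml
  cibadmin -P -p < patch.xml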


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-10 Thread Andrew Beekhof

On 7 Mar 2014, at 5:35 pm, Yusuke Iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 2014-03-07 11:43 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 I don't understand... crm_mon doesn't look for changes to resources or 
 constraints and it should already be using the new faster diff format.
 
 [/me reads attachment]
 
 Ah, but perhaps I do understand after all :-)
 
 This is repeated over and over:
 
  notice: crm_diff_update:  [cib_diff_notify] Patch aborted: Application 
 of an update diff failed (-206)
  notice: xml_patch_version_check:  Current num_updates is too high (885 > 67)
 
 That would certainly drive up CPU usage and cause crm_mon to get left behind.
 Happily the fix for that should be: 
 https://github.com/beekhof/pacemaker/commit/6c33820
 
 I think the repeated refresh of the cib when the versions differ no longer occurs.
 Thank you for dealing with it.
 
 Now, I see another problem.
 
 If "crm configure load update" is performed while crm_mon is running,
 information is no longer displayed.
 The information is displayed again if crm_mon is restarted.
 
 I executed the following command and captured the log of crm_mon.
 # crm_mon --disable-ncurses -VV > crm_mon.log 2>&1
 
 I examined the cib information inside crm_mon after the load was performed.

 Two configuration sections exist in the cib after the load.

 It seems this is due to the following processing; the old configuration
 section remains because its deletion failed.
   trace: cib_native_dispatch_internal: cib-reply
 change operation=delete path=/configuration/

 A little further down is the debugging log obtained with an older pacemaker.
 The element is not found because (null) tries to look up
 path=/configuration from the top of the document tree.
 Shouldn't the path essentially be path=/cib/configuration?

Yes.  Could you send me the cib as well as the update you're trying to load?

 
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   (null)
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: cib
 epoch=2 num_updates=6 admin_epoch=0
 validate-with=pacemaker-1.2 crm_feature_set=3.0.9
 cib-last-written=Tue Mar  4 11:32:36 2014
 update-origin=rhel64rpmbuild update-client=crmd have-quorum=1
 dc-uuid=3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   configuration
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: crm_config
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 cluster_property_set id=cib-bootstrap-options
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=cib-bootstrap-options-dc-version name=dc-version
 value=1.1.10-2dbaf19/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=cib-bootstrap-options-cluster-infrastructure
 name=cluster-infrastructure value=corosync/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 /cluster_property_set
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /crm_config
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: nodes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 node id=3232261524 uname=rhel64rpmbuild/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /nodes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: resources/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
 constraints/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
 /configuration
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   status
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 node_state id=3232261524 uname=rhel64rpmbuild in_ccm=true
 crmd=online crm-debug-origin=do_state_transition join=member
 expected=member
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 lrm id=3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 lrm_resources/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   /lrm
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 transient_attributes id=3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 instance_attributes id=status-3232261524
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=status-3232261524-shutdown name=shutdown value=0/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 nvpair id=status-3232261524-probe_complete name=probe_complete
 value=true/
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 /instance_attributes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
 /transient_attributes
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /node_state
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   /status
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /cib
 notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   /(null)
 
 
 Is this an already-known problem?

 I attach the report from when this occurred, and the log of crm_mon.

 - crm_report
 

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-10 Thread Yusuke Iida
Hi, Andrew

I attach the CLI file which was loaded.
Although the loaded xml does not exist as a file, judging from the log I
think it had the following form.
This log is extracted from the following report:
https://drive.google.com/file/d/0BwMFJItoO-fVWEw4Qnp0aHIzSm8/edit?usp=sharing

Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1365  )info:
cib_perform_op: Diff: +++ 0.3.0 (null)
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1438  )info:
cib_perform_op: -- /configuration
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1431  )info:
cib_perform_op: +  /cib:  @epoch=3, @num_updates=0
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1387  )info:
cib_perform_op: ++ /cib:  configuration/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  crm_config
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++cluster_property_set
id=cib-bootstrap-options
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=no-quorum-policy
value=ignore id=cib-bootstrap-options-no-quorum-policy/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=stonith-enabled
value=false id=cib-bootstrap-options-stonith-enabled/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=startup-fencing
value=false id=cib-bootstrap-options-startup-fencing/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=stonith-timeout
value=60s id=cib-bootstrap-options-stonith-timeout/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nvpair name=crmd-transition-delay
value=2s id=cib-bootstrap-options-crmd-transition-delay/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++/cluster_property_set
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  /crm_config
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  nodes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++node id=3232261508 uname=vm02/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++node id=3232261507 uname=vm01/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  /nodes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  resources
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++primitive id=prmDummy class=ocf
provider=heartbeat type=Dummy
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  !--#primitive prmDummy1
ocf:heartbeat:Dummy--/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  !--#location rsc_location-group1-1
group1 # rule 200: #uname eq vm01 # rule 100: #uname eq vm02--/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  instance_attributes
id=prmDummy-instance_attributes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgctl
value=/usr/bin/pg_ctl id=prmDummy-instance_attributes-pgctl/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=start_opt value=-p
5432 -h 192.168.xxx.xxx id=prmDummy-instance_attributes-start_opt/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=psql
value=/usr/bin/psql id=prmDummy-instance_attributes-psql/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgdata
value=/var/lib/pgsql/data id=prmDummy-instance_attributes-pgdata/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgdba
value=postgres id=prmDummy-instance_attributes-pgdba/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgport value=5432
id=prmDummy-instance_attributes-pgport/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++nvpair name=pgdb
value=template1 id=prmDummy-instance_attributes-pgdb/
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  /instance_attributes
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:
cib_perform_op: ++  operations
Mar 07 13:24:14 [2523] vm01cib: (   xml.c:1394  )info:

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-10 Thread Andrew Beekhof
I tried replacing pe-input-2.bz2 with pe-input-3.bz2 and saw:

# cp start.xml 1.xml;  tools/cibadmin --replace --xml-file replace.xml -V
(  cib_file.c:268   )info: cib_file_perform_op_delegate:cib_replace on 
(null)
( cib_utils.c:338   )   trace: cib_perform_op:  Begin cib_replace op
(   cib_ops.c:258   )info: cib_process_replace: Replaced 0.2.14 with 
0.5.7 from (null)
( cib_utils.c:408   )   trace: cib_perform_op:  Inferring changes after 
cib_replace op
(   xml.c:3957  )info: __xml_diff_object:   
transient_attributes.3232261508 moved from 1 to 0 - 15
(   xml.c:3957  )info: __xml_diff_object:   lrm.3232261508 moved 
from 0 to 1 - 7
...
(   xml.c:1363  )info: cib_perform_op:  Diff: --- 0.2.14 2
(   xml.c:1365  )info: cib_perform_op:  Diff: +++ 0.6.0 
e89b8f8986ecf2dfd516fd48f1711fbf
(   xml.c:1431  )info: cib_perform_op:  +  /cib:  @epoch=6, 
@num_updates=0, @cib-last-written=Fri Mar  7 13:24:14 2014
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=no-quorum-policy value=ignore 
id=cib-bootstrap-options-no-quorum-policy/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=stonith-enabled value=false 
id=cib-bootstrap-options-stonith-enabled/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=startup-fencing value=false 
id=cib-bootstrap-options-startup-fencing/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=stonith-timeout value=60s 
id=cib-bootstrap-options-stonith-timeout/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:
  nvpair name=crmd-transition-delay value=2s 
id=cib-bootstrap-options-crmd-transition-delay/
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
!--#primitive prmDummy1 ocf:heartbeat:Dummy--/
(   xml.c:1394  )info: cib_perform_op:  ++  
!--#location rsc_location-group1-1 group1 #rule 200: #uname eq 
vm01 #  rule 100: #uname eq vm02--/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy2 class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy2-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy3 class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy3-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/configuration/resources:  primitive id=prmDummy4 class=ocf 
provider=heartbeat type=Dummy/
(   xml.c:1394  )info: cib_perform_op:  ++  
instance_attributes id=prmDummy4-instance_attributes
...
(   xml.c:1394  )info: cib_perform_op:  ++  
  /primitive
(   xml.c:1387  )info: cib_perform_op:  ++ /cib/configuration:  
rsc_defaults/
(   xml.c:1394  )info: cib_perform_op:  ++
meta_attributes id=rsc-options
(   xml.c:1394  )info: cib_perform_op:  ++  
nvpair name=resource-stickiness value=INFINITY 
id=rsc-options-resource-stickiness/
(   xml.c:1394  )info: cib_perform_op:  ++  
nvpair name=migration-threshold value=1 
id=rsc-options-migration-threshold/
(   xml.c:1394  )info: cib_perform_op:  ++
/meta_attributes
(   xml.c:1394  )info: cib_perform_op:  ++  
/rsc_defaults
(   xml.c:1387  )info: cib_perform_op:  ++ 
/cib/status/node_state[@id='3232261508']/transient_attributes[@id='3232261508']/instance_attributes[@id='status-3232261508']:
  nvpair id=status-3232261508-shutdown name=shutdown value=0/
(   xml.c:1399  )info: cib_perform_op:  +~ 

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-10 Thread Andrew Beekhof

On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:

[snip]

 If I do this however:
 
 # cp start.xml 1.xml;  tools/cibadmin --replace -o configuration --xml-file 
 replace.some -V
 
 I start to see what you see:
 
 (   xml.c:4985  )info: validate_with_relaxng: Creating RNG 
 parser context
 (  cib_file.c:268   )info: cib_file_perform_op_delegate:  cib_replace on 
 configuration
 ( cib_utils.c:338   )   trace: cib_perform_op:Begin cib_replace op
 (   xml.c:1487  )   trace: cib_perform_op:-- /configuration
 (   xml.c:1490  )   trace: cib_perform_op:+  cib epoch=2 
 num_updates=14 admin_epoch=0 validate-with=pacemaker-1.2 
 crm_feature_set=3.0.9 cib-last-written=Fri Mar  7 13:24:07 2014 
 update-origin=vm01 update-client=crmd update-user=hacluster 
 have-quorum=1 dc-uuid=3232261507/
 (   xml.c:1490  )   trace: cib_perform_op:++   configuration
 (   xml.c:1490  )   trace: cib_perform_op:++ crm_config
 
 Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b ,

And now with improved change detection: 
https://github.com/beekhof/pacemaker/commit/6f364db

 but it looks like crmsh is doing something funny with its updates... does 
 anyone know what command it is running?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-07 Thread Kristoffer Grönlund
On Fri, 07 Mar 2014 10:30:13 +0300
Vladislav Bogdanov bub...@hoster-ok.com wrote:

  Andrew, the current git master (ee094a2) almost works; the only issue
  is that crm_diff calculates an incorrect diff digest. If I replace the
  digest in the diff by hand with what the cib daemon calculates as
  expected, it applies correctly. Otherwise it fails with -206.
  
  More details?  
 
 Hmmm...
 seems to be crmsh-specific,
 Cannot reproduce with pure-XML editing.
 Kristoffer, does 
 http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

No, that commit fixes an issue when importing the CIB into crmsh; the
diff calculation happens when going the other way. It seems strange
that crmsh should be causing such a problem: all it does is call
crm_diff to generate the actual diff, so any problem with an incorrect
digest should be coming from crm_diff.

I don't think this is an issue that is known to me, it doesn't sound
like it is the same problem I have been investigating. Could you file a
bug at https://savannah.nongnu.org/bugs/?group=crmsh with some more
details?

Thank you,

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-06 Thread Vladislav Bogdanov
18.02.2014 03:49, Andrew Beekhof wrote:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all

 I am measuring the performance of Pacemaker with the following combination:
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2

 All nodes are KVM virtual machines.

 After starting 14 nodes, I forcibly stopped the vm01 node from the inside;
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes were
 separated from the cluster.

 Corosync then outputs the "Retransmit List:" log message in large quantities.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many nodes.
 I can easily imagine this starving corosync of resources and causing breakage.
 
 I would _highly_ recommend retesting with the current git master of pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.

Andrew, the current git master (ee094a2) almost works; the only issue is
that crm_diff calculates an incorrect diff digest. If I replace the digest
in the diff by hand with what the cib daemon calculates as expected, it
applies correctly. Otherwise it fails with -206.

 
 I'd be interested to hear your feedback.
 

 Why is a node on which no failure has occurred reported as lost?

 Please advise if there is a problem somewhere in the setup.

 I attached the report from when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

 Regards,
 Yusuke
 -- 
  
 METRO SYSTEMS CO., LTD 

 Yusuke Iida 
 Mail: yusk.i...@gmail.com
  
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-06 Thread Kristoffer Grönlund
On Thu, 06 Mar 2014 14:39:46 +0300
Vladislav Bogdanov bub...@hoster-ok.com wrote:

  Probably best to poke the corosync guys about this.
  
  However, <= .11 is known to cause significant CPU usage with that
  many nodes. I can easily imagine this starving corosync of resources
  and causing breakage.
  
  I would _highly_ recommend retesting with the current git master of
  pacemaker. I merged the new cib code last week which is faster by
  _two_ orders of magnitude and uses significantly less CPU.  
 
 Andrew, the current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the digest
 in the diff by hand with what the cib daemon calculates as expected, it
 applies correctly. Otherwise it fails with -206.

Ah! This sounds like the same issue that I am seeing with crmsh.

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-06 Thread Andrew Beekhof

On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 18.02.2014 03:49, Andrew Beekhof wrote:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all
 
 I am measuring the performance of Pacemaker with the following combination:
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2
 
 All nodes are KVM virtual machines.
 
 After starting 14 nodes, I forcibly stopped the vm01 node from the inside;
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes were
 separated from the cluster.

 Corosync then outputs the "Retransmit List:" log message in large quantities.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many nodes.
 I can easily imagine this starving corosync of resources and causing breakage.
 
 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.
 
 Andrew, the current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the digest
 in the diff by hand with what the cib daemon calculates as expected, it
 applies correctly. Otherwise it fails with -206.

More details?

 
 
 I'd be interested to hear your feedback.
 
 
 Why is a node on which no failure has occurred reported as lost?

 Please advise if there is a problem somewhere in the setup.

 I attached the report from when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
 
 Regards,
 Yusuke
 -- 
  
 METRO SYSTEMS CO., LTD 
 
 Yusuke Iida 
 Mail: yusk.i...@gmail.com
  
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-06 Thread Yusuke Iida
Hi, Andrew
2014-03-07 11:43 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 I don't understand... crm_mon doesn't look for changes to resources or 
 constraints and it should already be using the new faster diff format.

 [/me reads attachment]

 Ah, but perhaps I do understand after all :-)

 This is repeated over and over:

   notice: crm_diff_update:  [cib_diff_notify] Patch aborted: Application 
 of an update diff failed (-206)
  notice: xml_patch_version_check:  Current num_updates is too high (885 > 67)

 That would certainly drive up CPU usage and cause crm_mon to get left behind.
 Happily the fix for that should be: 
 https://github.com/beekhof/pacemaker/commit/6c33820

I think the repeated refresh of the cib when the versions differ no longer occurs.
Thank you for dealing with it.

Now, I see another problem.

If "crm configure load update" is performed while crm_mon is running,
information is no longer displayed.
The information is displayed again if crm_mon is restarted.

I executed the following command and captured the log of crm_mon.
# crm_mon --disable-ncurses -VV > crm_mon.log 2>&1

I examined the cib information inside crm_mon after the load was performed.

Two configuration sections exist in the cib after the load.

It seems this is due to the following processing; the old configuration
section remains because its deletion failed.
   trace: cib_native_dispatch_internal: cib-reply
change operation=delete path=/configuration/

A little further down is the debugging log obtained with an older pacemaker.
The element is not found because (null) tries to look up
path=/configuration from the top of the document tree.
Shouldn't the path essentially be path=/cib/configuration?

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   (null)
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <cib epoch="2" num_updates="6" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Tue Mar  4 11:32:36 2014" update-origin="rhel64rpmbuild" update-client="crmd" have-quorum="1" dc-uuid="3232261524">
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   <configuration>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     <crm_config>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:       <cluster_property_set id="cib-bootstrap-options">
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-2dbaf19"/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:       </cluster_property_set>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     </crm_config>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     <nodes>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:       <node id="3232261524" uname="rhel64rpmbuild"/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     </nodes>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     <resources/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     <constraints/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   </configuration>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   <status>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     <node_state id="3232261524" uname="rhel64rpmbuild" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:       <lrm id="3232261524">
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:         <lrm_resources/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:       </lrm>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:       <transient_attributes id="3232261524">
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:         <instance_attributes id="status-3232261524">
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:           <nvpair id="status-3232261524-shutdown" name="shutdown" value="0"/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:           <nvpair id="status-3232261524-probe_complete" name="probe_complete" value="true"/>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:         </instance_attributes>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:       </transient_attributes>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:     </node_state>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   </status>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </cib>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   /(null)
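
To double-check the path handling, here is a small illustration with a
throwaway file (this is only an illustration, not Pacemaker code, and it
assumes a libxml2 whose xmllint supports --xpath): an absolute XPath
such as /configuration is resolved from the document root, so it matches
nothing when the root element is <cib>, while /cib/configuration does.

# cat > /tmp/cib-example.xml <<'EOF'
<cib><configuration/><status/></cib>
EOF
# xmllint --xpath '/configuration' /tmp/cib-example.xml
(reports an empty XPath set)
# xmllint --xpath '/cib/configuration' /tmp/cib-example.xml
<configuration/>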


Is this a known problem?

I attach the crm_report taken when this occurred, together with the crm_mon log.

- crm_report
https://drive.google.com/file/d/0BwMFJItoO-fVWEw4Qnp0aHIzSm8/edit?usp=sharing
- crm_mon.log
https://drive.google.com/file/d/0BwMFJItoO-fVRDRMTGtUUEdBc1E/edit?usp=sharing

Regards,
Yusuke


-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-03-06 Thread Vladislav Bogdanov
07.03.2014 05:43, Andrew Beekhof wrote:
 
 On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 18.02.2014 03:49, Andrew Beekhof wrote:

 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, all

 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2

 All nodes are KVM virtual machines.

 After starting 14 nodes, I forcibly stopped the vm01 node from inside.
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are also
 separated from the cluster.

 The "Retransmit List:" log is then output in large quantities by
 corosync.

 Probably best to poke the corosync guys about this.

 However, <= .11 is known to cause significant CPU usage with that many
 nodes.
 I can easily imagine this starving corosync of resources and causing
 breakage.

 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.

 Andrew, current git master (ee094a2) almost works; the only issue is
 that crm_diff calculates an incorrect diff digest. If I replace the
 digest in the diff by hand with what the cib calculates as expected,
 it applies correctly. Otherwise it fails with -206.
 
 More details?

Hmmm...
this seems to be crmsh-specific;
I cannot reproduce it with pure-XML editing.
Kristoffer, does
http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?
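
For reference, the pure-XML sequence I mean is roughly the following
(file names here are only examples):

# cibadmin --query > cib-orig.xml
# cp cib-orig.xml cib-new.xml            (then edit cib-new.xml directly)
# crm_diff --original cib-orig.xml --new cib-new.xml > patch.xml
# cibadmin --patch patch.xml

Done this way the digest in the patch is accepted and the patch applies;
the -206 failure only shows up when the diff comes via crmsh.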


 


 I'd be interested to hear your feedback.


 What is the reason which the node in which failure has not occurred 
 carries out lost?

 Please advise if there is a problem somewhere in my setup.

 I attached the report when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

 Regards,
 Yusuke
 -- 
  
 METRO SYSTEMS CO., LTD 

 Yusuke Iida 
 Mail: yusk.i...@gmail.com
  


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-20 Thread Andrew Beekhof

On 20 Feb 2014, at 6:06 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 
 I tested in the following environments.
 
 KVM virtual 16 machines
 CPU: 1
 memory: 2048MB
 OS: RHEL6.4
 Pacemaker-1.1.11(709b36b)
 corosync-2.3.2
 libqb-0.16.0
 
 It looks like performance is much better on the whole.
 
 However, during the 16-node test a problem arose in which the queue
 overflowed on some nodes.
 It happened on vm01 and vm09.
 
 The queue overflow on vm01 took place between the cib and crm_mon.
 Feb 20 14:21:02 [16211] vm01cib: (   ipc.c:506   )   trace:
 crm_ipcs_flush_events:  Sent 40 events (729 remaining) for
 0x1cd1850[16243]: Resource temporarily unavailable (-11)
 Feb 20 14:21:02 [16211] vm01cib: (   ipc.c:515   )
 error: crm_ipcs_flush_events:  Evicting slow client 0x1cd1850[16243]:
 event queue reached 729 entries

Who was pid 16243?
Doesn't look like a pacemaker daemon.

 
 The queue overflow on vm09 took place between the cib and stonithd.
 Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:506   )
 trace: crm_ipcs_flush_events:  Sent 36 events (530 remaining) for
 0x105ec10[15520]: Resource temporarily unavailable (-11)
 Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:515   )
 error: crm_ipcs_flush_events:  Evicting slow client 0x105ec10[15520]:
 event queue reached 530 entries
 
 Although I checked the code around the problem area, I could not work
 out how it should be solved.

 Is sending only 100 messages at a time too few?
 Is there a problem with how the waiting time after message transmission
 is calculated?
 Is the threshold of 500 too low?

being 500 behind is really quite a long way.

 
 I attach the crm_report taken when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVeGZuWkFnZTFWTDQ/edit?usp=sharing
 
 Regards,
 Yusuke
 2014-02-18 19:53 GMT+09:00 yusuke iida yusk.i...@gmail.com:
 Hi, Andrew and Digimer
 
 Thank you for the comment.
 
 I solved this problem with reference to another mailing list thread:
 https://bugzilla.redhat.com/show_bug.cgi?id=880035

 In short, it turned out that the kernel in my environment was old.
 It has now been updated to the newest kernel,
 kernel-2.6.32-431.5.1.el6.x86_64.rpm.

 The following parameters are now set on the bridge that carries the
 corosync traffic.
 As a result, "Retransmit List" messages hardly occur any more.
 # echo 1 > /sys/class/net/bridge/bridge/multicast_querier
 # echo 0 > /sys/class/net/bridge/bridge/multicast_snooping
 
 2014-02-18 9:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all
 
 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2
 
 All nodes are KVM virtual machines.
 
 After starting 14 nodes, I forcibly stopped the vm01 node from inside.
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are also
 separated from the cluster.

 The "Retransmit List:" log is then output in large quantities by
 corosync.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many
 nodes.
 I can easily imagine this starving corosync of resources and causing
 breakage.
 
 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.
 
 I'd be interested to hear your feedback.
 Since I am very interested in this, I would like to test it, even
 though the Retransmit List problem has been solved.
 Please wait a little for the results.
 
 Thanks,
 Yusuke
 
 
 
 What is the reason which the node in which failure has not occurred 
 carries out lost?
 
 Please advise if there is a problem somewhere in my setup.
 
 I attached the report when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
 
 Regards,
 Yusuke
 --
 
 METRO SYSTEMS CO., LTD
 
 Yusuke Iida
 Mail: yusk.i...@gmail.com
 
 
 
 
 
 --
 
 METRO SYSTEMS CO., LTD
 
 Yusuke Iida
 Mail: yusk.i...@gmail.com
 

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-20 Thread yusuke iida
Hi, Andrew

2014-02-20 17:28 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 Who was pid 16243?
 Doesn't look like a pacemaker daemon.
pid 16243 is crm_mon.
crm_mon was started on vm01 to check the cluster state.

If any other information is required for the analysis, I will get it.

Regards,
Yusuke


 The queue overflow on vm09 took place between the cib and stonithd.
 Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:506   )
 trace: crm_ipcs_flush_events:  Sent 36 events (530 remaining) for
 0x105ec10[15520]: Resource temporarily unavailable (-11)
 Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:515   )
 error: crm_ipcs_flush_events:  Evicting slow client 0x105ec10[15520]:
 event queue reached 530 entries

 Although I checked the code around the problem area, I could not work
 out how it should be solved.

 Is sending only 100 messages at a time too few?
 Is there a problem with how the waiting time after message transmission
 is calculated?
 Is the threshold of 500 too low?

 being 500 behind is really quite a long way.




-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com




Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-20 Thread Andrew Beekhof

On 20 Feb 2014, at 8:39 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, Andrew
 
 2014-02-20 17:28 GMT+09:00 Andrew Beekhof and...@beekhof.net:
 Who was pid 16243?
 Doesn't look like a pacemaker daemon.
 pid 16243 is crm_mon.

That means that the state displayed by crm_mon was > 500 updates behind.
At that point, what it's displaying is horribly out of date and evicting it
seems like a pretty good idea.

 crm_mon was started on vm01 to check the cluster state.

 If any other information is required for the analysis, I will get it.

Some idea of what crm_mon is doing would be a good start.
Adding a few -V options in addition to --disable-ncurses might be the best 
approach.
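
For example, something along these lines (the log file location is just
an example):

# crm_mon --disable-ncurses -VV > /tmp/crm_mon.log 2>&1 &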

 
 Regards,
 Yusuke
 
 
 The queue overflow on vm09 took place between the cib and stonithd.
 Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:506   )
 trace: crm_ipcs_flush_events:  Sent 36 events (530 remaining) for
 0x105ec10[15520]: Resource temporarily unavailable (-11)
 Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:515   )
 error: crm_ipcs_flush_events:  Evicting slow client 0x105ec10[15520]:
 event queue reached 530 entries
 
 Although I checked the code around the problem area, I could not work
 out how it should be solved.

 Is sending only 100 messages at a time too few?
 Is there a problem with how the waiting time after message transmission
 is calculated?
 Is the threshold of 500 too low?
 
 being 500 behind is really quite a long way.
 
 
 
 
 -- 
 
 METRO SYSTEMS CO., LTD
 
 Yusuke Iida
 Mail: yusk.i...@gmail.com
 
 


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-18 Thread Vladislav Bogdanov
18.02.2014 03:49, Andrew Beekhof wrote:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all

 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2

 All nodes are KVM virtual machines.

 After starting 14 nodes, I forcibly stopped the vm01 node from inside.
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are also
 separated from the cluster.

 The "Retransmit List:" log is then output in large quantities by
 corosync.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many nodes.
 I can easily imagine this starving corosync of resources and causing breakage.
 
 I would _highly_ recommend retesting with the current git master of pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.

Andrew, you mean your cib-performance branch, am I correct?

Unfortunately it is not in .11 (sorry if I overlooked it there), and
not even in ClusterLabs/master yet; it seems to have been merged and then
reverted in beekhof/master...


 
 I'd be interested to hear your feedback.
 

 What is the reason which the node in which failure has not occurred carries 
 out lost?

 Please advise if there is a problem somewhere in my setup.

 I attached the report when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

 Regards,
 Yusuke
 -- 
  
 METRO SYSTEMS CO., LTD 

 Yusuke Iida 
 Mail: yusk.i...@gmail.com
  


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-18 Thread Andrew Beekhof

On 18 Feb 2014, at 7:40 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 18.02.2014 03:49, Andrew Beekhof wrote:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all
 
 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2
 
 All nodes are KVM virtual machines.
 
 After starting 14 nodes, I forcibly stopped the vm01 node from inside.
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are also
 separated from the cluster.

 The "Retransmit List:" log is then output in large quantities by
 corosync.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many nodes.
 I can easily imagine this starving corosync of resources and causing breakage.
 
 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.
 
 Andrew, you mean your cib-performance branch, am I correct?

Yes

 
 Unfortunately it is not in .11

Intentionally so :)

 (sorry if I overlooked it there), and
 not even in ClusterLabs/master yet; it seems to have been merged and then
 reverted in beekhof/master...

This has just been brought to my attention :-(

https://github.com/beekhof/pacemaker/commit/1d98f6fd9eb76bd2498bc6356a3aa6e91a8a70e4#commitcomment-5405620

Give me a few minutes and i'll correct it

 
 
 
 I'd be interested to hear your feedback.
 
 
 What is the reason which the node in which failure has not occurred carries 
 out lost?
 
 Please advise if there is a problem somewhere in my setup.
 
 I attached the report when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
 
 Regards,
 Yusuke
 -- 
  
 METRO SYSTEMS CO., LTD 
 
 Yusuke Iida 
 Mail: yusk.i...@gmail.com
  


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-18 Thread Andrew Beekhof

On 18 Feb 2014, at 8:18 pm, Andrew Beekhof and...@beekhof.net wrote:

 
 On 18 Feb 2014, at 7:40 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
 
 18.02.2014 03:49, Andrew Beekhof wrote:
 
 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
 
 Hi, all
 
 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2
 
 All nodes are KVM virtual machines.
 
 After starting 14 nodes, I forcibly stopped the vm01 node from inside.
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are also
 separated from the cluster.

 The "Retransmit List:" log is then output in large quantities by
 corosync.
 
 Probably best to poke the corosync guys about this.
 
 However, <= .11 is known to cause significant CPU usage with that many
 nodes.
 I can easily imagine this starving corosync of resources and causing
 breakage.
 
 I would _highly_ recommend retesting with the current git master of 
 pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.
 
 Andrew, you mean your cib-performance branch, am I correct?
 
 Yes
 
 
 Unfortunately it is not in .11
 
 Intentionally so :)
 
 (sorry if I overlooked it there), and
 not even in ClusterLabs/master yet; it seems to have been merged and then
 reverted in beekhof/master...
 
 This has just been brought to my attention :-(
 
 https://github.com/beekhof/pacemaker/commit/1d98f6fd9eb76bd2498bc6356a3aa6e91a8a70e4#commitcomment-5405620
 
 Give me a few minutes and i'll correct it

Ok, I've force pushed a tree without the above screwup.
I'll merge it into ClusterLabs tomorrow.

 
 
 
 
 I'd be interested to hear your feedback.
 
 
 What is the reason which the node in which failure has not occurred 
 carries out lost?
 
 Please advise if there is a problem somewhere in my setup.
 
 I attached the report when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
 
 Regards,
 Yusuke
 -- 
  
 METRO SYSTEMS CO., LTD 
 
 Yusuke Iida 
 Mail: yusk.i...@gmail.com
  


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-18 Thread yusuke iida
Hi, Andrew and Digimer

Thank you for the comment.

I solved this problem with reference to another mailing list thread:
https://bugzilla.redhat.com/show_bug.cgi?id=880035

In short, it turned out that the kernel in my environment was old.
It has now been updated to the newest kernel,
kernel-2.6.32-431.5.1.el6.x86_64.rpm.

The following parameters are now set on the bridge that carries the
corosync traffic.
As a result, "Retransmit List" messages hardly occur any more.
# echo 1 > /sys/class/net/bridge/bridge/multicast_querier
# echo 0 > /sys/class/net/bridge/bridge/multicast_snooping
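
If it helps, a sketch of making these settings persistent across reboots
on RHEL 6 (this assumes the bridge device really is named "bridge", as
above, and uses /etc/rc.local; adjust to your setup):

# cat >> /etc/rc.local <<'EOF'
echo 1 > /sys/class/net/bridge/bridge/multicast_querier
echo 0 > /sys/class/net/bridge/bridge/multicast_snooping
EOF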

2014-02-18 9:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:

 On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, all

 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2

 All nodes are KVM virtual machines.

 After starting 14 nodes, I forcibly stopped the vm01 node from inside.
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are also
 separated from the cluster.

 The "Retransmit List:" log is then output in large quantities by
 corosync.

 Probably best to poke the corosync guys about this.

 However, <= .11 is known to cause significant CPU usage with that many nodes.
 I can easily imagine this starving corosync of resources and causing breakage.

 I would _highly_ recommend retesting with the current git master of pacemaker.
 I merged the new cib code last week which is faster by _two_ orders of 
 magnitude and uses significantly less CPU.

 I'd be interested to hear your feedback.
Since I am very interested in this, I would like to test it, even
though the Retransmit List problem has been solved.
Please wait a little for the results.

Thanks,
Yusuke



 What is the reason which the node in which failure has not occurred carries 
 out lost?

 Please advise if there is a problem somewhere in my setup.

 I attached the report when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

 Regards,
 Yusuke
 --
 
 METRO SYSTEMS CO., LTD

 Yusuke Iida
 Mail: yusk.i...@gmail.com
 




-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com




Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?

2014-02-17 Thread Andrew Beekhof

On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

 Hi, all
 
 I measure the performance of Pacemaker in the following combinations.
 Pacemaker-1.1.11.rc1
 libqb-0.16.0
 corosync-2.3.2
 
 All nodes are KVM virtual machines.
 
 After starting 14 nodes, I forcibly stopped the vm01 node from inside.
 virsh destroy vm01 was used for the stop.
 Then, in addition to the forcibly stopped node, other nodes are also
 separated from the cluster.

 The "Retransmit List:" log is then output in large quantities by
 corosync.

Probably best to poke the corosync guys about this.

However, <= .11 is known to cause significant CPU usage with that many nodes.
I can easily imagine this starving corosync of resources and causing breakage.

I would _highly_ recommend retesting with the current git master of pacemaker.
I merged the new cib code last week which is faster by _two_ orders of 
magnitude and uses significantly less CPU.

I'd be interested to hear your feedback.
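
For anyone who wants to retest from git, the build is the usual
autotools sequence (the URL below is the main repository; use whichever
tree carries the new cib code as discussed elsewhere in this thread, and
adjust configure options to your environment):

# git clone https://github.com/ClusterLabs/pacemaker.git
# cd pacemaker
# ./autogen.sh && ./configure && make && make install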

 
 What is the reason which the node in which failure has not occurred carries 
 out lost?
 
 Please advise if there is a problem somewhere in my setup.
 
 I attached the report when the problem occurred.
 https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
 
 Regards,
 Yusuke
 -- 
  
 METRO SYSTEMS CO., LTD 
 
 Yusuke Iida 
 Mail: yusk.i...@gmail.com
  