Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

crm_mon still has logic that fetches the newest cib whenever pcmk_err_old_data is received. Since this logic can be considered unnecessary, just like the equivalent logic that was already changed in stonithd, I have corrected it. Please merge the following, if it is satisfactory.

https://github.com/ClusterLabs/pacemaker/pull/477

Regards,
Yusuke

2014-03-18 9:56 GMT+09:00 Andrew Beekhof and...@beekhof.net:
> On 12 Mar 2014, at 1:45 pm, Yusuke Iida yusk.i...@gmail.com wrote:
>> Hi, Andrew
>> 2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:
>>>> Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration
>>> That's interesting... is that with the fixes mentioned above?
>> I'm sorry, that log is not output by the newest Pacemaker. The following logs come out with the newest code:
>>
>> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c37c60, fencing-topology
>> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c37c60, fencing-topology
>> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c397a0, rsc_defaults
>> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c397a0, rsc_defaults
>>
>> I checked the code of te_update_diff. Shouldn't the following check be changed, so that a change to fencing-topology or rsc_defaults is processed as a change under the configuration section?
>
> Perfect! https://github.com/beekhof/pacemaker/commit/1c285ac
>
> Thanks to everyone for giving the new CIB a pounding, we should be in very good shape for a release soon :-)
>
>> diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
>> index dd57660..f97bab5 100644
>> --- a/crmd/te_callbacks.c
>> +++ b/crmd/te_callbacks.c
>> @@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
>>
>>      if(xpath == NULL) {
>>          /* Version field, ignore */
>> -    } else if(strstr(xpath, "/cib/configuration/")) {
>> +    } else if(strstr(xpath, "/cib/configuration")) {
>>          abort_transition(INFINITY, tg_restart, "Non-status change", change);
>>      } else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") || safe_str_eq(name, XML_CIB_TAG_TICKETS)) {
>>
>> How about a change like this?
>>
>> I attach the report from this run; the trace log of te_update_diff is also included.
>> https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing
>>
>> Regards, Yusuke
>>
>>>>> but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?
>>>> The execution result of the following command remained in /var/log/messages:
>>>> Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force
>>> I'm somewhat confused at this point... if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations only for the load operation?
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
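The one-character change in the patch above is easy to miss, so here is a minimal, self-contained C sketch of just the matching behaviour it fixes (illustrative only, not Pacemaker code; the xpaths are taken from the logs quoted above). A create of a direct child of the configuration section, such as fencing-topology or rsc_defaults, arrives with the parent's xpath "/cib/configuration", which does not contain the substring "/cib/configuration/" with a trailing slash:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* xpaths as seen in the te_update_diff trace logs */
    const char *xpaths[] = {
        "/cib/configuration",              /* create of fencing-topology etc. */
        "/cib/configuration/crm_config",   /* change below the configuration section */
    };

    for (int i = 0; i < 2; i++) {
        printf("%-32s  old test: %-5s  new test: %s\n", xpaths[i],
               strstr(xpaths[i], "/cib/configuration/") ? "match" : "miss",
               strstr(xpaths[i], "/cib/configuration")  ? "match" : "miss");
    }
    return 0;
}

The first xpath is missed by the old test but caught by the new one, which is exactly why the transition was not aborted for fencing-topology and rsc_defaults creates.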
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 12 Mar 2014, at 1:45 pm, Yusuke Iida yusk.i...@gmail.com wrote:
> Hi, Andrew
> 2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:
>>> Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration
>> That's interesting... is that with the fixes mentioned above?
> I'm sorry, that log is not output by the newest Pacemaker. The following logs come out with the newest code:
>
> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c37c60, fencing-topology
> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c37c60, fencing-topology
> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c397a0, rsc_defaults
> Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c397a0, rsc_defaults
>
> I checked the code of te_update_diff. Shouldn't the following check be changed, so that a change to fencing-topology or rsc_defaults is processed as a change under the configuration section?

Perfect! https://github.com/beekhof/pacemaker/commit/1c285ac

Thanks to everyone for giving the new CIB a pounding, we should be in very good shape for a release soon :-)

> diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
> index dd57660..f97bab5 100644
> --- a/crmd/te_callbacks.c
> +++ b/crmd/te_callbacks.c
> @@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
>
>      if(xpath == NULL) {
>          /* Version field, ignore */
> -    } else if(strstr(xpath, "/cib/configuration/")) {
> +    } else if(strstr(xpath, "/cib/configuration")) {
>          abort_transition(INFINITY, tg_restart, "Non-status change", change);
>      } else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") || safe_str_eq(name, XML_CIB_TAG_TICKETS)) {
>
> How about a change like this?
>
> I attach the report from this run; the trace log of te_update_diff is also included.
> https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing
>
> Regards, Yusuke
>
>>>> but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?
>>> The execution result of the following command remained in /var/log/messages:
>>> Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force
>> I'm somewhat confused at this point... if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations only for the load operation?
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
12.03.2014 00:40, Andrew Beekhof wrote:
> On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> 07.03.2014 10:30, Vladislav Bogdanov wrote:
>>> 07.03.2014 05:43, Andrew Beekhof wrote:
>>>> On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>>>>> 18.02.2014 03:49, Andrew Beekhof wrote:
>>>>>> On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
>>>>>>> Hi, all
>>>>>>> I measure the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. I stopped the vm01 node forcibly from the inside after starting 14 nodes; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes are separated from the cluster. "Retransmit List:" logs are then output in large quantities from corosync.
>>>>>> Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
>>>>> Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.
>>>> More details?
>>> Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing. Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?
>> The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests. Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml) relates to the full CIB operation (with the status section included), the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that section removed, as crmsh does. The resulting diffs differ only by digest, and that seems to be the exact issue.
> This should help. As long as crmsh isn't passing -c to crm_diff, the digest will no longer be present.
> https://github.com/beekhof/pacemaker/commit/c8d443d

Yep, that helped. Thank you!
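For illustration, here is a toy sketch of why the digests diverge under the explanation given above (crm_diff hashing a document without its status section while the cib daemon validates against a digest of the full document). The hash below is a deliberately simple stand-in, not Pacemaker's real digest, which is an MD5 of the canonicalized XML:

#include <stdio.h>

/* Toy stand-in for the real digest: only meant to show that hashing a
 * document with and without its <status> section yields different
 * values, so a patch digest computed from status-less input can never
 * match the digest the cib daemon computes from the full CIB. */
static unsigned long toy_digest(const char *s)
{
    unsigned long h = 5381;

    for (; *s; s++)
        h = h * 33 + (unsigned char)*s;
    return h;
}

int main(void)
{
    const char *full_cib =
        "<cib><configuration/><status><node_state/></status></cib>";
    const char *no_status =
        "<cib><configuration/></cib>";

    printf("digest(full document)   = %lx\n", toy_digest(full_cib));
    printf("digest(without status)  = %lx\n", toy_digest(no_status));
    return 0;
}

The two values differ, which is the shape of the -206 failure described in the thread; the c8d443d fix sidesteps it by not emitting a digest at all unless crm_diff is asked for CIB-style output with -c.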
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
07.03.2014 10:30, Vladislav Bogdanov wrote:
> 07.03.2014 05:43, Andrew Beekhof wrote:
>> On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>>> 18.02.2014 03:49, Andrew Beekhof wrote:
>>>> On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
>>>>> Hi, all
>>>>> I measure the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. I stopped the vm01 node forcibly from the inside after starting 14 nodes; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes are separated from the cluster. "Retransmit List:" logs are then output in large quantities from corosync.
>>>> Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
>>> Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.
>> More details?
> Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing. Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests.

Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml) relates to the full CIB operation (with the status section included), the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that section removed, as crmsh does. The resulting diffs differ only by digest, and that seems to be the exact issue.
<cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="booter-0"/>
      <node id="2" uname="booter-1"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
  <status>
    <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="1">
        <lrm_resources/>
      </lrm>
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
          <nvpair id="status-1-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>

<cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="booter-0"/>
      <node id="2" uname="booter-1"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
</cib>

<cib epoch="3" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="booter-0"/>
      <node id="2" uname="booter-1"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
  <status>
    <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member"
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-03-11 14:21 GMT+09:00 Andrew Beekhof and...@beekhof.net:
> On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:
>> [snip]
>> If I do this however:
>> # cp start.xml 1.xml; tools/cibadmin --replace -o configuration --xml-file replace.some -V
>> I start to see what you see:
>> ( xml.c:4985 ) info: validate_with_relaxng: Creating RNG parser context
>> ( cib_file.c:268 ) info: cib_file_perform_op_delegate: cib_replace on configuration
>> ( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
>> ( xml.c:1487 ) trace: cib_perform_op: -- /configuration
>> ( xml.c:1490 ) trace: cib_perform_op: + <cib epoch="2" num_updates="14" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Fri Mar 7 13:24:07 2014" update-origin="vm01" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="3232261507"/>
>> ( xml.c:1490 ) trace: cib_perform_op: ++ <configuration>
>> ( xml.c:1490 ) trace: cib_perform_op: ++ <crm_config>
>> Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b
> And now with improved change detection: https://github.com/beekhof/pacemaker/commit/6f364db

I verified that the problem where crm_mon does not display updates has been solved.

BTW, the following log has started to appear recently. Operation seems unaffected, but is it a problem when this log comes out?

Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

> but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?

The execution result of the following command remained in /var/log/messages:

Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I am using crmsh-1.2.6-rc3.

Thanks,
Yusuke

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 6:51 pm, Yusuke Iida yusk.i...@gmail.com wrote:
> Hi, Andrew
> 2014-03-11 14:21 GMT+09:00 Andrew Beekhof and...@beekhof.net:
>> On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:
>>> [snip]
>>> If I do this however:
>>> # cp start.xml 1.xml; tools/cibadmin --replace -o configuration --xml-file replace.some -V
>>> I start to see what you see:
>>> ( xml.c:4985 ) info: validate_with_relaxng: Creating RNG parser context
>>> ( cib_file.c:268 ) info: cib_file_perform_op_delegate: cib_replace on configuration
>>> ( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
>>> ( xml.c:1487 ) trace: cib_perform_op: -- /configuration
>>> ( xml.c:1490 ) trace: cib_perform_op: + <cib epoch="2" num_updates="14" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Fri Mar 7 13:24:07 2014" update-origin="vm01" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="3232261507"/>
>>> ( xml.c:1490 ) trace: cib_perform_op: ++ <configuration>
>>> ( xml.c:1490 ) trace: cib_perform_op: ++ <crm_config>
>>> Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b
>> And now with improved change detection: https://github.com/beekhof/pacemaker/commit/6f364db
> I verified that the problem where crm_mon does not display updates has been solved.
>
> BTW, the following log has started to appear recently. Operation seems unaffected, but is it a problem when this log comes out?
>
> Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

That's interesting... is that with the fixes mentioned above?

>> but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?
> The execution result of the following command remained in /var/log/messages:
> Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I'm somewhat confused at this point... if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations only for the load operation?

> I am using crmsh-1.2.6-rc3.
> Thanks, Yusuke
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
> 07.03.2014 10:30, Vladislav Bogdanov wrote:
>> 07.03.2014 05:43, Andrew Beekhof wrote:
>>> On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>>>> 18.02.2014 03:49, Andrew Beekhof wrote:
>>>>> On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
>>>>>> Hi, all
>>>>>> I measure the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. I stopped the vm01 node forcibly from the inside after starting 14 nodes; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes are separated from the cluster. "Retransmit List:" logs are then output in large quantities from corosync.
>>>>> Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
>>>> Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.
>>> More details?
>> Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing. Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?
> The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests. Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml) relates to the full CIB operation (with the status section included), the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that section removed, as crmsh does. The resulting diffs differ only by digest, and that seems to be the exact issue.

This should help. As long as crmsh isn't passing -c to crm_diff, the digest will no longer be present.

https://github.com/beekhof/pacemaker/commit/c8d443d
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 12 Mar 2014, at 8:40 am, Andrew Beekhof and...@beekhof.net wrote:
> On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> 07.03.2014 10:30, Vladislav Bogdanov wrote:
>>> 07.03.2014 05:43, Andrew Beekhof wrote:
>>>> On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>>>>> 18.02.2014 03:49, Andrew Beekhof wrote:
>>>>>> On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
>>>>>>> Hi, all
>>>>>>> I measure the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. I stopped the vm01 node forcibly from the inside after starting 14 nodes; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes are separated from the cluster. "Retransmit List:" logs are then output in large quantities from corosync.
>>>>>> Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
>>>>> Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.
>>>> More details?
>>> Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing. Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?
>> The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests. Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml) relates to the full CIB operation (with the status section included), the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that section removed, as crmsh does. The resulting diffs differ only by digest, and that seems to be the exact issue.
> This should help. As long as crmsh isn't passing -c to crm_diff, the digest will no longer be present.
> https://github.com/beekhof/pacemaker/commit/c8d443d

Github seems to be doing something weird at the moment... here's the raw patch:

commit c8d443d8d1604dde2727cf716951231ed05926e4
Author: Andrew Beekhof and...@beekhof.net
Date:   Wed Mar 12 08:38:58 2014 +1100

    Fix: crm_diff: Allow the generation of xml patchsets without digests

diff --git a/tools/xml_diff.c b/tools/xml_diff.c
index c8673b9..b98859e 100644
--- a/tools/xml_diff.c
+++ b/tools/xml_diff.c
@@ -199,7 +199,7 @@ main(int argc, char **argv)
         xml_calculate_changes(object_1, object_2);
         crm_log_xml_debug(object_2, xml_file_2?xml_file_2:"target");

-        output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, TRUE);
+        output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, as_cib);

         if(as_cib && output) {
             int add[] = { 0, 0, 0 };
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:
>> Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration
> That's interesting... is that with the fixes mentioned above?

I'm sorry, that log is not output by the newest Pacemaker. The following logs come out with the newest code:

Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c397a0, rsc_defaults
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c397a0, rsc_defaults

I checked the code of te_update_diff. Shouldn't the following check be changed, so that a change to fencing-topology or rsc_defaults is processed as a change under the configuration section?

diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
index dd57660..f97bab5 100644
--- a/crmd/te_callbacks.c
+++ b/crmd/te_callbacks.c
@@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)

     if(xpath == NULL) {
         /* Version field, ignore */
-    } else if(strstr(xpath, "/cib/configuration/")) {
+    } else if(strstr(xpath, "/cib/configuration")) {
         abort_transition(INFINITY, tg_restart, "Non-status change", change);
     } else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") || safe_str_eq(name, XML_CIB_TAG_TICKETS)) {

How about a change like this?

I attach the report from this run; the trace log of te_update_diff is also included.
https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing

Regards, Yusuke

>>> but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?
>> The execution result of the following command remained in /var/log/messages:
>> Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force
> I'm somewhat confused at this point... if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations only for the load operation?

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
12.03.2014 00:37, Andrew Beekhof wrote:
...
> I'm somewhat confused at this point... if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations only for the load operation?

It uses one of two methods, depending on the pacemaker version.
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 7 Mar 2014, at 5:35 pm, Yusuke Iida yusk.i...@gmail.com wrote:
> Hi, Andrew
> 2014-03-07 11:43 GMT+09:00 Andrew Beekhof and...@beekhof.net:
>> I don't understand... crm_mon doesn't look for changes to resources or constraints and it should already be using the new faster diff format.
>> [/me reads attachment]
>> Ah, but perhaps I do understand after all :-)
>> This is repeated over and over:
>> notice: crm_diff_update: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> notice: xml_patch_version_check: Current num_updates is too high (885 > 67)
>> That would certainly drive up CPU usage and cause crm_mon to get left behind. Happily the fix for that should be: https://github.com/beekhof/pacemaker/commit/6c33820
> I verified that the refresh of the cib is no longer repeated when the versions differ. Thank you for dealing with this.
>
> Now, I see another problem. If "crm configure load update" is performed while crm_mon is running, information is no longer displayed. Information is displayed again if crm_mon is restarted.
> I executed the following command and took the log of crm_mon:
> # crm_mon --disable-ncurses -VV > crm_mon.log 2>&1
> I am observing the cib information inside crm_mon after the load was performed. Two configuration sections exist in the cib after the load. It seems to come from the following processing; the old section remains because the deletion of the configuration section failed:
> trace: cib_native_dispatch_internal: cib-reply ... <change operation="delete" path="/configuration"/>
> A little further down is a debug log acquired with an older pacemaker. The lookup ends in (null) because it tries to find path=/configuration from the top of the document tree. Shouldn't path essentially be path=/cib/configuration?

Yes. Could you send me the cib as well as the update you're trying to load?

> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: (null)
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <cib epoch="2" num_updates="6" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Tue Mar 4 11:32:36 2014" update-origin="rhel64rpmbuild" update-client="crmd" have-quorum="1" dc-uuid="3232261524">
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <configuration>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <crm_config>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <cluster_property_set id="cib-bootstrap-options">
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-2dbaf19"/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </cluster_property_set>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </crm_config>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nodes>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <node id="3232261524" uname="rhel64rpmbuild"/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </nodes>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <resources/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <constraints/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </configuration>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <status>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <node_state id="3232261524" uname="rhel64rpmbuild" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <lrm id="3232261524">
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <lrm_resources/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </lrm>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <transient_attributes id="3232261524">
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <instance_attributes id="status-3232261524">
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="status-3232261524-shutdown" name="shutdown" value="0"/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="status-3232261524-probe_complete" name="probe_complete" value="true"/>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </instance_attributes>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </transient_attributes>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </node_state>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </status>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </cib>
> notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /(null)
> Is this an already-known problem? I attach the report from when this occurred, and the log of crm_mon.
> - crm_report
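The "num_updates is too high (885 > 67)" failure quoted above has a simple shape, sketched below with hypothetical code (not the real xml_patch_version_check): a patch records the CIB version it was generated against, and a client whose local copy is already ahead of that version can never apply it, so instead of retrying the patch forever it should fall back to requesting a full CIB refresh, which is the behaviour the linked fix restores for crm_mon:

#include <stdio.h>

/* CIB versions compare lexicographically on this triple. */
struct cib_version { int admin_epoch, epoch, num_updates; };

static int cmp(struct cib_version a, struct cib_version b)
{
    if (a.admin_epoch != b.admin_epoch) return a.admin_epoch - b.admin_epoch;
    if (a.epoch != b.epoch)             return a.epoch - b.epoch;
    return a.num_updates - b.num_updates;
}

int main(void)
{
    struct cib_version local  = {0, 2, 885};  /* what crm_mon holds           */
    struct cib_version source = {0, 2, 67};   /* what the patch was made from */

    if (cmp(local, source) > 0) {
        /* Applying the diff is hopeless: resync instead of looping. */
        puts("local copy too new for this patch: request full refresh");
    } else {
        puts("apply patch");
    }
    return 0;
}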
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

I attach the CLI file which was loaded. The loaded xml does not exist as a file, but from the log I think it has the following form. This log is extracted from the following report:
https://drive.google.com/file/d/0BwMFJItoO-fVWEw4Qnp0aHIzSm8/edit?usp=sharing

Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1365 ) info: cib_perform_op: Diff: +++ 0.3.0 (null)
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1438 ) info: cib_perform_op: -- /configuration
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1431 ) info: cib_perform_op: + /cib: @epoch=3, @num_updates=0
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1387 ) info: cib_perform_op: ++ /cib: <configuration/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <crm_config>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <cluster_property_set id="cib-bootstrap-options">
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="startup-fencing" value="false" id="cib-bootstrap-options-startup-fencing"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="stonith-timeout" value="60s" id="cib-bootstrap-options-stonith-timeout"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="crmd-transition-delay" value="2s" id="cib-bootstrap-options-crmd-transition-delay"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ </cluster_property_set>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ </crm_config>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nodes>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <node id="3232261508" uname="vm02"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <node id="3232261507" uname="vm01"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ </nodes>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <resources>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <primitive id="prmDummy" class="ocf" provider="heartbeat" type="Dummy">
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <!--#primitive prmDummy1 ocf:heartbeat:Dummy-->
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <!--#location rsc_location-group1-1 group1 # rule 200: #uname eq vm01 # rule 100: #uname eq vm02-->
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <instance_attributes id="prmDummy-instance_attributes">
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="pgctl" value="/usr/bin/pg_ctl" id="prmDummy-instance_attributes-pgctl"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="start_opt" value="-p 5432 -h 192.168.xxx.xxx" id="prmDummy-instance_attributes-start_opt"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="psql" value="/usr/bin/psql" id="prmDummy-instance_attributes-psql"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="pgdata" value="/var/lib/pgsql/data" id="prmDummy-instance_attributes-pgdata"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="pgdba" value="postgres" id="prmDummy-instance_attributes-pgdba"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="pgport" value="5432" id="prmDummy-instance_attributes-pgport"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="pgdb" value="template1" id="prmDummy-instance_attributes-pgdb"/>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ </instance_attributes>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info: cib_perform_op: ++ <operations>
Mar 07 13:24:14 [2523] vm01 cib: ( xml.c:1394 ) info:
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
I tried replacing pe-input-2.bz2 with pe-input-3.bz2 and saw:

# cp start.xml 1.xml; tools/cibadmin --replace --xml-file replace.xml -V
( cib_file.c:268 ) info: cib_file_perform_op_delegate: cib_replace on (null)
( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
( cib_ops.c:258 ) info: cib_process_replace: Replaced 0.2.14 with 0.5.7 from (null)
( cib_utils.c:408 ) trace: cib_perform_op: Inferring changes after cib_replace op
( xml.c:3957 ) info: __xml_diff_object: transient_attributes.3232261508 moved from 1 to 0 - 15
( xml.c:3957 ) info: __xml_diff_object: lrm.3232261508 moved from 0 to 1 - 7
...
( xml.c:1363 ) info: cib_perform_op: Diff: --- 0.2.14 2
( xml.c:1365 ) info: cib_perform_op: Diff: +++ 0.6.0 e89b8f8986ecf2dfd516fd48f1711fbf
( xml.c:1431 ) info: cib_perform_op: + /cib: @epoch=6, @num_updates=0, @cib-last-written=Fri Mar 7 13:24:14 2014
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="startup-fencing" value="false" id="cib-bootstrap-options-startup-fencing"/>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="stonith-timeout" value="60s" id="cib-bootstrap-options-stonith-timeout"/>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="crmd-transition-delay" value="2s" id="cib-bootstrap-options-crmd-transition-delay"/>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 ) info: cib_perform_op: ++ <!--#primitive prmDummy1 ocf:heartbeat:Dummy-->
( xml.c:1394 ) info: cib_perform_op: ++ <!--#location rsc_location-group1-1 group1 # rule 200: #uname eq vm01 # rule 100: #uname eq vm02-->
( xml.c:1394 ) info: cib_perform_op: ++ <instance_attributes id="prmDummy-instance_attributes">
...
( xml.c:1394 ) info: cib_perform_op: ++ </primitive>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy2" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 ) info: cib_perform_op: ++ <instance_attributes id="prmDummy2-instance_attributes">
...
( xml.c:1394 ) info: cib_perform_op: ++ </primitive>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy3" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 ) info: cib_perform_op: ++ <instance_attributes id="prmDummy3-instance_attributes">
...
( xml.c:1394 ) info: cib_perform_op: ++ </primitive>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration/resources: <primitive id="prmDummy4" class="ocf" provider="heartbeat" type="Dummy"/>
( xml.c:1394 ) info: cib_perform_op: ++ <instance_attributes id="prmDummy4-instance_attributes">
...
( xml.c:1394 ) info: cib_perform_op: ++ </primitive>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/configuration: <rsc_defaults/>
( xml.c:1394 ) info: cib_perform_op: ++ <meta_attributes id="rsc-options">
( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="resource-stickiness" value="INFINITY" id="rsc-options-resource-stickiness"/>
( xml.c:1394 ) info: cib_perform_op: ++ <nvpair name="migration-threshold" value="1" id="rsc-options-migration-threshold"/>
( xml.c:1394 ) info: cib_perform_op: ++ </meta_attributes>
( xml.c:1394 ) info: cib_perform_op: ++ </rsc_defaults>
( xml.c:1387 ) info: cib_perform_op: ++ /cib/status/node_state[@id='3232261508']/transient_attributes[@id='3232261508']/instance_attributes[@id='status-3232261508']: <nvpair id="status-3232261508-shutdown" name="shutdown" value="0"/>
( xml.c:1399 ) info: cib_perform_op: +~
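The "Inferring changes after cib_replace op" step above is the interesting part: a replace operation carries no patch, so one has to be derived by comparing the old and new trees. The real code is __xml_diff_object working over libxml2 trees, and it also tracks moves and attribute changes; the following is only a minimal sketch of the idea with made-up node types, matching children by name and emitting ++/-- entries like the log:

#include <stdio.h>
#include <string.h>

struct node {
    const char *name;
    const struct node *children[4];   /* NULL-terminated */
};

static const struct node *find_child(const struct node *p, const char *name)
{
    for (int i = 0; p->children[i]; i++)
        if (strcmp(p->children[i]->name, name) == 0)
            return p->children[i];
    return NULL;
}

static void diff(const struct node *old, const struct node *new, const char *path)
{
    char child_path[256];

    /* children present in old but gone from new: deletions */
    for (int i = 0; old->children[i]; i++)
        if (find_child(new, old->children[i]->name) == NULL)
            printf("-- %s/%s\n", path, old->children[i]->name);

    /* children present in new: additions, or recurse into both */
    for (int i = 0; new->children[i]; i++) {
        const struct node *o = find_child(old, new->children[i]->name);

        snprintf(child_path, sizeof(child_path), "%s/%s",
                 path, new->children[i]->name);
        if (o == NULL)
            printf("++ %s\n", child_path);
        else
            diff(o, new->children[i], child_path);
    }
}

int main(void)
{
    static const struct node crm_config   = { "crm_config",   { NULL } };
    static const struct node rsc_defaults = { "rsc_defaults", { NULL } };
    static const struct node old_cfg = { "configuration", { &crm_config, NULL } };
    static const struct node new_cfg = { "configuration", { &crm_config, &rsc_defaults, NULL } };

    diff(&old_cfg, &new_cfg, "/cib/configuration");   /* prints: ++ /cib/configuration/rsc_defaults */
    return 0;
}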
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:
> [snip]
> If I do this however:
> # cp start.xml 1.xml; tools/cibadmin --replace -o configuration --xml-file replace.some -V
> I start to see what you see:
> ( xml.c:4985 ) info: validate_with_relaxng: Creating RNG parser context
> ( cib_file.c:268 ) info: cib_file_perform_op_delegate: cib_replace on configuration
> ( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
> ( xml.c:1487 ) trace: cib_perform_op: -- /configuration
> ( xml.c:1490 ) trace: cib_perform_op: + <cib epoch="2" num_updates="14" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Fri Mar 7 13:24:07 2014" update-origin="vm01" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="3232261507"/>
> ( xml.c:1490 ) trace: cib_perform_op: ++ <configuration>
> ( xml.c:1490 ) trace: cib_perform_op: ++ <crm_config>
> Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b

And now with improved change detection: https://github.com/beekhof/pacemaker/commit/6f364db

but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On Fri, 07 Mar 2014 10:30:13 +0300, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>>> Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.
>> More details?
> Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing. Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

No, that commit fixes an issue when importing the CIB into crmsh; the diff calculation happens when going the other way.

It seems strange that crmsh should be causing such a problem: all it does is call crm_diff to generate the actual diff, so any problem with an incorrect digest should be coming from crm_diff. This is not an issue known to me, and it doesn't sound like the same problem I have been investigating.

Could you file a bug at https://savannah.nongnu.org/bugs/?group=crmsh with some more details?

Thank you,

--
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
18.02.2014 03:49, Andrew Beekhof wrote:
> On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
>> Hi, all
>> I measure the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. I stopped the vm01 node forcibly from the inside after starting 14 nodes; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes are separated from the cluster. "Retransmit List:" logs are then output in large quantities from corosync.
> Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.

> I'd be interested to hear your feedback.
>> Why is a node on which no failure occurred removed (lost) from the cluster? Please advise if there is a problem somewhere in the setup. I attached the report from when the problem occurred:
>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>> Regards, Yusuke
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On Thu, 06 Mar 2014 14:39:46 +0300, Vladislav Bogdanov bub...@hoster-ok.com wrote:
>> Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
> Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.

Ah! This sounds like the same issue that I am seeing with crmsh.

--
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
> 18.02.2014 03:49, Andrew Beekhof wrote:
>> On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:
>>> Hi, all
>>> I measure the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. I stopped the vm01 node forcibly from the inside after starting 14 nodes; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes are separated from the cluster. "Retransmit List:" logs are then output in large quantities from corosync.
>> Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
> Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise: -206.

More details?

>> I'd be interested to hear your feedback.
>>> Why is a node on which no failure occurred removed (lost) from the cluster? Please advise if there is a problem somewhere in the setup. I attached the report from when the problem occurred:
>>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>>> Regards, Yusuke
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-03-07 11:43 GMT+09:00 Andrew Beekhof and...@beekhof.net:
> I don't understand... crm_mon doesn't look for changes to resources or constraints and it should already be using the new faster diff format.
> [/me reads attachment]
> Ah, but perhaps I do understand after all :-)
> This is repeated over and over:
> notice: crm_diff_update: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
> notice: xml_patch_version_check: Current num_updates is too high (885 > 67)
> That would certainly drive up CPU usage and cause crm_mon to get left behind. Happily the fix for that should be: https://github.com/beekhof/pacemaker/commit/6c33820

I verified that the refresh of the cib is no longer repeated when the versions differ. Thank you for dealing with this.

Now, I see another problem. If "crm configure load update" is performed while crm_mon is running, information is no longer displayed. Information is displayed again if crm_mon is restarted.

I executed the following command and took the log of crm_mon:

# crm_mon --disable-ncurses -VV > crm_mon.log 2>&1

I am observing the cib information inside crm_mon after the load was performed. Two configuration sections exist in the cib after the load. It seems to come from the following processing; the old section remains because the deletion of the configuration section failed:

trace: cib_native_dispatch_internal: cib-reply ... <change operation="delete" path="/configuration"/>

A little further down is a debug log acquired with an older pacemaker. The lookup ends in (null) because it tries to find path=/configuration from the top of the document tree. Shouldn't path essentially be path=/cib/configuration?

notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: (null)
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <cib epoch="2" num_updates="6" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Tue Mar 4 11:32:36 2014" update-origin="rhel64rpmbuild" update-client="crmd" have-quorum="1" dc-uuid="3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <configuration>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <crm_config>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <cluster_property_set id="cib-bootstrap-options">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-2dbaf19"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </cluster_property_set>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </crm_config>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nodes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <node id="3232261524" uname="rhel64rpmbuild"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </nodes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <resources/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <constraints/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </configuration>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <status>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <node_state id="3232261524" uname="rhel64rpmbuild" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <lrm id="3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <lrm_resources/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </lrm>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <transient_attributes id="3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <instance_attributes id="status-3232261524">
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="status-3232261524-shutdown" name="shutdown" value="0"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: <nvpair id="status-3232261524-probe_complete" name="probe_complete" value="true"/>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </instance_attributes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </transient_attributes>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </node_state>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </status>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: </cib>
notice Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: /(null)

Is this an already-known problem? I attach the report from when this occurred, and the log of crm_mon.

- crm_report
https://drive.google.com/file/d/0BwMFJItoO-fVWEw4Qnp0aHIzSm8/edit?usp=sharing
- crm_mon.log
https://drive.google.com/file/d/0BwMFJItoO-fVRDRMTGtUUEdBc1E/edit?usp=sharing

Regards, Yusuke

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
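A minimal sketch (assumed types, not Pacemaker's __xml_find_path) of why the lookup above ends in (null): an absolute path is matched starting at the document root, so its first component must name the root element, which here is <cib>; a path of /configuration can therefore never match anything:

#include <stdio.h>
#include <string.h>

struct node { const char *name; const struct node *child; };

static const struct node *resolve(const struct node *root, const char *path)
{
    char buf[128];
    const struct node *cur = root;
    const char *tok;

    snprintf(buf, sizeof(buf), "%s", path);
    tok = strtok(buf, "/");
    if (tok == NULL || strcmp(tok, cur->name) != 0)
        return NULL;                 /* first component must name the root */
    while ((tok = strtok(NULL, "/")) != NULL) {
        cur = cur->child;            /* descend one level and match by name */
        if (cur == NULL || strcmp(cur->name, tok) != 0)
            return NULL;
    }
    return cur;
}

int main(void)
{
    static const struct node configuration = { "configuration", NULL };
    static const struct node cib = { "cib", &configuration };

    printf("/configuration     -> %s\n",
           resolve(&cib, "/configuration") ? "found" : "(null)");
    printf("/cib/configuration -> %s\n",
           resolve(&cib, "/cib/configuration") ? "found" : "(null)");
    return 0;
}

The first lookup returns (null) and the second succeeds, which matches the symptom in the debug log and the suggestion that the delete should carry path=/cib/configuration.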
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
07.03.2014 05:43, Andrew Beekhof wrote:
On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
18.02.2014 03:49, Andrew Beekhof wrote:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all
I am measuring the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I forcibly stopped the vm01 node from the inside; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib calculates, the diff applies as expected. Otherwise it fails with -206.

More details?

Hmmm... this seems to be crmsh-specific; I cannot reproduce it with pure-XML editing.

Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

I'd be interested to hear your feedback.

Why would a node on which no failure has occurred be considered lost? Please advise if there is a problem somewhere in my setup. I attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
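For anyone wondering where the -206 comes from in the digest discussion above: a generated diff carries a digest of what the CIB should look like once the patch is applied, and the receiving side recomputes that digest after applying; any mismatch causes the whole patch to be rejected (in Pacemaker, -206 corresponds, I believe, to pcmk_err_diff_failed). The sketch below shows only the shape of that check, with a toy hash and made-up structure and function names standing in for the real digest over a canonicalised XML dump:

#include <stdio.h>
#include <string.h>

/* Toy stand-in for the digest computed over the patched CIB; the real
 * code hashes a canonicalised dump of the XML, not a raw string. */
static unsigned long toy_digest(const char *xml)
{
    unsigned long h = 5381;
    for (; *xml != '\0'; xml++)
        h = h * 33 + (unsigned char) *xml;
    return h;
}

/* A patch carries the post-apply content (simplified here) plus the
 * digest the sender computed for that content. */
struct patch {
    const char *new_content;
    unsigned long expected_digest;
};

/* Returns 0 on success, or -206 when the result would not match the
 * digest the sender promised. */
static int apply_patch(char *cib, size_t len, const struct patch *p)
{
    if (toy_digest(p->new_content) != p->expected_digest)
        return -206;

    snprintf(cib, len, "%s", p->new_content);
    return 0;
}

int main(void)
{
    char cib[64] = "<cib epoch=\"2\"/>";
    struct patch good = { "<cib epoch=\"3\"/>", 0 };
    struct patch bad  = { "<cib epoch=\"3\"/>", 42 };  /* wrong digest */

    good.expected_digest = toy_digest(good.new_content);

    printf("good patch -> %d\n", apply_patch(cib, sizeof(cib), &good));
    printf("bad patch  -> %d\n", apply_patch(cib, sizeof(cib), &bad));
    return 0;
}

This is why hand-replacing the digest made the diff apply: the cib only checks that the digest carried in the patch matches what it computes itself, so a sender that computes the digest differently (as crm_diff apparently did here) produces patches the cib must refuse.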
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 20 Feb 2014, at 6:06 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, Andrew
I tested in the following environment:
16 KVM virtual machines, each with 1 CPU and 2048MB of memory
OS: RHEL6.4
Pacemaker-1.1.11 (709b36b)
corosync-2.3.2
libqb-0.16.0

It looks like performance is much better on the whole. However, a problem arose during the 16-node test: the event queue overflowed on some nodes. It happened on vm01 and vm09. On vm01, the queue overflow occurred between the cib and crm_mon.

Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:506 ) trace: crm_ipcs_flush_events: Sent 40 events (729 remaining) for 0x1cd1850[16243]: Resource temporarily unavailable (-11)
Feb 20 14:21:02 [16211] vm01 cib: ( ipc.c:515 ) error: crm_ipcs_flush_events: Evicting slow client 0x1cd1850[16243]: event queue reached 729 entries

Who was pid 16243? Doesn't look like a pacemaker daemon.

On vm09, the queue overflow occurred between the cib and stonithd.

Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:506 ) trace: crm_ipcs_flush_events: Sent 36 events (530 remaining) for 0x105ec10[15520]: Resource temporarily unavailable (-11)
Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:515 ) error: crm_ipcs_flush_events: Evicting slow client 0x105ec10[15520]: event queue reached 530 entries

I checked the code around the problem area, but could not work out how it should be solved. Is sending only 100 messages at a time too few? Is there a problem in how the wait time after a transmission is calculated? Is the threshold of 500 too low?

being 500 behind is really quite a long way.

I attach the crm_report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVeGZuWkFnZTFWTDQ/edit?usp=sharing

Regards,
Yusuke

2014-02-18 19:53 GMT+09:00 yusuke iida yusk.i...@gmail.com:
Hi, Andrew and Digimer
Thank you for the comments. I solved this problem with reference to another mailing list thread:
https://bugzilla.redhat.com/show_bug.cgi?id=880035
In short, the kernel in my environment was too old; I have now updated to the latest kernel, kernel-2.6.32-431.5.1.el6.x86_64.rpm. The following parameters are now set on the bridge that carries the corosync traffic. As a result, Retransmit List messages almost never occur any more.

# echo 1 > /sys/class/net/bridge/bridge/multicast_querier
# echo 0 > /sys/class/net/bridge/bridge/multicast_snooping

2014-02-18 9:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all
I am measuring the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I forcibly stopped the vm01 node from the inside; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU. I'd be interested to hear your feedback.

Since I am very interested in this, I would like to test it even though the Retransmit List problem is solved. Please wait a little for the results.

Thanks,
Yusuke

Why would a node on which no failure has occurred be considered lost? Please advise if there is a problem somewhere in my setup. I attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
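To make the numbers in this exchange concrete, here is a self-contained model of the flush-and-evict policy being discussed: the cib queues change notifications per client, each flush delivers at most a fixed batch (100 here, matching the question above), and a client whose backlog grows past the threshold (500) is evicted. Everything in it, names included, illustrates the policy rather than Pacemaker's actual ipc.c:

#include <stdio.h>
#include <stdbool.h>

#define FLUSH_BATCH 100   /* most events attempted per flush */
#define EVICT_LIMIT 500   /* backlog at which a client is dropped */

struct client {
    const char *name;
    int queued;           /* events waiting to be delivered */
    bool evicted;
};

/* Deliver one batch; 'accepted' models how many events the client's
 * socket actually took before returning EAGAIN (-11). */
static void flush_events(struct client *c, int accepted)
{
    int sent = accepted < FLUSH_BATCH ? accepted : FLUSH_BATCH;

    if (sent > c->queued)
        sent = c->queued;
    c->queued -= sent;

    printf("Sent %d events (%d remaining) for %s\n",
           sent, c->queued, c->name);

    if (c->queued > EVICT_LIMIT) {
        printf("Evicting slow client %s: event queue reached %d entries\n",
               c->name, c->queued);
        c->evicted = true;
    }
}

int main(void)
{
    struct client mon = { "crm_mon", 0, false };

    /* A burst of CIB updates outruns a client that only drains 40
     * events per flush: the backlog grows by 20 per round until the
     * eviction threshold is crossed. */
    for (int round = 0; round < 30 && !mon.evicted; round++) {
        mon.queued += 60;        /* new events arriving this round */
        flush_events(&mon, 40);  /* client keeps hitting EAGAIN early */
    }
    return 0;
}

The model also shows why being 500 behind is "really quite a long way": a reader that persistently drains fewer events than arrive can only fall further behind, so by the time the threshold trips, whatever it is displaying is long stale and eviction is the sane response.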
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-02-20 17:28 GMT+09:00 Andrew Beekhof and...@beekhof.net:
Who was pid 16243? Doesn't look like a pacemaker daemon.

pid 16243 is crm_mon. crm_mon was started on vm01 to check the cluster state. If any other information is required for the analysis, I will get it.

Regards,
Yusuke

On vm09, the queue overflow occurred between the cib and stonithd.

Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:506 ) trace: crm_ipcs_flush_events: Sent 36 events (530 remaining) for 0x105ec10[15520]: Resource temporarily unavailable (-11)
Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:515 ) error: crm_ipcs_flush_events: Evicting slow client 0x105ec10[15520]: event queue reached 530 entries

I checked the code around the problem area, but could not work out how it should be solved. Is sending only 100 messages at a time too few? Is there a problem in how the wait time after a transmission is calculated? Is the threshold of 500 too low?

being 500 behind is really quite a long way.
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 20 Feb 2014, at 8:39 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, Andrew
2014-02-20 17:28 GMT+09:00 Andrew Beekhof and...@beekhof.net:
Who was pid 16243? Doesn't look like a pacemaker daemon.
pid 16243 is crm_mon.

That means that the state displayed by crm_mon was 500 updates behind. At that point, what it's displaying is horribly out of date, and evicting it seems like a pretty good idea.

crm_mon was started on vm01 to check the cluster state. If any other information is required for the analysis, I will get it.

Some idea of what crm_mon is doing would be a good start. Adding a few -V options in addition to --disable-ncurses might be the best approach.

Regards,
Yusuke

On vm09, the queue overflow occurred between the cib and stonithd.

Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:506 ) trace: crm_ipcs_flush_events: Sent 36 events (530 remaining) for 0x105ec10[15520]: Resource temporarily unavailable (-11)
Feb 20 14:20:22 [15519] vm09 cib: ( ipc.c:515 ) error: crm_ipcs_flush_events: Evicting slow client 0x105ec10[15520]: event queue reached 530 entries

I checked the code around the problem area, but could not work out how it should be solved. Is sending only 100 messages at a time too few? Is there a problem in how the wait time after a transmission is calculated? Is the threshold of 500 too low?

being 500 behind is really quite a long way.
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
18.02.2014 03:49, Andrew Beekhof wrote:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all
I am measuring the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I forcibly stopped the vm01 node from the inside; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, you mean your cib-performance branch, am I correct? Unfortunately it is not in .11 (sorry if I overlooked it there), and not even in ClusterLabs/master yet, and it seems to have been merged and then reverted in beekhof/master...

I'd be interested to hear your feedback.

Why would a node on which no failure has occurred be considered lost? Please advise if there is a problem somewhere in my setup. I attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 18 Feb 2014, at 7:40 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
18.02.2014 03:49, Andrew Beekhof wrote:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all
I am measuring the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I forcibly stopped the vm01 node from the inside; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, you mean your cib-performance branch, am I correct?

Yes

Unfortunately it is not in .11

Intentionally so :)

(sorry if I overlooked it there), and not even in ClusterLabs/master yet, and it seems to have been merged and then reverted in beekhof/master...

This has just been brought to my attention :-(
https://github.com/beekhof/pacemaker/commit/1d98f6fd9eb76bd2498bc6356a3aa6e91a8a70e4#commitcomment-5405620
Give me a few minutes and I'll correct it.

I'd be interested to hear your feedback.

Why would a node on which no failure has occurred be considered lost? Please advise if there is a problem somewhere in my setup. I attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 18 Feb 2014, at 8:18 pm, Andrew Beekhof and...@beekhof.net wrote:
On 18 Feb 2014, at 7:40 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
18.02.2014 03:49, Andrew Beekhof wrote:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all
I am measuring the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I forcibly stopped the vm01 node from the inside; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, you mean your cib-performance branch, am I correct?

Yes

Unfortunately it is not in .11

Intentionally so :)

(sorry if I overlooked it there), and not even in ClusterLabs/master yet, and it seems to have been merged and then reverted in beekhof/master...

This has just been brought to my attention :-(
https://github.com/beekhof/pacemaker/commit/1d98f6fd9eb76bd2498bc6356a3aa6e91a8a70e4#commitcomment-5405620
Give me a few minutes and I'll correct it.

OK, I've force-pushed a tree without the above screwup. I'll merge into ClusterLabs tomorrow.

I'd be interested to hear your feedback.

Why would a node on which no failure has occurred be considered lost? Please advise if there is a problem somewhere in my setup. I attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew and Digimer

Thank you for the comments. I solved this problem with reference to another mailing list thread:
https://bugzilla.redhat.com/show_bug.cgi?id=880035

In short, the kernel in my environment was too old; I have now updated to the latest kernel, kernel-2.6.32-431.5.1.el6.x86_64.rpm. The following parameters are now set on the bridge that carries the corosync traffic. As a result, Retransmit List messages almost never occur any more.

# echo 1 > /sys/class/net/bridge/bridge/multicast_querier
# echo 0 > /sys/class/net/bridge/bridge/multicast_snooping

2014-02-18 9:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all
I am measuring the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I forcibly stopped the vm01 node from the inside; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU. I'd be interested to hear your feedback.

Since I am very interested in this, I would like to test it even though the Retransmit List problem is solved. Please wait a little for the results.

Thanks,
Yusuke

Why would a node on which no failure has occurred be considered lost? Please advise if there is a problem somewhere in my setup. I attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all
I am measuring the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I forcibly stopped the vm01 node from the inside; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were output in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU. I'd be interested to hear your feedback.

Why would a node on which no failure has occurred be considered lost? Please advise if there is a problem somewhere in my setup. I attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com