Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
07.03.2014 10:30, Vladislav Bogdanov wrote:
07.03.2014 05:43, Andrew Beekhof wrote:
On 6 Mar 2014, at 10:39 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:
18.02.2014 03:49, Andrew Beekhof wrote:
On 31 Jan 2014, at 6:20 pm, yusuke iida yusk.i...@gmail.com wrote:

Hi, all

I measured the performance of Pacemaker with the following combination: Pacemaker-1.1.11.rc1, libqb-0.16.0, corosync-2.3.2. All nodes are KVM virtual machines. After starting 14 nodes, I stopped the vm01 node forcibly; "virsh destroy vm01" was used for the stop. Then, in addition to the forcibly stopped node, other nodes were separated from the cluster, and "Retransmit List:" messages were logged in large quantities by corosync.

Probably best to poke the corosync guys about this. However, <= .11 is known to cause significant CPU usage with that many nodes. I can easily imagine this starving corosync of resources and causing breakage. I would _highly_ recommend retesting with the current git master of pacemaker. I merged the new cib code last week, which is faster by _two_ orders of magnitude and uses significantly less CPU.

Andrew, current git master (ee094a2) almost works; the only issue is that crm_diff calculates an incorrect diff digest. If I replace the digest in the diff by hand with what the cib daemon expects, it applies correctly. Otherwise it fails with -206.

More details?

Hmmm... seems to be crmsh-specific; I cannot reproduce it with pure-XML editing. Kristoffer, does http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests.
Attached are two sets of XML files: one (orig.xml, new.xml, patch.xml) relates to the full CIB operation (with the status section included); the other (orig-edited.xml, new-edited.xml, patch-edited.xml) has that section removed, as crmsh does. The resulting diffs differ only by digest, and that seems to be the exact issue.

  <cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
          <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
          <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="1" uname="booter-0"/>
        <node id="2" uname="booter-1"/>
      </nodes>
      <resources/>
      <constraints/>
    </configuration>
    <status>
      <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
        <lrm id="1">
          <lrm_resources/>
        </lrm>
        <transient_attributes id="1">
          <instance_attributes id="status-1">
            <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
            <nvpair id="status-1-probe_complete" name="probe_complete" value="true"/>
          </instance_attributes>
        </transient_attributes>
      </node_state>
    </status>
  </cib>

  <cib epoch="4" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
          <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
          <nvpair name="symmetric-cluster" value="true" id="cib-bootstrap-options-symmetric-cluster"/>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="1" uname="booter-0"/>
        <node id="2" uname="booter-1"/>
      </nodes>
      <resources/>
      <constraints/>
    </configuration>
  </cib>

  <cib epoch="3" num_updates="5" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Mar 11 06:57:54 2014" update-origin="booter-0" update-client="crmd" update-user="hacluster" crm_feature_set="3.0.9" have-quorum="1" dc-uuid="1">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.11-1.3.el6-b75a9bd"/>
          <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="1" uname="booter-0"/>
        <node id="2" uname="booter-1"/>
      </nodes>
      <resources/>
      <constraints/>
    </configuration>
    <status>
      <node_state id="1" uname="booter-0" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" [attachment truncated]
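The digest mismatch described above comes down to hashing different inputs: crmsh fed crm_diff CIBs with the status section stripped, while the cib daemon validated the digest against the full CIB. A toy Python sketch of the effect — this is NOT pacemaker's real digest code (which hashes its own canonical XML serialization); the hash and serialization here are illustrative only:

```python
import hashlib
import xml.etree.ElementTree as ET

def toy_digest(cib_xml: str) -> str:
    """Hash a crudely canonicalized dump: sorted attributes, no whitespace.
    Stands in for pacemaker's digest of a canonical CIB serialization."""
    root = ET.fromstring(cib_xml)

    def dump(elem):
        attrs = "".join(f' {k}="{v}"' for k, v in sorted(elem.attrib.items()))
        return f"<{elem.tag}{attrs}>" + "".join(dump(c) for c in elem) + f"</{elem.tag}>"

    return hashlib.md5(dump(root).encode()).hexdigest()

# Same configuration, but one copy has <status> stripped, as crmsh did:
full = '<cib epoch="4"><configuration/><status><node_state id="1"/></status></cib>'
edited = '<cib epoch="4"><configuration/></cib>'

print(toy_digest(full) == toy_digest(edited))  # → False: the digests diverge
```

Any digest computed over the status-less copy cannot match one computed over the full CIB, which is why hand-replacing the digest made the patch apply.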
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-03-11 14:21 GMT+09:00 Andrew Beekhof and...@beekhof.net:
On 11 Mar 2014, at 4:14 pm, Andrew Beekhof and...@beekhof.net wrote:
[snip]

If I do this however:

  # cp start.xml 1.xml; tools/cibadmin --replace -o configuration --xml-file replace.some -V

I start to see what you see:

  ( xml.c:4985 )      info: validate_with_relaxng: Creating RNG parser context
  ( cib_file.c:268 )  info: cib_file_perform_op_delegate: cib_replace on configuration
  ( cib_utils.c:338 ) trace: cib_perform_op: Begin cib_replace op
  ( xml.c:1487 )      trace: cib_perform_op: -- /configuration
  ( xml.c:1490 )      trace: cib_perform_op: +  <cib epoch="2" num_updates="14" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.9" cib-last-written="Fri Mar 7 13:24:07 2014" update-origin="vm01" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="3232261507"/>
  ( xml.c:1490 )      trace: cib_perform_op: ++ <configuration>
  ( xml.c:1490 )      trace: cib_perform_op: ++ <crm_config>

Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b , and now with improved change detection: https://github.com/beekhof/pacemaker/commit/6f364db

I confirmed that the problem where crm_mon did not display updates has been solved.

BTW, the following log message has started appearing recently. Operation seems unaffected, but does it indicate a problem?

  Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?

The execution result of the following command remained in /var/log/messages:

  Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I am using crmsh-1.2.6-rc3.
Thanks, Yusuke

-- 
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker/corosync freeze
-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Tuesday, March 11, 2014 12:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

Thanks for the quick response!

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Friday, March 07, 2014 3:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker/corosync freeze

On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote:

Hello,

We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the CPU usage, I see that one of the cores uses 100% CPU, but I cannot actually match it to either corosync or one of the pacemaker processes. In such a case, this high CPU usage happens on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemaker. Stopping pacemaker and corosync does not work in most cases; usually a kill -9 is needed.

Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
Logs are usually flooded with CPG-related messages, such as:

  Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)

OR

  Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1 cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again (

That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though.

As I wrote, I use Ubuntu trusty; the exact package versions are: corosync 2.3.0-1ubuntu5, pacemaker 1.1.10+git20130802-1ubuntu2.

Ah sorry, I seem to have missed that part.

There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend?

The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, it's probably best to go with v1.4.6.

Hm, I am a bit confused here. We are using 2.3.0, which was released approx. a year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old version. Could you please clarify a bit? :) Lars recommends the 2.3.3 git tree. I might end up trying both, but just want to make sure I am not misunderstanding something badly. Thank you!
HTOP shows something like this (sorted by TIME+ descending):

  1 [100.0%]        Tasks: 59, 4 thr; 2 running
  2 [|   0.7%]      Load average: 1.00 0.99 1.02
  Mem[ 165/994MB]   Uptime: 1 day, 10:22:03
  Swp[   0/509MB]

   PID USER      PRI NI  VIRT   RES   SHR S CPU% MEM%   TIME+ Command
   921 root       20  0  188M 49220 33856 R  0.0  4.8 3h33:58 /usr/sbin/corosync
  1277 snmp       20  0 45708  4248  1472 S  0.0  0.4 1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
  1311 hacluster  20  0  109M 16160  9640 S  0.0  1.6 1:12.71 /usr/lib/pacemaker/cib
  1312 root       20  0  104M  7484  3780 S  0.0  0.7 0:38.06 /usr/lib/pacemaker/stonithd
  1611 root       -2  0  4408  2356  2000 S  0.0  0.2 0:24.15 /usr/sbin/watchdog
  1316 hacluster  20  0  122M  9756  5924 S  0.0  1.0 0:22.62 /usr/lib/pacemaker/crmd
  1313 root       20  0 81784  3800  2876 S  0.0  0.4 0:18.64 /usr/lib/pacemaker/lrmd
  1314 hacluster  20  0 96616  4132  2604 S  0.0  0.4 0:16.01 /usr/lib/pacemaker/attrd
  1309 root       20  0  104M  4804  2580 S  0.0  0.5 0:15.56 pacemakerd
  1250 root       20  0 33000  1192   928 S  0.0  0.1 0:13.59 ha_logd: read process
  1315 hacluster  20  0 73892  2652
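The crm_cs_flush "Try again" flood quoted above is mechanical enough to detect from logs. A hedged sketch (a hypothetical helper, not part of pacemaker or corosync) that measures the longest run of consecutive failed CPG flush attempts — a long run suggests the stuck-CPG condition described in this thread:

```python
import re

# Matches pacemaker's crm_cs_flush retry messages, e.g.
# "crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)"
RETRY = re.compile(r"crm_cs_flush:\s+Sent 0 CPG messages \(\d+ remaining.*Try again")

def max_retry_run(lines):
    """Return the longest run of consecutive 'Sent 0 ... Try again' lines."""
    longest = run = 0
    for line in lines:
        if RETRY.search(line):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

log = [
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6)",
]
print(max_retry_run(log))  # → 3
```

A handful of retries is normal under load; hundreds in a row, as in the logs above, points at corosync no longer accepting messages.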
Re: [Pacemaker] Pacemaker/corosync freeze
On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote:

[snip - full quote of the earlier exchange]

Hm, I am a bit confused here. We are using 2.3.0,

I swapped the 2 for a 1 somehow. A bit distracted, sorry.

which was released approx. a year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old version. Could you please clarify a bit? :) Lars recommends the 2.3.3 git tree. I might end up trying both, but just want to make sure I am not misunderstanding something badly. Thank you!

[snip - htop output quoted earlier]
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 6:51 pm, Yusuke Iida yusk.i...@gmail.com wrote:

[snip - quote of the earlier message]

BTW, the following log message has started appearing recently. Operation seems unaffected, but does it indicate a problem?

  Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

That's interesting... is that with the fixes mentioned above?

but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?

The execution result of the following command remained in /var/log/messages:

  Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I'm somewhat confused at this point: if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations only used for the load operation?

I am using crmsh-1.2.6-rc3.

Thanks, Yusuke
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 11 Mar 2014, at 6:23 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote:

[snip - quote of the earlier thread]

The problem seems to be caused by the fact that crmsh does not provide the status section in either the orig or the new XML passed to crm_diff, and digest generation seems to rely on that, so crm_diff and the cib daemon produce different digests.
[snip - description of the attached XML files; the resulting diffs differ only by digest]

This should help: as long as crmsh isn't passing -c to crm_diff, the digest will no longer be present.

https://github.com/beekhof/pacemaker/commit/c8d443d
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
On 12 Mar 2014, at 8:40 am, Andrew Beekhof and...@beekhof.net wrote:

[snip - quote of the earlier thread]

This should help: as long as crmsh isn't passing -c to crm_diff, the digest will no longer be present. https://github.com/beekhof/pacemaker/commit/c8d443d

Github seems to be doing something weird at the moment... here's the raw patch:

commit c8d443d8d1604dde2727cf716951231ed05926e4
Author: Andrew Beekhof and...@beekhof.net
Date:   Wed Mar 12 08:38:58 2014 +1100

    Fix: crm_diff: Allow the generation of xml patchsets without digests

diff --git a/tools/xml_diff.c b/tools/xml_diff.c
index c8673b9..b98859e 100644
--- a/tools/xml_diff.c
+++ b/tools/xml_diff.c
@@ -199,7 +199,7 @@ main(int argc, char **argv)
     xml_calculate_changes(object_1, object_2);
     crm_log_xml_debug(object_2, xml_file_2?xml_file_2:"target");
 
-    output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, TRUE);
+    output = xml_create_patchset(0, object_1, object_2, NULL, FALSE, as_cib);
 
     if(as_cib && output) {
         int add[] = { 0, 0, 0 };
Re: [Pacemaker] hangs pending
Sorry for the delay, sometimes it takes a while to rebuild the necessary context.

On 5 Mar 2014, at 4:42 pm, Andrey Groshev gre...@yandex.ru wrote:
05.03.2014, 04:04, Andrew Beekhof and...@beekhof.net:
On 25 Feb 2014, at 8:30 pm, Andrey Groshev gre...@yandex.ru wrote:
21.02.2014, 12:04, Andrey Groshev gre...@yandex.ru:
21.02.2014, 05:53, Andrew Beekhof and...@beekhof.net:
On 19 Feb 2014, at 7:53 pm, Andrey Groshev gre...@yandex.ru wrote:
19.02.2014, 09:49, Andrew Beekhof and...@beekhof.net:
On 19 Feb 2014, at 4:18 pm, Andrey Groshev gre...@yandex.ru wrote:
19.02.2014, 09:08, Andrew Beekhof and...@beekhof.net:
On 19 Feb 2014, at 4:00 pm, Andrey Groshev gre...@yandex.ru wrote:
19.02.2014, 06:48, Andrew Beekhof and...@beekhof.net:
On 18 Feb 2014, at 11:05 pm, Andrey Groshev gre...@yandex.ru wrote:

Hi, ALL and Andrew!

Today is a good day - I killed a lot, and a lot of shooting at me. In general, I am happy (almost like an elephant) :)

Apart from the resources, eight processes on the node are important to me: corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd. I killed them with different signals (4, 6, 11 and even 9). The behavior does not depend on the signal number - that's good. If STONITH sends a reboot to the node, it is rebooted and rejoins the cluster - that's good too. But the behavior differs depending on which daemon is killed. There turned out to be four groups:

1. corosync, cib - STONITH works 100%. Killed via any signal - STONITH is called and the node reboots.
2. lrmd, crmd - strange STONITH behavior. Sometimes STONITH is called, with the corresponding reaction. Sometimes the daemon restarts and resources restart with a large delay on MS:pgsql. One time after a crmd restart, pgsql did not restart.
3. stonithd, attrd, pengine - no STONITH needed. These daemons simply restart; resources stay running.
4. pacemakerd - nothing happens. And then I can kill any process of the third group, and they do not restart.

Generally: don't touch corosync, cib and maybe lrmd, crmd. What do you think about this? The main question of this topic we have settled, but this varied behavior is another big problem.

Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2

Which of the various conditions above do the logs cover?

All of the variations, in one day.

Are you trying to torture me? Can you give me a rough idea of what happened when?

No - there are 8 processes by 4 signals, and repeats of the experiments with unknown outcome :) It is easier to conduct new experiments with individual new logs. Which variant is more interesting?

The long delay in restarting pgsql. Everything else seems correct.

It didn't even try to start pgsql. In the logs are three tests of "kill -s4" on the lrmd pid: 1. STONITH 2. STONITH 3. hangs

It's waiting on a value for default_ping_set. It seems we're calling monitor for pingCheck but for some reason it's not performing an update:

# grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_1:2816 - exited with rc=0
Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_1:2816:stderr [ -- empty -- ]
Feb 19
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On 8 Mar 2014, at 11:31 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote:

I provoke a power off of ovirteng01. The fencing agent works OK on ovirteng02 and reboots it. I stop the boot of ovirteng01 at the grub prompt to simulate a problem in boot (for example, the system dropped to console mode due to a filesystem problem). In the meantime ovirteng02 becomes master of the drbd resource, but doesn't start the group.

Can you attach the following file from ovirteng02: /var/lib/pacemaker/pengine/pe-input-1082.bz2

That will hold the answer.
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On Tue, Mar 11, 2014 at 11:52 PM, Andrew Beekhof and...@beekhof.net wrote: On 8 Mar 2014, at 11:31 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote: I provoke power off of ovirteng01. Fencing agent works ok on ovirteng02 and reboots it. I stop boot ofovirteng01 at grub prompt to simulate problem in boot (for example system put in console mode due to filesystem problem) In the mean time ovirteng02 becomes master of drbd resource, but doesn't start the group Can you attach the following file from ovirteng02: /var/lib/pacemaker/pengine/pe-input-1082.bz2 That will hold the answer Thanks for your time Andrew. Here it is: https://drive.google.com/file/d/0BwoPbcrMv8mvNXI0M0dYenlRUFU/edit?usp=sharing I note this inside the file: constraints rsc_colocation id=colocation-ovirt-ms_OvirtData-INFINITY rsc=ovirt rsc-role=Started score=INFINITY with-rsc=ms_OvirtData with-rsc-role=Master/ rsc_order first=ms_OvirtData first-action=promote id=order-ms_OvirtData-ovirt-mandatory then=ovirt then-action=start/ rsc_location id=cli-ban-ovirt-on-ovirteng02.localdomain.local rsc=ovirt role=Started node=ovirteng02.localdomain.local score=-INFINITY/ rsc_location rsc=ms_OvirtData id=drbd-fence-by-handler-ovirt-ms_OvirtData rule role=Master score=-INFINITY id=drbd-fence-by-handler-ovirt-rule-ms_OvirtData expression attribute=#uname operation=ne value=ovirteng02.localdomain.local id=drbd-fence-by-handler-ovirt-expr-ms_OvirtData/ /rule /rsc_location /constraints does this mean that a constraint remained for some reason after a previous test, so that ovirteng02 is unable to run ovirt group? Can I check previous pe-input files to debug when constraint was put? By the way I just checked again both nodes with power off when primary and it works for both as expected. If I reproduce what above didn't work (so poweroff of ovirteng01 while master and with group running) the group correctly starts now on ovirteng02. 
While keeping ovirteng01 (rebooted by the fencing agent) at the grub prompt, the command pcs cluster edit gives this on ovirteng02:

<constraints>
  <rsc_colocation id="colocation-ovirt-ms_OvirtData-INFINITY" rsc="ovirt" rsc-role="Started" score="INFINITY" with-rsc="ms_OvirtData" with-rsc-role="Master"/>
  <rsc_order first="ms_OvirtData" first-action="promote" id="order-ms_OvirtData-ovirt-mandatory" then="ovirt" then-action="start"/>
  <rsc_location rsc="ms_OvirtData" id="drbd-fence-by-handler-ovirt-ms_OvirtData">
    <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-ovirt-rule-ms_OvirtData">
      <expression attribute="#uname" operation="ne" value="ovirteng02.localdomain.local" id="drbd-fence-by-handler-ovirt-expr-ms_OvirtData"/>
    </rule>
  </rsc_location>
</constraints>

So the problem seems to be the line

<rsc_location id="cli-ban-ovirt-on-ovirteng02.localdomain.local" rsc="ovirt" role="Started" node="ovirteng02.localdomain.local" score="-INFINITY"/>

correct? Could it be the effect of a pcs resource move ovirt without a pcs resource clear ovirt?

Gianluca
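[Editor's note: for reference, a move without a clear does leave exactly this kind of "cli-ban" location constraint behind. A minimal sketch of the sequence, under the assumption of a pcs version from that era; the grep pattern is purely illustrative:]

```shell
# "move" works by adding a -INFINITY "cli-ban" location constraint
# pinning the resource away from its current node...
pcs resource move ovirt

# ...and that constraint stays in the CIB until explicitly removed:
pcs constraint location show | grep cli-ban

# Remove the leftover constraint (can be run from any node,
# not only the one where the move was issued):
pcs resource clear ovirt
```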
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On Wed, Mar 12, 2014 at 12:37 AM, Andrew Beekhof and...@beekhof.net wrote:

It was put in when drbd called: fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; When and why it called that is not my area of expertise though.

The constraint put by crm-fence-peer.sh was

<rsc_location rsc="ms_OvirtData" id="drbd-fence-by-handler-ovirt-ms_OvirtData">
  <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-ovirt-rule-ms_OvirtData">
    <expression attribute="#uname" operation="ne" value="ovirteng02.localdomain.local" id="drbd-fence-by-handler-ovirt-expr-ms_OvirtData"/>
  </rule>
</rsc_location>

and I think it was good, in the sense that from then on only ovirteng02 could run the drbd resource as master, as ovirteng01 was fenced. But the problem actually was the other constraint

<rsc_location id="cli-ban-ovirt-on-ovirteng02.localdomain.local" rsc="ovirt" role="Started" node="ovirteng02.localdomain.local" score="-INFINITY"/>

preventing ovirteng02 from running the ovirt group. Going backward in the logs, I see that the constraint was put in two days before, during my previous tests (I find it in pe-input-1066.bz2). And if I reproduce a pcs resource move ovirt now, I see that the same constraint is put in, and it is removed when I run pcs resource clear ovirt (I can run it on any node, not necessarily the one where I ran the move operation).

Gianluca
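[Editor's note: this answers the earlier question about checking previous pe-input files. The policy-engine inputs saved under /var/lib/pacemaker/pengine can be replayed offline with crm_simulate, which is one way to pinpoint when a constraint first appeared. A sketch, using the file name mentioned in the thread:]

```shell
# Replay a saved policy-engine input: show the cluster state it
# encoded and the actions the policy engine scheduled from it.
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-1066.bz2

# The raw XML (including the constraints section) can also be
# inspected directly:
bzcat /var/lib/pacemaker/pengine/pe-input-1066.bz2 | grep cli-ban
```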
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On 12 Mar 2014, at 10:32 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote:

[...]

does this mean that a constraint remained for some reason after a previous test, so that ovirteng02 is unable to run the ovirt group? Can I check previous pe-input files to debug when the constraint was put?

It was put in when drbd called: fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; When and why it called that is not my area of expertise though.

[...]
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
Hi, Andrew

2014-03-12 6:37 GMT+09:00 Andrew Beekhof and...@beekhof.net:

Mar 07 13:24:14 [2528] vm01 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib 0xf91c10, configuration

Thats interesting... is that with the fixes mentioned above?

I'm sorry, the above log is not output by the newest Pacemaker. The newest version produces the following logs instead:

Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:377 ) trace: te_update_diff: Handling create operation for /cib/configuration 0x1c397a0, rsc_defaults
Mar 12 10:43:38 [6124] vm02 crmd: (te_callbacks:493 ) error: te_update_diff: Ingoring create operation for /cib/configuration 0x1c397a0, rsc_defaults

I checked the code of te_update_diff. Shouldn't the following check be changed, so that a change to fencing-topology or rsc_defaults is processed as a change under the configuration section? (String quotes restored below; the list archive stripped them.)

diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
index dd57660..f97bab5 100644
--- a/crmd/te_callbacks.c
+++ b/crmd/te_callbacks.c
@@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
 
         if(xpath == NULL) {
             /* Version field, ignore */
-        } else if(strstr(xpath, "/cib/configuration/")) {
+        } else if(strstr(xpath, "/cib/configuration")) {
             abort_transition(INFINITY, tg_restart, "Non-status change", change);
         } else if(strstr(xpath, "/" XML_CIB_TAG_TICKETS "[") || safe_str_eq(name, XML_CIB_TAG_TICKETS)) {

What do you think of such a change?

I attach the report from this time; the trace log of te_update_diff is also included.
https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing

Regards,
Yusuke

but it looks like crmsh is doing something funny with its updates... does anyone know what command it is running?
The execution result of the following command remained in /var/log/messages:

Mar 7 13:24:14 vm01 cibadmin[2555]: notice: crm_log_args: Invoked: cibadmin -p -R --force

I'm somewhat confused at this point: if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations used only for the load operation?

--
METRO SYSTEMS CO., LTD
Yusuke Iida
Mail: yusk.i...@gmail.com
Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out lost?
12.03.2014 00:37, Andrew Beekhof wrote:

... I'm somewhat confused at this point: if crmsh is using --replace, then why is it doing diff calculations? Or are replace operations used only for the load operation?

It uses one of two methods, depending on the pacemaker version.
Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff
On 12 Mar 2014, at 10:56 am, Gianluca Cecchi gianluca.cec...@gmail.com wrote:

[...]

And if I reproduce a pcs resource move ovirt now

Ah, yes. This would explain that part.

[...] it is removed when I run pcs resource clear ovirt (I can run it on any node, not necessarily the one where I ran the move operation)

Gianluca