Re: [Pacemaker] DRBD with Pacemaker on CentOs 6.5
Try this. digimer is an expert at what you are trying to do. https://alteeve.ca/w/AN!Cluster_Tutorial_2

On Thu, Oct 23, 2014 at 1:05 PM, David Pendell losto...@gmail.com wrote:
> Try this. https://alteeve.ca/w/AN!Cluster_Tutorial_2
>
> On Wed, Oct 22, 2014 at 8:08 PM, Sihan Goi gois...@gmail.com wrote:
>> Hi, can anyone help? Really stuck here...
>> [...]

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
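For anyone hitting the same wall, a minimal first-pass check before touching the cluster configuration might look like the sketch below. It assumes the DRBD resource is named wwwdata as in the Clusters from Scratch guide; substitute your own resource names.

    # Confirm DRBD itself is healthy and that one node is Primary
    cat /proc/drbd
    drbdadm role wwwdata        # guide's resource name assumed; adjust to yours

    # Clear the failed WebSite start so Pacemaker retries it, then watch the result
    crm resource cleanup WebSite
    crm_mon -1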
Re: [Pacemaker] DRBD with Pacemaker on CentOs 6.5
Try this. https://alteeve.ca/w/AN!Cluster_Tutorial_2

On Wed, Oct 22, 2014 at 8:08 PM, Sihan Goi gois...@gmail.com wrote:
> Hi, can anyone help? Really stuck here...
>
> On Mon, Oct 20, 2014 at 9:46 AM, Sihan Goi gois...@gmail.com wrote:
>> Hi, I'm following the Clusters from Scratch guide for Fedora 13, and I've managed to get a two-node cluster working with Apache. However, once I tried to add DRBD 8.4 to the mix, it stopped working. I've followed the DRBD steps in the guide all the way to "cib commit fs" in Section 7.4, right before "Testing Migration". However, when I run crm_mon, I get the following failed actions:
>>
>> Last updated: Thu Oct 16 17:28:34 2014
>> Last change: Thu Oct 16 17:26:04 2014 via crm_shadow on node01
>> Stack: cman
>> Current DC: node02 - partition with quorum
>> Version: 1.1.10-14.el6_5.3-368c726
>> 2 Nodes configured
>> 5 Resources configured
>>
>> Online: [ node01 node02 ]
>>
>> ClusterIP (ocf::heartbeat:IPaddr2): Started node02
>> Master/Slave Set: WebDataClone [WebData]
>>     Masters: [ node02 ]
>>     Slaves: [ node01 ]
>> WebFS (ocf::heartbeat:Filesystem): Started node02
>>
>> Failed actions:
>>     WebSite_start_0 on node02 'unknown error' (1): call=278, status=Timed Out, last-rc-change='Thu Oct 16 17:26:28 2014', queued=2ms, exec=0ms
>>     WebSite_start_0 on node01 'unknown error' (1): call=203, status=Timed Out, last-rc-change='Thu Oct 16 17:26:09 2014', queued=2ms, exec=0ms
>>
>> It seems the Apache WebSite resource isn't starting up. Apache was working just fine before I configured DRBD. What did I do wrong?
>>
>> --
>> Goi Sihan
>> gois...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
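By Section 7.4 of the guide, WebSite is expected to be tied to WebFS, and WebFS to the DRBD master, with constraints roughly like the sketch below (resource names are taken from the crm_mon output above; the constraint IDs are illustrative, so compare against the output of crm configure show). A WebSite start that times out on both nodes often means Apache is being started before its DocumentRoot on the DRBD-backed filesystem is available.

    # WebFS only where the DRBD clone is Master, and only after it is promoted
    crm configure colocation fs_on_drbd inf: WebFS WebDataClone:Master
    crm configure order WebFS-after-WebData inf: WebDataClone:promote WebFS:start

    # Apache only where the filesystem is mounted, and only after it is mounted
    crm configure colocation WebSite-with-WebFS inf: WebSite WebFS
    crm configure order WebSite-after-WebFS inf: WebFS WebSite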
Re: [Pacemaker] DRBD with Pacemaker on CentOs 6.5
By the way, you want to configure DRBD before you configure Apache. You start from the bottom up: get a fully working platform upon which to build. Make sure that DRBD is working and that fencing is *in place and working*; DON'T SKIP THIS! Then build Apache on top of that.

d.p.

On Thu, Oct 23, 2014 at 1:05 PM, David Pendell losto...@gmail.com wrote:
> Try this. digimer is an expert at what you are trying to do. https://alteeve.ca/w/AN!Cluster_Tutorial_2
>
> On Thu, Oct 23, 2014 at 1:05 PM, David Pendell losto...@gmail.com wrote:
>> Try this. https://alteeve.ca/w/AN!Cluster_Tutorial_2
>>
>> On Wed, Oct 22, 2014 at 8:08 PM, Sihan Goi gois...@gmail.com wrote:
>>> Hi, can anyone help? Really stuck here...
>>> [...]

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
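On the "fencing in place and working" point: for DRBD 8.4 under Pacemaker this usually also means telling DRBD to escalate to the cluster when the replication link is lost. A minimal sketch of the relevant part of a DRBD resource file, assuming the stock handler scripts shipped with DRBD (resource name and script paths may differ on your install):

    resource wwwdata {
      disk {
        fencing resource-and-stonith;   # freeze I/O and fence the peer on disconnect
      }
      handlers {
        # stock DRBD/Pacemaker integration scripts; verify the paths on your system
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
      # ... existing device/disk/address definitions ...
    }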
[Pacemaker] Self-Fence???
I have a two-node CentOS 6.4 based cluster, using Pacemaker 1.1.8 with a cman backend, running primarily libvirt-controlled KVM VMs. For the VMs, I am using clvm volumes for the virtual hard drives and a single gfs2 volume for shared storage of the VM config files and other shared data. For fencing, I use IPMI and an APC master switch to provide redundant fencing. There are location constraints that do not allow the fencing resources to run on their own node. I am *not* using sbd or any other software-based fencing device.

I had a very bizarre situation this morning -- I had one of the nodes powered off. Then the other self-fenced. I thought that was impossible.

Excerpts from the logs:

Mar 28 13:10:01 virtualhost2 stonith-ng[4223]: notice: remote_op_done: Operation reboot of virtualhost2.delta-co.gov by virtualhost1.delta-co.gov for crmd.4...@virtualhost1.delta-co.gov.fc5638ad: Timer expired
[...]

Virtualhost1 was offline, so I expect that line.

[...]
Mar 28 13:13:30 virtualhost2 pengine[4226]: notice: unpack_rsc_op: Preventing p_ns2 from re-starting on virtualhost2.delta-co.gov: operation monitor failed 'not installed' (rc=5)
[...]

If I had a brief interruption of my gfs2 volume, would that show up? And would it be the cause of a fencing operation?

[...]
Mar 28 13:13:30 virtualhost2 pengine[4226]: warning: pe_fence_node: Node virtualhost2.delta-co.gov will be fenced to recover from resource failure(s)
Mar 28 13:13:30 virtualhost2 pengine[4226]: warning: stage6: Scheduling Node virtualhost2.delta-co.gov for STONITH
[...]

Why is it still trying to fence, if all of the fencing resources are offline?

[...]
Mar 28 13:13:30 virtualhost2 crmd[4227]: notice: te_fence_node: Executing reboot fencing operation (43) on virtualhost2.delta-co.gov (timeout=6)
Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: handle_request: Client crmd.4227.9fdec3bd wants to fence (reboot) 'virtualhost2.delta-co.gov' with device '(any)'
[...]

What does "crmd.4227.9fdec3bd" mean? I figure 4227 is a process number, but I don't know what the next number is.

[...]
Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: error: check_alternate_host: No alternate host available to handle complex self fencing request
[...]

Where did that come from?

[...]
Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: check_alternate_host: Peer[1] virtualhost1.delta-co.gov
Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: check_alternate_host: Peer[2] virtualhost2.delta-co.gov
Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for virtualhost2.delta-co.gov: 648ca743-6cda-4c81-9250-21c9109a51b9 (0)
[...]

The next logs are the reboot logs.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
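For reference, "location constraints that do not allow the fencing resources to run on their own node" typically look like the sketch below; this is a generic illustration, not this cluster's actual config, and the resource names, address, and credentials are placeholders.

    # IPMI fencing for virtualhost1, never allowed to run on virtualhost1 itself
    crm configure primitive p_fence_vh1 stonith:fence_ipmilan \
        params pcmk_host_list="virtualhost1.delta-co.gov" \
               ipaddr="192.0.2.11" login="admin" passwd="secret" \
        op monitor interval="60s"
    crm configure location l_fence_vh1_avoid_self p_fence_vh1 -inf: virtualhost1.delta-co.gov

As the reply below explains, such constraints only control where the fencing resources run; stonith-ng can still schedule and attempt a self-fence when the policy engine decides the node must be shot.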
Re: [Pacemaker] Self-Fence???
Ok, then. I learned something new. Thanks.

d.p.

On Thu, Mar 28, 2013 at 6:28 PM, Andrew Beekhof and...@beekhof.net wrote:
> On Fri, Mar 29, 2013 at 7:42 AM, David Pendell losto...@gmail.com wrote:
>> I have a two-node CentOS 6.4 based cluster, using Pacemaker 1.1.8 with a cman backend, running primarily libvirt-controlled KVM VMs.
>> [...]
>> I had a very bizarre situation this morning -- I had one of the nodes powered off. Then the other self-fenced. I thought that was impossible.
>
> No. Not when a node is by itself.
>
>> [...]
>> Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: error: check_alternate_host: No alternate host available to handle complex self fencing request
>> [...]
>> Where did that come from?
>
> It was scheduled by the policy engine (because a resource failed to stop, by the looks of it) and, as per the logs above, initiated by the crmd.
>
>> [...]
>> The next logs are the reboot logs.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
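To see exactly which failure led the policy engine to schedule the self-fence, the transition it saved at that moment can be replayed offline. A sketch, assuming the RHEL 6 default location for policy-engine input files; the file name below is an example, and the right file has to be picked by matching its timestamp to the log entries.

    # list the saved transitions around 13:13:30 and replay the matching one
    ls -lt /var/lib/pengine/ | head
    crm_simulate --simulate --xml-file /var/lib/pengine/pe-warn-0.bz2   # example file name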
Re: [Pacemaker] Live migration question.
I'll try those; thanks.

d.p.

On Sun, Jul 29, 2012 at 9:02 PM, Andrew Beekhof and...@beekhof.net wrote:
> On Thu, Jul 12, 2012 at 2:26 PM, David Pendell losto...@gmail.com wrote:
>> I have two cluster nodes that have a gigabit network between them for doing live migrations of running KVM VMs. If one of the two hosts goes offline, naturally all of the guests get restarted on the other host. But when the offline host comes back online, all of the guests that were restarted on the online host try to do a live migration. Given that I only have one gigabit link for doing the transfers, this creates a logjam. The result is that the VMs that time out then do a "move" instead: they are shut down and restarted on the newly online node. For my Linux guests this is annoying, but with a Windows VM it is a disaster, locking a very impatient department out of their server for 3-4 minutes. (One might think that this is a miracle, given that before I set up the cluster a server reboot would take ten minutes. Sigh.)
>>
>> I would like to make the migrations sequential so that the Windows VM can migrate first, then the next most important Linux VMs, and so on. Is there any way to do this? Beekhof suggested that there were several alternatives and that the mailing list would be the best place to ask.
>
> You could try creating an ordering constraint between the two VMs, or setting batch-limit really small. I think there were some other options too.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
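A sketch of those two suggestions, using hypothetical VM resource names (vm_windows, vm_linux1): an advisory ordering so the Windows guest is handled first without tying the VMs' lifecycles together, and a small batch-limit so the transition engine does not fire every migration at once.

    # advisory (score 0) ordering: prefer to act on vm_windows before vm_linux1
    crm configure order o_windows_first 0: vm_windows vm_linux1

    # limit how many actions the cluster executes in parallel
    crm configure property batch-limit=2

A mandatory (inf:) ordering would also work but makes vm_linux1 depend on vm_windows being active, which is usually not what you want for otherwise independent guests.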
[Pacemaker] Live migration question.
I have two cluster nodes that have a gigabit network between them for doing live migrations of running KVM VMs. If one of the two hosts goes offline, naturally all of the guests get restarted on the other host. But when the offline host comes back online, all of the guests that were restarted on the online host try to do a live migration. Given that I only have one gigabit link for doing the transfers, this creates a logjam. The result is that the VMs that time out then do a "move" instead: they are shut down and restarted on the newly online node. For my Linux guests this is annoying, but with a Windows VM it is a disaster, locking a very impatient department out of their server for 3-4 minutes. (One might think that this is a miracle, given that before I set up the cluster a server reboot would take ten minutes. Sigh.)

I would like to make the migrations sequential so that the Windows VM can migrate first, then the next most important Linux VMs, and so on. Is there any way to do this? Beekhof suggested that there were several alternatives and that the mailing list would be the best place to ask.

lostogre

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
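Another knob worth checking, if the Pacemaker version in use supports it, is the migration-limit cluster property, which caps how many live migrations run in parallel per node; combined with a priority meta attribute it lets the most important guest be dealt with first. A sketch with hypothetical resource names:

    # at most one live migration in flight per node (if your version supports it)
    crm configure property migration-limit=1

    # allocate/recover the Windows guest before the others
    crm resource meta vm_windows set priority 100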