[ClusterLabs] Antw: DRBD and SSD TRIM - Slow!
Hi!

I know little about trim operations, but you could try one of these:

1) iotop, to see whether some I/O is done during trimming (assuming trimming itself is not counted as I/O).
2) blktrace on the affected devices, to see what's going on. It's hard to set up and to extract the information you are looking for, but it provides deep insight.
3) Watch /sys/block/$BDEV/stat for performance statistics. I don't know how well DRBD supports these, however (e.g. MD RAID shows no wait times and no busy operations, while a multipath map has it all).

Regards,
Ulrich

>>> Eric Robinson schrieb am 02.08.2017 um 07:09 in Nachricht:
> Does anyone know why trimming a filesystem mounted on a DRBD volume takes so
> long? I mean like three days to trim a 1.2TB filesystem.
>
> Here are some pertinent details:
>
> OS: SLES 12 SP2
> Kernel: 4.4.74-92.29
> Drives: 6 x Samsung SSD 840 Pro 512GB
> RAID: 0 (mdraid)
> DRBD: 9.0.8
> Protocol: C
> Network: Gigabit
> Utilization: 10%
> Latency: < 1ms
> Loss: 0%
> Iperf test: 900 Mbits/sec
>
> When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches).
> When I trim a non-DRBD partition, it completes fast.
> When I write to a DRBD volume, I get 80MB/sec.
> When I trim a DRBD volume, it takes bloody ages!
>
> --
> Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
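Ulrich's third suggestion can be scripted. A minimal sketch, assuming the classic 11-field layout of /sys/block/$BDEV/stat (fields 1, 5 and 10 are reads completed, writes completed, and milliseconds spent doing I/O); the device name "drbd0" and the numbers below are made-up illustrations, not real measurements:

```shell
# Pick out a few counters from a block device's stat file.
# Hypothetical sample values; on a live system you would use:
#   stat=$(cat /sys/block/drbd0/stat)
stat="8320 120 66560 400 4160 35 33280 9000 0 7400 9400"

# Word-split the line into positional parameters $1..$11.
set -- $stat
echo "reads completed:    $1"
echo "writes completed:   $5"
echo "ms spent doing I/O: ${10}"
```

Sampling this periodically (e.g. in a watch loop) during the trim would show whether the device is busy or mostly idle while fstrim crawls.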
[ClusterLabs] DRBD and SSD TRIM - Slow!
Does anyone know why trimming a filesystem mounted on a DRBD volume takes so long? I mean like three days to trim a 1.2TB filesystem.

Here are some pertinent details:

OS: SLES 12 SP2
Kernel: 4.4.74-92.29
Drives: 6 x Samsung SSD 840 Pro 512GB
RAID: 0 (mdraid)
DRBD: 9.0.8
Protocol: C
Network: Gigabit
Utilization: 10%
Latency: < 1ms
Loss: 0%
Iperf test: 900 Mbits/sec

When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches).
When I trim a non-DRBD partition, it completes fast.
When I write to a DRBD volume, I get 80MB/sec.
When I trim a DRBD volume, it takes bloody ages!

--
Eric Robinson
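The two comparisons above can be reproduced with standard tools. A sketch under assumed paths (the mount point and file name are placeholders, not the poster's actual layout); oflag=direct bypasses the page cache, matching the "bypassing caches" figure:

```shell
# Sequential write speed with the page cache bypassed (dd reports MB/s).
dd if=/dev/zero of=/mnt/drbd/testfile bs=1M count=1024 oflag=direct

# Wall-clock time of a trim pass; -v prints how many bytes were discarded.
time fstrim -v /mnt/drbd
```

Running the same two commands against a filesystem on the raw MD array versus one on the DRBD device isolates DRBD's contribution to both numbers.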
[ClusterLabs] from where does the default value for start/stop op of a resource come ?
Hi,

i'm wondering where the default values for the operations of a resource come from. I tried to configure:

crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params drbd_resource=idcc-devel \
> op monitor interval=60
WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the advised 240
WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the advised 100
WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it may not be supported by the RA

Where does the default timeout of 20s come from? My config does not have it in its op_defaults section. Is it hardcoded? All the timeouts i found in my config were explicitly related to a dedicated resource. What are the values of the hardcoded defaults? Does that also mean that what the description of the RA calls a "default" isn't a default, but just a recommendation?

crm(live)# ra info ocf:linbit:drbd
...
Operations' defaults (advisory minimum):

    start          timeout=240
    promote        timeout=90
    demote         timeout=90
    notify         timeout=90
    stop           timeout=100
    monitor_Slave  timeout=20 interval=20
    monitor_Master timeout=20 interval=10

So this is not applied by default? Is it explicitly necessary to configure start/stop operations and the related timeouts? What happens if i don't do that - do i run with default values i don't know?

Bernd

--
Bernd Lentes

Systemadministration
institute of developmental genetics
Gebäude 35.34 - Raum 208
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 (0)89 3187 1241
fax: +49 (0)89 3187 2294

no backup - no mercy

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671
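The warnings in the thread go away once the operations are declared explicitly at the RA's advised minimums. A sketch in crm shell syntax, reusing the resource name from the thread (a config fragment under stated assumptions, not a tested configuration):

```shell
crm configure primitive prim_drbd_idcc_devel ocf:linbit:drbd \
    params drbd_resource=idcc-devel \
    op start timeout=240 \
    op stop timeout=100 \
    op monitor interval=20 role=Slave timeout=20 \
    op monitor interval=10 role=Master timeout=20
```

Without such explicit op definitions, Pacemaker falls back to its cluster-wide defaults (or op_defaults, if set) rather than to the values advertised in the RA meta-data, which are advisory only.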
Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On 2017-08-01 03:05, Stephen Carville (HA List) wrote:

> Can clustering even be done reliably on CentOS 6? I have no objection to
> moving to 7 but I was hoping I could get this up quicker than building
> out a bunch of new balancers.

I have a number of CentOS 6 active/passive pairs running heartbeat R1. However, I've been doing it for some time and have a collection of mon scripts for them -- you'd have to roll your own.

> the duplicate IP to its own eth0. I probably do not need to tell you
> the mischief that can cause if these were production servers.

Really? 'Cause over here it starts with "checking if ip already exists on the network", and one of them is supposed to fail there.

Dima
Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails to reboot/on/off
Hey Marek,

I've run the command with --action off and uploaded the file on one of our servers: https://cloud.iwgate.com/index.php/s/1SpZlG8mBSR1dNE

Interesting thing is that at the end of the file I found "Unable to connect/login to fencing device" instead of "Failed: Timed out waiting to power OFF".

As information about my test rig:

Host OS: VMware ESXi 6.5 Hypervisor
Guest OS: CentOS 7.3.1611 minimal with the latest updates

Fence agents installed with yum:

fence-agents-hpblade-4.0.11-47.el7_3.5.x86_64
fence-agents-rsa-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo-moonshot-4.0.11-47.el7_3.5.x86_64
fence-agents-rhevm-4.0.11-47.el7_3.5.x86_64
fence-virt-0.3.2-5.el7.x86_64
fence-agents-mpath-4.0.11-47.el7_3.5.x86_64
fence-agents-ibmblade-4.0.11-47.el7_3.5.x86_64
fence-agents-ipdu-4.0.11-47.el7_3.5.x86_64
fence-agents-common-4.0.11-47.el7_3.5.x86_64
fence-agents-rsb-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo-ssh-4.0.11-47.el7_3.5.x86_64
fence-agents-bladecenter-4.0.11-47.el7_3.5.x86_64
fence-agents-drac5-4.0.11-47.el7_3.5.x86_64
fence-agents-brocade-4.0.11-47.el7_3.5.x86_64
fence-agents-wti-4.0.11-47.el7_3.5.x86_64
fence-agents-compute-4.0.11-47.el7_3.5.x86_64
fence-agents-eps-4.0.11-47.el7_3.5.x86_64
fence-agents-cisco-ucs-4.0.11-47.el7_3.5.x86_64
fence-agents-intelmodular-4.0.11-47.el7_3.5.x86_64
fence-agents-eaton-snmp-4.0.11-47.el7_3.5.x86_64
fence-agents-cisco-mds-4.0.11-47.el7_3.5.x86_64
fence-agents-apc-snmp-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo2-4.0.11-47.el7_3.5.x86_64
fence-agents-all-4.0.11-47.el7_3.5.x86_64
fence-agents-vmware-soap-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo-mp-4.0.11-47.el7_3.5.x86_64
fence-agents-apc-4.0.11-47.el7_3.5.x86_64
fence-agents-emerson-4.0.11-47.el7_3.5.x86_64
fence-agents-ipmilan-4.0.11-47.el7_3.5.x86_64
fence-agents-ifmib-4.0.11-47.el7_3.5.x86_64
fence-agents-kdump-4.0.11-47.el7_3.5.x86_64
fence-agents-scsi-4.0.11-47.el7_3.5.x86_64

Thank you

On Tue, Aug 1, 2017 at 2:22 PM, Marek Grac wrote:

> Hi,
>
>> But when I call any of the power actions (on, off, reboot) I get "Failed:
>> Timed out waiting to power OFF".
>>
>> I've tried with all the combinations of --power-timeout and --power-wait
>> and same error without any change in the response time.
>>
>> Any ideas from where or how to fix this issue ?
>
> No, you have used the right options and if they were high enough it should
> work. You can try to post verbose (anonymized) output and we can take a
> look at it more deeply.
>
>> I suspect "power off" is actually a virtual press of the ACPI power
>> button (reboot likewise), so your VM tries to shut down cleanly. That could
>> take time, and it could hang (I guess). I don't use VMware, but maybe
>> there's a "reset" action that presses the virtual reset button of the
>> virtual hardware... ;-)
>
> There should not be a fence agent that will do a soft reboot. The 'reset'
> action does power off / check status / power on, so we are sure that the
> machine was really down (of course unless --method cycle, when the 'reboot'
> button is used).
>
> m,
Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On Tue, Aug 1, 2017 at 2:05 AM, Stephen Carville (HA List) <62d2a...@opayq.com> wrote:

> On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:
>> I guess you have no fencing configured, right?
>
> No. I didn't realize it was necessary unless there was shared storage
> involved. I guess it is time to go back to the drawing board. Can
> clustering even be done reliably on CentOS 6?

Yes, it can. I have a number of CentOS 6 clusters running with corosync and pacemaker, and CentOS 6, while obviously not the latest version, is still maintained and will be for at least a couple more years.

But yes, you have to have fencing to have a cluster. I believe there is a way to manually tell one node of the cluster that the other node has been reset (using stonith_admin, I think), but without fencing you are likely to end up in a state where you have to manually reset things to get the cluster going again any time something goes wrong, which is not exactly the high availability that you build a cluster for in the first place.

--Greg
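The manual acknowledgement Greg has in mind is stonith_admin's confirm option; a hedged sketch (the node name is a placeholder from the thread, and exact option spelling may vary by Pacemaker version):

```shell
# Tell the cluster that this node has been verified as down, so it can be
# treated as fenced. Only run this after the node really is powered off --
# confirming a node that is still alive invites split-brain.
stonith_admin --confirm scahadev01db
```

This is a stopgap for recovery, not a substitute for configuring a real fence device.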
[ClusterLabs] Clusterlabs Summit 2017 (Sept. 6-7 in Nuremberg) - One month left!
Hey everyone!

Here's a quick update for the upcoming ClusterLabs Summit at the SUSE office in Nuremberg in September:

The time to register for the pool of hotel rooms has now expired -- we have sent the final list of names to the hotel. There may still be hotel rooms available at the Sorat Saxx or other hotels in Nuremberg, so if anyone missed the deadline and still needs a room, either contact me or feel free to contact the hotel directly. The same goes for any changes, for those who have reservations: please either contact me, or contact the hotel directly at i...@saxx-nuernberg.de.

The schedule is being sorted out right now, and the planning wiki will be updated with a preliminary schedule soon. If there is anyone who would like to present on a topic, or would like to discuss a topic that isn't on the wiki yet, now is the time to add it there.

Other than that, I have no further remarks except to wish everyone welcome to Nuremberg in a month! Feel free to contact me with any concerns or issues related to the summit, and I'll do what I can to help out.

Cheers,
Kristoffer

--
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails to reboot/on/off
Hi,

> But when I call any of the power actions (on, off, reboot) I get "Failed:
> Timed out waiting to power OFF".
>
> I've tried with all the combinations of --power-timeout and --power-wait
> and same error without any change in the response time.
>
> Any ideas from where or how to fix this issue ?

No, you have used the right options, and if they were high enough it should work. You can try to post verbose (anonymized) output and we can take a look at it more deeply.

> I suspect "power off" is actually a virtual press of the ACPI power button
> (reboot likewise), so your VM tries to shut down cleanly. That could take
> time, and it could hang (I guess). I don't use VMware, but maybe there's a
> "reset" action that presses the virtual reset button of the virtual
> hardware... ;-)

There should not be a fence agent that will do a soft reboot. The 'reset' action does power off / check status / power on, so we are sure that the machine was really down (of course unless --method cycle, when the 'reboot' button is used).

m,
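Marek's two reboot behaviours map onto the agent's --method option. A hedged sketch reusing the host, credentials, and UUID already shown in this thread (all placeholders for a real setup):

```shell
# Default reboot: power off, verify the off state, then power on.
fence_vmware_soap --ssl --ip esxi_ip --username root --password pass \
    --plug "564d5bce-3c55-2b02-1a8b-052c1fd24d6d" --action reboot

# --method cycle uses the device's own reboot operation instead, so the
# intermediate off state is not independently verified.
fence_vmware_soap --ssl --ip esxi_ip --username root --password pass \
    --plug "564d5bce-3c55-2b02-1a8b-052c1fd24d6d" --action reboot --method cycle
```

The default (off/status/on) is the safer choice for fencing, since the cluster must be certain the victim was really down.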
Re: [ClusterLabs] Antw: DRBD AND cLVM ???
----- On Aug 1, 2017, at 8:06 AM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

>>> "Lentes, Bernd" schrieb am 31.07.2017 um 18:51 in Nachricht
>>> <641329685.12981098.1501519915026.javamail.zim...@helmholtz-muenchen.de>:
>> Hi,
>>
>> i'm currently a bit confused. I have several resources running as
>> VirtualDomains; the VMs reside on plain logical volumes without a fs, and
>> these LVs reside themselves on a FC SAN. In that scenario i need cLVM to
>> distribute the LVM metadata between the nodes.
>>
>> For playing around a bit and getting used to it i created a DRBD partition.
>> It resides on a logical volume (one on each node), which should be possible
>> following the documentation on linbit. The LVs reside each on a node on the
>> local storage, not on the SAN (which would be a very strange configuration).
>
> So you use cLVM to create local VGs, and you use DRBD to sync the local LVs?
> Why don't you use the shared SAN?

I use it too. I just want to deal with DRBD and learn about it.

>> But nevertheless it's a cLVM configuration. I don't think it's possible to
>> have a cLVM and non-cLVM configuration at the same time on the same node.
>
> You can definitely have clustered and non-clustered VGs on one node.
>
>> Is that possible what i try to do ?
>
> I'm still wondering what you really want to achieve.

See above.

> Regards,
> Ulrich

Bernd
Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails to reboot/on/off
Hello Ulrich,

Thank you for the reply. I tested that, and the reset action also fails with the same message. I forgot to mention that the VM guests are CentOS 7.3 and they power off in about 2 seconds, and a full reboot takes about 10 seconds. Also, in VMware I see the SOAP task for "get id for UUID", but the command for power is not there.

Regards,
Octavian

On Aug 1, 2017 9:12 AM, "Ulrich Windl" wrote:

>>> Octavian Ciobanu schrieb am 31.07.2017 um 20:16 in Nachricht:
> Hello,
>
> Before I implement the cluster I'm testing the fence agents, and I got stuck
> at rebooting the VMware-based VMs.
>
> I have installed VMware ESXi 6.5 Hypervisor with 5 VMs.
>
> If I call:
> # fence_vmware_soap --ssl --ip esxi_ip --username root --password pass
>   --action list
> I get the list with the names and UUIDs of the VMs.
>
> If I call:
> # fence_vmware_soap --ssl --ip esxi_ip --username root --password pass
>   --action status --plug "564d5bce-3c55-2b02-1a8b-052c1fd24d6d"
> I get the status of the VM.
>
> But when I call any of the power actions (on, off, reboot) I get "Failed:
> Timed out waiting to power OFF".
>
> I've tried with all the combinations of --power-timeout and --power-wait
> and same error without any change in the response time.
>
> Any ideas from where or how to fix this issue ?

I suspect "power off" is actually a virtual press of the ACPI power button (reboot likewise), so your VM tries to shut down cleanly. That could take time, and it could hang (I guess). I don't use VMware, but maybe there's a "reset" action that presses the virtual reset button of the virtual hardware... ;-)

Regards,
Ulrich

> Thank you in advance.
[ClusterLabs] Antw: Re: Antw: After reboot each node thinks the other is offline.
>>> "Stephen Carville (HA List)" <62d2a...@opayq.com> schrieb am 01.08.2017 um 10:05 in Nachricht:
> On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:
>> If a node thinks the other is unexpectedly offline, it will fence it, and
>> then it will be offline! Thus the IP can't run there. I guess you have no
>> fencing configured, right?
>
> No. I didn't realize it was necessary unless there was shared storage
> involved. I guess it is time to go back to the drawing board. Can
> clustering even be done reliably on CentOS 6? I have no objection to
> moving to 7 but I was hoping I could get this up quicker than building
> out a bunch of new balancers.
>
> On a related note: I tried rebooting both nodes and each node still
> thinks the other is offline. For future reference is there a way to
> clear that?

If you start both nodes (and wait for a while), both nodes should appear as online (on each node). If that does not happen, there may be some communication or configuration problem. Before investing much time in the old version, I'd go forward to the current OS (personal preference)...

Regards,
Ulrich
Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:

>> I am experimenting with pacemaker for high availability for some load
>> balancers. I was able to successfully get two CentOS (6.9) machines
>> (scahadev01da and scahadev01db) to form a cluster, and the shared IP was
>> assigned to scahadev01da. I simulated a failure by halting the primary,
>> and the secondary eventually noticed, bringing up the shared IP on its
>> eth0. So far, so good.
>>
>> A problem arises when the primary comes back up and, for some reason,
>> each node thinks the other is offline. This leads to both nodes adding
>
> If a node thinks the other is unexpectedly offline, it will fence it, and
> then it will be offline! Thus the IP can't run there. I guess you have no
> fencing configured, right?

No. I didn't realize it was necessary unless there was shared storage involved. I guess it is time to go back to the drawing board. Can clustering even be done reliably on CentOS 6? I have no objection to moving to 7, but I was hoping I could get this up quicker than building out a bunch of new balancers.

On a related note: I tried rebooting both nodes and each node still thinks the other is offline. For future reference, is there a way to clear that?

> Regards,
> Ulrich
>
>> the duplicate IP to its own eth0. I probably do not need to tell you
>> the mischief that can cause if these were production servers.
>>
>> I tried restarting cman, pcsd and pacemaker on both machines with no
>> effect on the situation.
>>
>> I've found several mentions of it in the search engines but I've been
>> unable to find how to fix it. Any help is appreciated.
>>
>> Both nodes have quorum disabled in /etc/sysconfig/cman:
>>
>> CMAN_QUORUM_TIMEOUT=0
>>
>> # Node 1
>>
>> scahadev01da# sudo pcs status
>> Cluster name: scahadev01d
>> Stack: cman
>> Current DC: scahadev01da (version 1.1.15-5.el6-e174ec8) - partition WITHOUT quorum
>> Last updated: Mon Jul 31 10:43:54 2017
>> Last change: Mon Jul 31 10:30:46 2017 by root via cibadmin on scahadev01da
>>
>> 2 nodes and 1 resource configured
>>
>> Online: [ scahadev01da ]
>> OFFLINE: [ scahadev01db ]
>>
>> Full list of resources:
>>
>>  VirtualIP (ocf::heartbeat:IPaddr2): Started scahadev01da
>>
>> Daemon Status:
>>   cman: active/enabled
>>   corosync: active/disabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>> # Node 2
>>
>> scahadev01db ~]$ sudo pcs status
>> Cluster name: scahadev01d
>> Stack: cman
>> Current DC: scahadev01db (version 1.1.15-5.el6-e174ec8) - partition WITHOUT quorum
>> Last updated: Mon Jul 31 10:43:47 2017
>> Last change: Sat Jul 29 13:45:15 2017 by root via cibadmin on scahadev01da
>>
>> 2 nodes and 1 resource configured
>>
>> Online: [ scahadev01db ]
>> OFFLINE: [ scahadev01da ]
>>
>> Full list of resources:
>>
>>  VirtualIP (ocf::heartbeat:IPaddr2): Started scahadev01db
>>
>> Daemon Status:
>>   cman: active/enabled
>>   corosync: active/disabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>> --
>> Stephen Carville