[ClusterLabs] Antw: DRBD and SSD TRIM - Slow!
Hi!

I know little about trim operations, but you could try one of these:

1) iotop, to see whether some I/O is done during trimming (assuming trimming itself is not counted as I/O).
2) blktrace on the affected devices, to see what's going on. It's hard to set up and to extract the information you are looking for, but it provides deep insight.
3) Watch /sys/block/$BDEV/stat for performance statistics. I don't know how well DRBD supports these, however (e.g. MD RAID shows no wait times and no busy operations, while a multipath map has it all).

Regards,
Ulrich

>>> Eric Robinson schrieb am 02.08.2017 um 07:09 in Nachricht:
> Does anyone know why trimming a filesystem mounted on a DRBD volume takes so
> long? I mean like three days to trim a 1.2TB filesystem.
>
> Here are some pertinent details:
>
> OS: SLES 12 SP2
> Kernel: 4.4.74-92.29
> Drives: 6 x Samsung SSD 840 Pro 512GB
> RAID: 0 (mdraid)
> DRBD: 9.0.8
> Protocol: C
> Network: Gigabit
> Utilization: 10%
> Latency: < 1ms
> Loss: 0%
> Iperf test: 900 Mbits/sec
>
> When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches).
> When I trim a non-DRBD partition, it completes fast.
> When I write to a DRBD volume, I get 80MB/sec.
> When I trim a DRBD volume, it takes bloody ages!
>
> --
> Eric Robinson

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
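Ulrich's third suggestion can be scripted. A minimal sketch, assuming the classic 11-field layout of /sys/block/$BDEV/stat (fields 1, 5 and 10 are reads completed, writes completed, and milliseconds spent doing I/O); the device name "drbd0" and the numbers below are made-up illustrations, not real measurements:

```shell
# Pick out a few counters from a block device's stat file.
# Hypothetical sample values; on a live system you would use:
#   stat=$(cat /sys/block/drbd0/stat)
stat="8320 120 66560 400 4160 35 33280 9000 0 7400 9400"

# Word-split the line into positional parameters $1..$11.
set -- $stat
echo "reads completed:    $1"
echo "writes completed:   $5"
echo "ms spent doing I/O: ${10}"
```

Sampling this periodically (e.g. in a watch loop) during the trim would show whether the device is busy or mostly idle while fstrim crawls.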
[ClusterLabs] DRBD and SSD TRIM - Slow!
Does anyone know why trimming a filesystem mounted on a DRBD volume takes so long? I mean like three days to trim a 1.2TB filesystem.

Here are some pertinent details:

OS: SLES 12 SP2
Kernel: 4.4.74-92.29
Drives: 6 x Samsung SSD 840 Pro 512GB
RAID: 0 (mdraid)
DRBD: 9.0.8
Protocol: C
Network: Gigabit
Utilization: 10%
Latency: < 1ms
Loss: 0%
Iperf test: 900 Mbits/sec

When I write to a non-DRBD partition, I get 400MB/sec (bypassing caches).
When I trim a non-DRBD partition, it completes fast.
When I write to a DRBD volume, I get 80MB/sec.
When I trim a DRBD volume, it takes bloody ages!

--
Eric Robinson
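The two comparisons above can be reproduced with standard tools. A sketch under assumed paths (the mount point and file name are placeholders, not the poster's actual layout); oflag=direct bypasses the page cache, matching the "bypassing caches" figure:

```shell
# Sequential write speed with the page cache bypassed (dd reports MB/s).
dd if=/dev/zero of=/mnt/drbd/testfile bs=1M count=1024 oflag=direct

# Wall-clock time of a trim pass; -v prints how many bytes were discarded.
time fstrim -v /mnt/drbd
```

Running the same two commands against a filesystem on the raw MD array versus one on the DRBD device isolates DRBD's contribution to both numbers.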
[ClusterLabs] from where does the default value for start/stop op of a resource come ?
Hi,

i'm wondering where the default values for the operations of a resource come from. I tried to configure:

crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params drbd_resource=idcc-devel \
> op monitor interval=60
WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the advised 240
WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the advised 100
WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it may not be supported by the RA

Where does the default timeout of 20s come from? My config does not have it in its op_defaults section. Is it hardcoded? All the timeouts i found in my config were explicitly related to a dedicated resource. What are the values of the hardcoded defaults? Does that also mean that what the description of the RA calls a "default" isn't a default, but just a recommendation?

crm(live)# ra info ocf:linbit:drbd
...
Operations' defaults (advisory minimum):

    start          timeout=240
    promote        timeout=90
    demote         timeout=90
    notify         timeout=90
    stop           timeout=100
    monitor_Slave  timeout=20 interval=20
    monitor_Master timeout=20 interval=10

So this is not applied by default? Is it explicitly necessary to configure start/stop operations and the related timeouts? What happens if i don't do that - do i run with default values i don't know?

Bernd

--
Bernd Lentes

Systemadministration
institute of developmental genetics
Gebäude 35.34 - Raum 208
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 (0)89 3187 1241
fax: +49 (0)89 3187 2294

no backup - no mercy

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671
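The warnings in the thread go away once the operations are declared explicitly at the RA's advised minimums. A sketch in crm shell syntax, reusing the resource name from the thread (a config fragment under stated assumptions, not a tested configuration):

```shell
crm configure primitive prim_drbd_idcc_devel ocf:linbit:drbd \
    params drbd_resource=idcc-devel \
    op start timeout=240 \
    op stop timeout=100 \
    op monitor interval=20 role=Slave timeout=20 \
    op monitor interval=10 role=Master timeout=20
```

Without such explicit op definitions, Pacemaker falls back to its cluster-wide defaults (or op_defaults, if set) rather than to the values advertised in the RA meta-data, which are advisory only.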
Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On 2017-08-01 03:05, Stephen Carville (HA List) wrote:

> Can clustering even be done reliably on CentOS 6? I have no objection to
> moving to 7 but I was hoping I could get this up quicker than building
> out a bunch of new balancers.

I have a number of CentOS 6 active/passive pairs running heartbeat R1. However, I've been doing it for some time and have a collection of mon scripts for them -- you'd have to roll your own.

> the duplicate IP to its own eth0. I probably do not need to tell you
> the mischief that can cause if these were production servers.

Really? 'Cause over here it starts with "checking if ip already exists on the network", and one of them is supposed to fail there.

Dima
Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails to reboot/on/off
Hey Marek,

I've run the command with --action off and uploaded the file on one of our servers: https://cloud.iwgate.com/index.php/s/1SpZlG8mBSR1dNE

Interesting thing is that at the end of the file I found "Unable to connect/login to fencing device" instead of "Failed: Timed out waiting to power OFF".

As information about my test rig:

Host OS: VMware ESXi 6.5 Hypervisor
Guest OS: CentOS 7.3.1611 minimal with the latest updates

Fence agents installed with yum:

fence-agents-hpblade-4.0.11-47.el7_3.5.x86_64
fence-agents-rsa-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo-moonshot-4.0.11-47.el7_3.5.x86_64
fence-agents-rhevm-4.0.11-47.el7_3.5.x86_64
fence-virt-0.3.2-5.el7.x86_64
fence-agents-mpath-4.0.11-47.el7_3.5.x86_64
fence-agents-ibmblade-4.0.11-47.el7_3.5.x86_64
fence-agents-ipdu-4.0.11-47.el7_3.5.x86_64
fence-agents-common-4.0.11-47.el7_3.5.x86_64
fence-agents-rsb-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo-ssh-4.0.11-47.el7_3.5.x86_64
fence-agents-bladecenter-4.0.11-47.el7_3.5.x86_64
fence-agents-drac5-4.0.11-47.el7_3.5.x86_64
fence-agents-brocade-4.0.11-47.el7_3.5.x86_64
fence-agents-wti-4.0.11-47.el7_3.5.x86_64
fence-agents-compute-4.0.11-47.el7_3.5.x86_64
fence-agents-eps-4.0.11-47.el7_3.5.x86_64
fence-agents-cisco-ucs-4.0.11-47.el7_3.5.x86_64
fence-agents-intelmodular-4.0.11-47.el7_3.5.x86_64
fence-agents-eaton-snmp-4.0.11-47.el7_3.5.x86_64
fence-agents-cisco-mds-4.0.11-47.el7_3.5.x86_64
fence-agents-apc-snmp-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo2-4.0.11-47.el7_3.5.x86_64
fence-agents-all-4.0.11-47.el7_3.5.x86_64
fence-agents-vmware-soap-4.0.11-47.el7_3.5.x86_64
fence-agents-ilo-mp-4.0.11-47.el7_3.5.x86_64
fence-agents-apc-4.0.11-47.el7_3.5.x86_64
fence-agents-emerson-4.0.11-47.el7_3.5.x86_64
fence-agents-ipmilan-4.0.11-47.el7_3.5.x86_64
fence-agents-ifmib-4.0.11-47.el7_3.5.x86_64
fence-agents-kdump-4.0.11-47.el7_3.5.x86_64
fence-agents-scsi-4.0.11-47.el7_3.5.x86_64

Thank you

On Tue, Aug 1, 2017 at 2:22 PM, Marek Grac wrote:

> Hi,
>
>> But when I call any of the power actions (on, off, reboot) I get "Failed:
>> Timed out waiting to power OFF".
>>
>> I've tried with all the combinations of --power-timeout and --power-wait
>> and same error without any change in the response time.
>>
>> Any ideas from where or how to fix this issue ?
>
> No, you have used the right options and if they were high enough it should
> work. You can try to post verbose (anonymized) output and we can take a
> look at it more deeply.
>
>> I suspect "power off" is actually a virtual press of the ACPI power
>> button (reboot likewise), so your VM tries to shut down cleanly. That could
>> take time, and it could hang (I guess). I don't use VMware, but maybe
>> there's a "reset" action that presses the virtual reset button of the
>> virtual hardware... ;-)
>
> There should not be a fence agent that will do a soft reboot. The 'reset'
> action does power off / check status / power on, so we are sure that the
> machine was really down (of course unless --method cycle, when the 'reboot'
> button is used).
>
> m,
Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On Tue, Aug 1, 2017 at 2:05 AM, Stephen Carville (HA List) <62d2a...@opayq.com> wrote:

> On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:
>> I guess you have no fencing configured, right?
>
> No. I didn't realize it was necessary unless there was shared storage
> involved. I guess it is time to go back to the drawing board. Can
> clustering even be done reliably on CentOS 6?

Yes, it can. I have a number of CentOS 6 clusters running with corosync and pacemaker, and CentOS 6, while obviously not the latest version, is still maintained and will be for at least a couple more years.

But yes, you have to have fencing to have a cluster. I believe there is a way to manually tell one node of the cluster that the other node has been reset (using stonith_admin, I think), but without fencing you are likely to end up in a state where you have to manually reset things to get the cluster going again any time something goes wrong, which is not exactly the high availability that you build a cluster for in the first place.

--Greg
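The manual acknowledgement Greg has in mind is stonith_admin's confirm option; a hedged sketch (the node name is a placeholder from the thread, and exact option spelling may vary by Pacemaker version):

```shell
# Tell the cluster that this node has been verified as down, so it can be
# treated as fenced. Only run this after the node really is powered off --
# confirming a node that is still alive invites split-brain.
stonith_admin --confirm scahadev01db
```

This is a stopgap for recovery, not a substitute for configuring a real fence device.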
[ClusterLabs] Clusterlabs Summit 2017 (Sept. 6-7 in Nuremberg) - One month left!
Hey everyone!

Here's a quick update for the upcoming ClusterLabs Summit at the SUSE office in Nuremberg in September:

The time to register for the pool of hotel rooms has now expired -- we have sent the final list of names to the hotel. There may still be hotel rooms available at the Sorat Saxx or other hotels in Nuremberg, so if anyone missed the deadline and still needs a room, either contact me or feel free to contact the hotel directly. The same goes for any changes, for those who have reservations: please either contact me, or contact the hotel directly at i...@saxx-nuernberg.de.

The schedule is being sorted out right now, and the planning wiki will be updated with a preliminary schedule soon. If there is anyone who would like to present on a topic, or would like to discuss a topic that isn't on the wiki yet, now is the time to add it there.

Other than that, I have no further remarks except to wish everyone welcome to Nuremberg in a month! Feel free to contact me with any concerns or issues related to the summit, and I'll do what I can to help out.

Cheers,
Kristoffer

--
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails to reboot/on/off
Hi,

> But when I call any of the power actions (on, off, reboot) I get "Failed:
> Timed out waiting to power OFF".
>
> I've tried with all the combinations of --power-timeout and --power-wait
> and same error without any change in the response time.
>
> Any ideas from where or how to fix this issue ?

No, you have used the right options, and if they were high enough it should work. You can try to post verbose (anonymized) output and we can take a look at it more deeply.

> I suspect "power off" is actually a virtual press of the ACPI power button
> (reboot likewise), so your VM tries to shut down cleanly. That could take
> time, and it could hang (I guess). I don't use VMware, but maybe there's a
> "reset" action that presses the virtual reset button of the virtual
> hardware... ;-)

There should not be a fence agent that will do a soft reboot. The 'reset' action does power off / check status / power on, so we are sure that the machine was really down (of course unless --method cycle, when the 'reboot' button is used).

m,
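Marek's two reboot behaviours map onto the agent's --method option. A hedged sketch reusing the host, credentials, and UUID already shown in this thread (all placeholders for a real setup):

```shell
# Default reboot: power off, verify the off state, then power on.
fence_vmware_soap --ssl --ip esxi_ip --username root --password pass \
    --plug "564d5bce-3c55-2b02-1a8b-052c1fd24d6d" --action reboot

# --method cycle uses the device's own reboot operation instead, so the
# intermediate off state is not independently verified.
fence_vmware_soap --ssl --ip esxi_ip --username root --password pass \
    --plug "564d5bce-3c55-2b02-1a8b-052c1fd24d6d" --action reboot --method cycle
```

The default (off/status/on) is the safer choice for fencing, since the cluster must be certain the victim was really down.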
Re: [ClusterLabs] Antw: DRBD AND cLVM ???
----- On Aug 1, 2017, at 8:06 AM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

>>> "Lentes, Bernd" schrieb am 31.07.2017 um 18:51 in Nachricht
>>> <641329685.12981098.1501519915026.javamail.zim...@helmholtz-muenchen.de>:
>> Hi,
>>
>> i'm currently a bit confused. I have several resources running as
>> VirtualDomains; the VMs reside on plain logical volumes without a fs, and
>> these LVs reside themselves on a FC SAN. In that scenario i need cLVM to
>> distribute the LVM metadata between the nodes.
>>
>> For playing around a bit and getting used to it i created a DRBD partition.
>> It resides on a logical volume (one on each node), which should be possible
>> following the documentation on linbit. The LVs reside each on a node on the
>> local storage, not on the SAN (which would be a very strange configuration).
>
> So you use cLVM to create local VGs, and you use DRBD to sync the local LVs?
> Why don't you use the shared SAN?

I use it too. I just want to deal with DRBD and learn about it.

>> But nevertheless it's a cLVM configuration. I don't think it's possible to
>> have a cLVM and non-cLVM configuration at the same time on the same node.
>
> You can definitely have clustered and non-clustered VGs on one node.
>
>> Is that possible what i try to do ?
>
> I'm still wondering what you really want to achieve.

See above.

> Regards,
> Ulrich

Bernd
Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails to reboot/on/off
Hello Ulrich,

Thank you for the reply. I tested that, and the reset action also fails with the same message. I forgot to mention that the VM guests are CentOS 7.3 and they power off in about 2 seconds, and a full reboot takes about 10 seconds. Also, in VMware I see the SOAP task for "get id for UUID", but the command for power is not there.

Regards,
Octavian

On Aug 1, 2017 9:12 AM, "Ulrich Windl" wrote:

>>> Octavian Ciobanu schrieb am 31.07.2017 um 20:16 in Nachricht:
> Hello,
>
> Before I implement the cluster I'm testing the fence agents, and I got stuck
> at rebooting the VMware-based VMs.
>
> I have installed VMware ESXi 6.5 Hypervisor with 5 VMs.
>
> If I call:
> # fence_vmware_soap --ssl --ip esxi_ip --username root --password pass
>   --action list
> I get the list with the names and UUIDs of the VMs.
>
> If I call:
> # fence_vmware_soap --ssl --ip esxi_ip --username root --password pass
>   --action status --plug "564d5bce-3c55-2b02-1a8b-052c1fd24d6d"
> I get the status of the VM.
>
> But when I call any of the power actions (on, off, reboot) I get "Failed:
> Timed out waiting to power OFF".
>
> I've tried with all the combinations of --power-timeout and --power-wait
> and same error without any change in the response time.
>
> Any ideas from where or how to fix this issue ?

I suspect "power off" is actually a virtual press of the ACPI power button (reboot likewise), so your VM tries to shut down cleanly. That could take time, and it could hang (I guess). I don't use VMware, but maybe there's a "reset" action that presses the virtual reset button of the virtual hardware... ;-)

Regards,
Ulrich

> Thank you in advance.
[ClusterLabs] Antw: Re: Antw: After reboot each node thinks the other is offline.
>>> "Stephen Carville (HA List)" <62d2a...@opayq.com> schrieb am 01.08.2017 um 10:05 in Nachricht:
> On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:
>> If a node thinks the other is unexpectedly offline, it will fence it, and
>> then it will be offline! Thus the IP can't run there. I guess you have no
>> fencing configured, right?
>
> No. I didn't realize it was necessary unless there was shared storage
> involved. I guess it is time to go back to the drawing board. Can
> clustering even be done reliably on CentOS 6? I have no objection to
> moving to 7 but I was hoping I could get this up quicker than building
> out a bunch of new balancers.
>
> On a related note: I tried rebooting both nodes and each node still
> thinks the other is offline. For future reference is there a way to
> clear that?

If you start both nodes (and wait for a while), both nodes should appear as online (on each node). If that does not happen, there may be some communication or configuration problem. Before investing much time in the old version, I'd go forward to the current OS (personal preference)...

Regards,
Ulrich
Re: [ClusterLabs] Antw: After reboot each node thinks the other is offline.
On 07/31/2017 11:13 PM, Ulrich Windl [Masked] wrote:

>> I am experimenting with pacemaker for high availability for some load
>> balancers. I was able to successfully get two CentOS (6.9) machines
>> (scahadev01da and scahadev01db) to form a cluster, and the shared IP was
>> assigned to scahadev01da. I simulated a failure by halting the primary,
>> and the secondary eventually noticed, bringing up the shared IP on its
>> eth0. So far, so good.
>>
>> A problem arises when the primary comes back up and, for some reason,
>> each node thinks the other is offline. This leads to both nodes adding
>
> If a node thinks the other is unexpectedly offline, it will fence it, and
> then it will be offline! Thus the IP can't run there. I guess you have no
> fencing configured, right?

No. I didn't realize it was necessary unless there was shared storage involved. I guess it is time to go back to the drawing board. Can clustering even be done reliably on CentOS 6? I have no objection to moving to 7, but I was hoping I could get this up quicker than building out a bunch of new balancers.

On a related note: I tried rebooting both nodes and each node still thinks the other is offline. For future reference, is there a way to clear that?

> Regards,
> Ulrich
>
>> the duplicate IP to its own eth0. I probably do not need to tell you
>> the mischief that can cause if these were production servers.
>>
>> I tried restarting cman, pcsd and pacemaker on both machines with no
>> effect on the situation.
>>
>> I've found several mentions of it in the search engines but I've been
>> unable to find how to fix it. Any help is appreciated.
>>
>> Both nodes have quorum disabled in /etc/sysconfig/cman:
>>
>> CMAN_QUORUM_TIMEOUT=0
>>
>> # Node 1
>>
>> scahadev01da# sudo pcs status
>> Cluster name: scahadev01d
>> Stack: cman
>> Current DC: scahadev01da (version 1.1.15-5.el6-e174ec8) - partition WITHOUT quorum
>> Last updated: Mon Jul 31 10:43:54 2017
>> Last change: Mon Jul 31 10:30:46 2017 by root via cibadmin on scahadev01da
>>
>> 2 nodes and 1 resource configured
>>
>> Online: [ scahadev01da ]
>> OFFLINE: [ scahadev01db ]
>>
>> Full list of resources:
>>
>>  VirtualIP (ocf::heartbeat:IPaddr2): Started scahadev01da
>>
>> Daemon Status:
>>   cman: active/enabled
>>   corosync: active/disabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>> # Node 2
>>
>> scahadev01db ~]$ sudo pcs status
>> Cluster name: scahadev01d
>> Stack: cman
>> Current DC: scahadev01db (version 1.1.15-5.el6-e174ec8) - partition WITHOUT quorum
>> Last updated: Mon Jul 31 10:43:47 2017
>> Last change: Sat Jul 29 13:45:15 2017 by root via cibadmin on scahadev01da
>>
>> 2 nodes and 1 resource configured
>>
>> Online: [ scahadev01db ]
>> OFFLINE: [ scahadev01da ]
>>
>> Full list of resources:
>>
>>  VirtualIP (ocf::heartbeat:IPaddr2): Started scahadev01db
>>
>> Daemon Status:
>>   cman: active/enabled
>>   corosync: active/disabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>> --
>> Stephen Carville