Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Andrei Borzenkov
On Wed, Oct 9, 2019 at 10:59 AM Kadlecsik József wrote: > > Hello, > > The nodes in our cluster have got backend and frontend interfaces: the > former ones are for the storage and cluster (corosync) traffic and the > latter ones are for the public services of KVM guests only. > > One of the nodes

Re: [ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Andrei Borzenkov
On Thu, Oct 10, 2019 at 11:16 AM Ulrich Windl wrote: > > Hi! > > In recent SLES there is "cluster MD", like in > cluster-md-kmp-default-4.12.14-197.18.1.x86_64 > (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). > However I could not find any manual page for it. > > Where

Re: [ClusterLabs] Why is node fenced ?

2019-10-10 Thread Andrei Borzenkov
10.10.2019 18:22, Lentes, Bernd пишет: > HI, > > i have a two node cluster running on SLES 12 SP4. > I did some testing on it. > I put one into standby (ha-idg-2), the other (ha-idg-1) got fenced a few > minutes later because i made a mistake. > ha-idg-2 was DC. ha-idg-1 made a fresh boot and i

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks

2020-02-27 Thread Andrei Borzenkov
28.02.2020 01:55, Ken Gaillot пишет: > On Thu, 2020-02-27 at 22:39 +0300, Andrei Borzenkov wrote: >> 27.02.2020 20:54, Ken Gaillot пишет: >>> On Thu, 2020-02-27 at 18:43 +0100, Jehan-Guillaume de Rorthais >>> wrote: >>>>>> Speaking about s

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks

2020-02-27 Thread Andrei Borzenkov
27.02.2020 20:54, Ken Gaillot wrote: > On Thu, 2020-02-27 at 18:43 +0100, Jehan-Guillaume de Rorthais wrote: Speaking about shutdown, what is the status of clean shutdown of the cluster handled by Pacemaker? Currently, I advise to stop resources gracefully (e.g. using pcs

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Andrei Borzenkov
05.02.2020 20:55, Eric Robinson пишет: > The two servers 001db01a and 001db01b were up and responsive. Neither had > been rebooted and neither were under heavy load. There's no indication in the > logs of loss of network connectivity. Any ideas on why both nodes seem to > think the other one is

Re: [ClusterLabs] multi-site clusters vs disaster recovery clusters

2020-02-05 Thread Andrei Borzenkov
05.02.2020 18:16, Олег Самойлов пишет: > Hi all. > > I am reading the documentation about new (for me) pacemaker, which came with > RedHat 8. > > And I see two different chapters, which both tried to solve exactly the same > problem. > > One is CONFIGURING DISASTER RECOVERY CLUSTERS (pcs dr):

Re: [ClusterLabs] Understanding advisory resource ordering

2020-01-11 Thread Andrei Borzenkov
08.01.2020 17:30, Achim Leitner пишет: > Hi, > > some progress on this issue: > > Am 20.12.19 um 13:37 schrieb Achim Leitner: >> After pacemaker restart, we have Transition 0 with the DRBD actions, >> followed 4s later with Transition 1 including all VM actions with >> correct ordering. 32s

Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6 (Was: Concept of a Shared ipaddress/resource for generic applicatons)

2020-01-11 Thread Andrei Borzenkov
04.01.2020 01:42, Valentin Vidić пишет: > On Thu, Jan 02, 2020 at 09:52:09PM +0100, Jan Pokorný wrote: >> What you've used appears to be akin to what this chunk of manpage >> suggests (amongst others): >> https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man >> >> which is (yet

Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6

2020-01-14 Thread Andrei Borzenkov
14.01.2020 17:47, Jan Pokorný пишет: > On 11/01/20 19:47 +0300, Andrei Borzenkov wrote: >> 04.01.2020 01:42, Valentin Vidić пишет: >>> On Thu, Jan 02, 2020 at 09:52:09PM +0100, Jan Pokorný wrote: >>>> What you've used appears to be akin to what this chunk of manpage

Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin

2020-04-08 Thread Andrei Borzenkov
08.04.2020 10:12, Jan Friesse пишет: > Sherrard, > >> i could not determine which of these sub-threads to include this in, >> so i am going to (reluctantly) top-post it. >> >> i switched the transport to udp, and in limited testing i seem to not >> be hitting the race condition. of course i have

Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin

2020-04-07 Thread Andrei Borzenkov
07.04.2020 00:21, Sherrard Burton пишет: >> >> It looks like some timing issue or race condition. After reboot node >> manages to contact qnetd first, before connection to other node is >> established. Qnetd behaves as documented - it sees two equal size >> partitions and favors the partition that

Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin

2020-04-06 Thread Andrei Borzenkov
06.04.2020 20:57, Sherrard Burton пишет: > > > On 4/6/20 1:20 PM, Sherrard Burton wrote: >> >> >> On 4/6/20 12:35 PM, Andrei Borzenkov wrote: >>> 06.04.2020 17:05, Sherrard Burton пишет: >>>> >>>> from the quorum node: >> .

Re: [ClusterLabs] unable to start fence_scsi on a new add node

2020-04-19 Thread Andrei Borzenkov
16.04.2020 18:58, Stefan Sabolowitsch пишет: > Hi there, > i have expanded a cluster with 2 nodes with an additional one "elastic-03". > However, fence_scsi does not start on the new node. > > pcs-status: > [root@logger cluster]# pcs status > Cluster name: cluster_elastic > Stack: corosync >

Re: [ClusterLabs] fence_mpath and failed IP

2020-03-31 Thread Andrei Borzenkov
31.03.2020 05:56, Ken Gaillot пишет: > On Sat, 2020-02-22 at 03:50 +0200, Strahil Nikolov wrote: >> Hello community, >> >> Recently I have started playing with fence_mpath and I have noticed >> that when the node is fenced, the node is kicked out of the >> cluster (corosync & pacemaker are shut

Re: [ClusterLabs] Is 20 seconds to complete redis switchover to be expected?

2020-03-31 Thread Andrei Borzenkov
31.03.2020 09:27, steven prothero пишет: > Hello, > > I am new with Pacemaker (new to redis also) and appreciate the info shared > here. > > I believe with Redis sentinel a switchover is about 2 seconds. > Reading a post about Pacemaker with Redis, the author said he was > doing it in 3

Re: [ClusterLabs] Merging partitioned two_node cluster?

2020-05-04 Thread Andrei Borzenkov
05.05.2020 06:39, Nickle, Richard пишет: > I have a two node cluster managing a VIP. The service is an SMTP service. > This could be active/active, it doesn't matter which node accepts the SMTP > connection, but I wanted to make sure that a VIP was in place so that there > was a well-known

Re: [ClusterLabs] Merging partitioned two_node cluster?

2020-05-05 Thread Andrei Borzenkov
my base network in 'bindnetaddr' > doesn't account for networks with CIDR mask bits greater than 24? (which > would have non-zero least significant bytes.) > > Thanks, > > Rick > > > > > > On Tue, May 5, 2020 at 12:03 PM Andrei Borzenkov > wrote: >

Re: [ClusterLabs] Merging partitioned two_node cluster?

2020-05-05 Thread Andrei Borzenkov
05.05.2020 16:44, Nickle, Richard пишет: > Thanks Honza and Andrei (and Strahil? I might have missed a message in the > thread...) > Yep, all messages from Strahil end up in spam folder. ___ Manage your subscription:

Re: [ClusterLabs] Coming in Pacemaker 2.0.4: fencing delay based on what resources are where

2020-03-21 Thread Andrei Borzenkov
21.03.2020 20:07, Ken Gaillot пишет: > Hi all, > > I am happy to announce a feature that was discussed on this list a > while back. It will be in Pacemaker 2.0.4 (the first release candidate > is expected in about three weeks). > > A longstanding concern in two-node clusters is that in a split

Re: [ClusterLabs] Avoiding self-fence on RA failure

2020-10-06 Thread Andrei Borzenkov
07.10.2020 06:42, Digimer пишет: > Hi all, > > While developing our program (and not being a production cluster), I > find that when I push broken code to a node, causing the RA to fail to > perform an operation, the node gets fenced. (example below). > > This brings up a question; > > If

Re: [ClusterLabs] ocf:pacemaker:ping every X seconds

2020-10-09 Thread Andrei Borzenkov
09.10.2020 08:21, Rohit Saini пишет: > Hi Team, > I am using ocf:pacemaker:ping resource to check aliveness of a machine > every X seconds. As I understand, monitor interval 'Y' will cause ping to > happen every 'Y' seconds. So, for my case, Y should be equal to X? > I do not see this behavior

Re: [ClusterLabs] Two ethernet adapter within same subnet causing issue on Qdevice

2020-10-06 Thread Andrei Borzenkov
05.10.2020 20:55, Richard Seo пишет: > >> Create host route via specific device. > I've looked over the docs, haven't found a way to do this. I've tried > configuring corosync.conf using the specific ip addresses. Could you > specify > how to route to a specific network adapter from

Re: [ClusterLabs] Behavior of corosync kill

2020-08-25 Thread Andrei Borzenkov
On Tue, Aug 25, 2020 at 10:00 AM Rohit Saini wrote: > > Hi All, > I am seeing the following behavior. Can someone clarify if this is intended > behavior. If yes, then why so? Please let me know if logs are needed for > better clarity. > > 1. Without Stonith: > Continuous corosync kill on master

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Stonith failing

2020-08-18 Thread Andrei Borzenkov
18.08.2020 10:35, Ulrich Windl пишет: >>>> Andrei Borzenkov schrieb am 18.08.2020 um 09:24 in > Nachricht <83aba38d-c9ea-1dff-e53b-14a9e0623...@gmail.com>: >> 18.08.2020 10:10, Ulrich Windl пишет: >>>>>> Ken Gaillot sc

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-08-17 Thread Andrei Borzenkov
17.08.2020 23:39, Jehan-Guillaume de Rorthais пишет: > On Mon, 17 Aug 2020 10:19:45 -0500 > Ken Gaillot wrote: > >> On Fri, 2020-08-14 at 15:09 +0200, Gabriele Bulfon wrote: >>> Thanks to all your suggestions, I now have the systems with stonith >>> configured on ipmi. >> >> A word of caution:

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

2020-08-18 Thread Andrei Borzenkov
18.08.2020 10:10, Ulrich Windl пишет: Ken Gaillot schrieb am 17.08.2020 um 17:19 in > Nachricht > <73d6ecf113098a3154a2e7db2e2a59557272024a.ca...@redhat.com>: >> On Fri, 2020‑08‑14 at 15:09 +0200, Gabriele Bulfon wrote: >>> Thanks to all your suggestions, I now have the systems with stonith

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-08-18 Thread Andrei Borzenkov
18.08.2020 17:02, Ken Gaillot пишет: > On Tue, 2020-08-18 at 08:21 +0200, Klaus Wenninger wrote: >> On 8/18/20 7:49 AM, Andrei Borzenkov wrote: >>> 17.08.2020 23:39, Jehan-Guillaume de Rorthais пишет: >>>> On Mon, 17 Aug 2020 10:19:45 -0500 >>>> Ken Gai

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-08-19 Thread Andrei Borzenkov
18.08.2020 22:49, Klaus Wenninger пишет: >>> What I'm not sure about is how watchdog-only sbd would behave as a >>> fail-back method for a regular fence device. Will the cluster wait for >>> the sbd timeout no matter what, or only if the regular fencing fails, >>> or ...? >>> >> Diskless SBD

Re: [ClusterLabs] Coming in Pacemaker 2.0.5: better start-up/shutdown coordination with sbd

2020-08-22 Thread Andrei Borzenkov
21.08.2020 21:16, Ken Gaillot пишет: > > Previously at shutdown, sbd determined a clean pacemaker shutdown by > checking whether any resources were running at shutdown. This would > lead to sbd fencing if pacemaker shut down in maintenance mode with > resources active. What conditions lead to

Re: [ClusterLabs] [ClusterLabs Developers] Fencing with a Quorum Device

2020-08-26 Thread Andrei Borzenkov
I changed list to users because it is general usage question, not development topic. 26.08.2020 23:33, Hayden Pfeiffer пишет: > Hello, > > > I am in the process of configuring fencing in an AWS cluster of two > hosts. I have done so and nodes are correctly fenced when > communication is broken

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-08-17 Thread Andrei Borzenkov
17.08.2020 10:06, Klaus Wenninger пишет: >> >>> Alternatively, you can set up corosync-qdevice, using a separate system >>> running qnetd server as a quorum arbitrator. >>> >> Any solution that is based on node suicide is prone to complete cluster >> loss. In particular, in two node cluster with

Re: [ClusterLabs] SBD fencing not working on my two-node cluster

2020-09-22 Thread Andrei Borzenkov
22.09.2020 02:06, Philippe M Stedman пишет: > Hi Strahil, > > Here is the output of those commands I appreciate the help! > > # crm config show > node 1: ceha03 \ > attributes ethmonitor-ens192=1 > node 2: ceha04 \ > attributes ethmonitor-ens192=1 > (...) > primitive

Re: [ClusterLabs] Resources always return to original node

2020-09-26 Thread Andrei Borzenkov
26.09.2020 12:22, Michael Ivanov пишет: > Hallo, > > I have strange problem: when I reset the node on which my resources are > running, > they are correctly migrated to the other node. But when I turn the failed > node > back, then as soon as it is up all resources are returned back to it. I

Re: [ClusterLabs] Two ethernet adapter within same subnet causing issue on Qdevice

2020-10-01 Thread Andrei Borzenkov
01.10.2020 20:09, Richard Seo пишет: > Hello everyone, > I'm trying to setup a cluster with two hosts: > both have two ethernet adapters all within the same subnet. > I've created resources for an adapter for each hosts. > Here is the example: > Stack: corosync > Current DC: ceha06 (version

Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-23 Thread Andrei Borzenkov
23.10.2020 21:08, Lentes, Bernd wrote: > > Surprisingly if the virsh destroy is successful the RA waits until the > domain isn't running anymore: > ... > > I need something like that which waits for some time (maybe 30s) if the domain > nevertheless stops although > "virsh destroy" gives an

Re: [ClusterLabs] fence_scsi problem

2020-10-28 Thread Andrei Borzenkov
On Wed, Oct 28, 2020 at 3:18 PM Patrick Vranckx wrote: > > Hi, > > I try to set up an HA cluster for ZFS. I think fence_scsi is not working > properly. I can reproduce the problem on two kinds of hardware: iSCSI and > SAS storage. > > Here is what I did: > > - set up a storage server with 3 iscsi

Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

2020-10-22 Thread Andrei Borzenkov
22.10.2020 23:29, Lentes, Bernd wrote: > Hi guys, > > occasionally stopping a VirtualDomain resource via "crm resource stop" does > not work, and in the end the node is fenced, which is ugly. > I had a look at the RA to see what it does. After trying to stop the domain > via "virsh shutdown

Re: [ClusterLabs] Adding a node to an active cluster

2020-10-21 Thread Andrei Borzenkov
nges may be overwritten by pacemaker? > 2. Do you have idea where(which config file) crm_node command retrieves its > data? CIB > Thanks, > Jiaqi Tian > > - Original message - > From: Andrei Borzenkov > Sent by: "Users" > To: Cluster

Re: [ClusterLabs] Adding a node to an active cluster

2020-10-21 Thread Andrei Borzenkov
21.10.2020 20:47, Strahil Nikolov wrote: > Both SUSE and RedHat provide utilities to add the node without messing with > the configs manually. Which are crmsh and pcs respectively :) > > What is your distro? > > > Best Regards, > Strahil Nikolov > > > > > > > On Wednesday, 21 October 2020

Re: [ClusterLabs] Still Beginner STONITH Problem

2020-07-19 Thread Andrei Borzenkov
02.07.2020 18:18, stefan.schm...@farmpartner-tec.com пишет: > Hello, > > I hope someone can help with this problem. We are (still) trying to get > Stonith to achieve a running active/active HA Cluster, but sadly to no > avail. > > There are 2 Centos Hosts. On each one there is a virtual Ubuntu

[ClusterLabs] fence_virt architecture? (was: Re: Still Beginner STONITH Problem)

2020-07-19 Thread Andrei Borzenkov
18.07.2020 03:36, Reid Wahl пишет: > I'm not sure that the libvirt backend is intended to be used in this way, > with multiple hosts using the same multicast address. From the > fence_virt.conf man page: > > ~~~ > BACKENDS >libvirt >The libvirt plugin is the simplest plugin. It

Re: [ClusterLabs] Pacemaker Shutdown

2020-07-22 Thread Andrei Borzenkov
On Wed, Jul 22, 2020 at 9:42 AM Harvey Shepherd < harvey.sheph...@aviatnet.com> wrote: > Hi All, > > I'm running Pacemaker 2.0.3 on a two-node cluster, controlling 40+ > resources which are a mixture of clones and other resources that are > colocated with the master instance of certain clones.

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Andrei Borzenkov
On Wed, Jul 22, 2020 at 10:59 AM Хиль Эдуард wrote: > Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on ubuntu 20 > + 1 qdevice. I want to define new resource as systemd unit *dummy.service > *: > > [Unit] > Description=Dummy > [Service] > Restart=on-failure >

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Andrei Borzenkov
On Wed, Jul 22, 2020 at 4:58 PM Ken Gaillot wrote: > On Wed, 2020-07-22 at 10:59 +0300, Хиль Эдуард wrote: > > Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on > > ubuntu 20 + 1 qdevice. I want to define new resource as systemd > > unit dummy.service : > > > > [Unit] > >

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-30 Thread Andrei Borzenkov
30.07.2020 08:42, Strahil Nikolov пишет: > You got plenty of options: > - IPMI based fencing like HP iLO, DELL iDRAC > - SCSI-3 persistent reservations (which can be extended to fence the node > when the reservation(s) were removed) > SCSI reservation prevents data corruption due to

Re: [ClusterLabs] Automatic recover from split brain ?

2020-08-11 Thread Andrei Borzenkov
11.08.2020 10:34, Adam Cécile пишет: > On 8/11/20 8:48 AM, Andrei Borzenkov wrote: >> 08.08.2020 13:10, Adam Cécile пишет: >>> Hello, >>> >>> >>> I'm experiencing issue with corosync/pacemaker running on Debian Buster. >>> Clu

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-30 Thread Andrei Borzenkov
On Thu, Jul 30, 2020 at 11:29 AM Strahil Nikolov wrote: > > This one links to how to power fence when reservations are removed: > https://access.redhat.com/solutions/4526731 > All of this is RH(CS) specific

Re: [ClusterLabs] Automatic recover from split brain ?

2020-08-11 Thread Andrei Borzenkov
08.08.2020 13:10, Adam Cécile пишет: > Hello, > > > I'm experiencing issue with corosync/pacemaker running on Debian Buster. > Cluster has three nodes running in VMWare virtual machine and the > cluster fails when VEEAM backups the virtual machine (I know it's doing > bad things, like freezing

Re: [ClusterLabs] cluster problems after let's encrypt

2020-07-06 Thread Andrei Borzenkov
06.07.2020 19:13, fatcha...@gmx.de wrote: > Hi, > > I'm running a two node corosync httpd-cluster on a CentOS 7. > corosync-2.4.5-4.el7.x86_64 > pcs-0.9.168-4.el7.centos.x86_64 > Today I used Let's Encrypt to install https for two domains on that system. > After that the node with the new

Re: [ClusterLabs] qnetd and booth arbitrator running together in a 3rd geo site

2020-07-14 Thread Andrei Borzenkov
14.07.2020 13:19, Rohit Saini пишет: > Also, " Keep in mind that neither qdevice nor booth is "replacement" for > stonith. " > > Why not? qdevice/booth are handling the split-brain scenario, keeping one > master only even in case of local/geo network disjoints. Can you please > clarify more on

Re: [ClusterLabs] Automatic restart of Pacemaker after reboot and filesystem unmount problem

2020-07-14 Thread Andrei Borzenkov
14.07.2020 14:56, Grégory Sacré пишет: > Dear all, > > > I'm pretty new to Pacemaker so I must be missing something but I cannot find > it in the documentation. > > I'm setting up a SAMBA File Server cluster with DRBD and Pacemaker. Here are > the relevant pcs commands related to the mount

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Andrei Borzenkov
18.06.2020 18:24, Ken Gaillot пишет: > Note that a failed start of a stonith device will not prevent the > cluster from using that device for fencing. It just prevents the > cluster from monitoring the device. > My understanding is that if stonith resource cannot run anywhere, it also won't be

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Two node cluster and extended distance/site failure

2020-06-24 Thread Andrei Borzenkov
24.06.2020 12:20, Ulrich Windl пишет: >> >> How Service Guard handles loss of shared storage? > > When a node is up it would log the event; if a node is down it wouldn't care; > if a node detects a communication problem with the other node, it would fence > itself. > So in case of split brain

Re: [ClusterLabs] Antw: [EXT] Two node cluster and extended distance/site failure

2020-06-24 Thread Andrei Borzenkov
24.06.2020 10:28, Ulrich Windl пишет: >> >> Usual recommendation is third site which functions as witness. This >> works fine up to failure of this third site itself. Unavailability of >> the witness makes normal maintenance of either of two nodes impossible. > > That's a problem of pacemaker: >

[ClusterLabs] Two node cluster and extended distance/site failure

2020-06-24 Thread Andrei Borzenkov
Two node is what I almost exclusively deal with. It works reasonably well in one location where failures to perform fencing are rare and can be mitigated by two different fencing methods. Usually SBD is reliable enough, as failure of shared storage also implies failure of the whole cluster. When

Re: [ClusterLabs] [Off-topic] Message threading (Was: Antw: [EXT] Re: Two node cluster and extended distance/site failure)

2020-06-29 Thread Andrei Borzenkov
29.06.2020 14:57, Ulrich Windl wrote: Klaus Wenninger wrote on 29.06.2020 at 10:12 in > message > > [...] >> My mailer was confused by all these combinations of >> "Antw: Re: Antw:" and didn't compose mails into a >> thread properly. Which is why I missed further >> discussion where it

Re: [ClusterLabs] Antw: [EXT] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Andrei Borzenkov
29.06.2020 20:20, Tony Stocker wrote: > >> >> >> The most interesting part seems to be the question how you define (and >> detect) a failure that will cause a node switch. > > That is a VERY good question! How many mounts failed is the critical > number when you have 130+? If a single one

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-19 Thread Andrei Borzenkov
19.06.2020 01:13, Howard пишет: > Thanks for all the help so far. With your assistance, I'm very close to > stable. > > Made the following changes to the vmfence stonith resource: > > Meta Attrs: failure-timeout=30m migration-threshold=10 > Operations: monitor interval=60s

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-19 Thread Andrei Borzenkov
e.  After 30 minutes it will start trying again. >> >> On Thu, Jun 18, 2020 at 12:29 PM Ken Gaillot > <mailto:kgail...@redhat.com>> wrote: >> >> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote: >> > 18.06.2020 18:24, Ken Gaillot пишет: >

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Andrei Borzenkov
18.06.2020 20:16, Howard пишет: > Thanks for the replies! I will look at the failure-timeout resource > attribute and at adjusting the timeout from 20 to 30 seconds. It is funny > that the 100 tries message is symbolic. > It is not symbolic, it is INFINITY. From pacemaker documentation If

Re: [ClusterLabs] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-17 Thread Andrei Borzenkov
17.06.2020 22:05, Howard пишет: > Hello, recently I received some really great advice from this community > regarding changing the token timeout value in corosync. Thank you! Since > then the cluster has been working perfectly with no errors in the log for > more than a week. > > This morning I

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-07-29 Thread Andrei Borzenkov
On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon wrote: > That one was taken from a specific implementation on Solaris 11. > The situation is a dual node server with shared storage controller: both > nodes see the same disks concurrently. > Here we must be sure that the two nodes are not going to

Re: [ClusterLabs] fence_virt architecture? (was: Re: Still Beginner STONITH Problem)

2020-07-20 Thread Andrei Borzenkov
On Mon, Jul 20, 2020 at 11:45 AM Klaus Wenninger wrote: > On 7/20/20 10:34 AM, Andrei Borzenkov wrote: > > > > >> >> The cpg-configuration sounds interesting as well. Haven't used >> it or looked into the details. Would be interested to hear about >&

Re: [ClusterLabs] fence_virt architecture? (was: Re: Still Beginner STONITH Problem)

2020-07-20 Thread Andrei Borzenkov
t (libvirt network was in NAT mode) or wrong (VMs using Host's bond > in a bridged network). > > > > Best Regards, > > Strahil Nikolov > > > > On 19 July 2020 9:45:29 GMT+03:00, Andrei Borzenkov < > arvidj...@gmail.com> wrote: > >> 18.07.2020 03:36

Re: [ClusterLabs] pacemaker systemd resource

2020-07-22 Thread Andrei Borzenkov
maker-based [1719] (cib_process_request) > info: Completed cib_delete operation for section status: OK (rc=0, > origin=node1.local/crmd/246, version=0.132.5) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.13

Re: [ClusterLabs] Antw: [EXT] why is node fenced ?

2020-07-30 Thread Andrei Borzenkov
30.07.2020 23:23, Lentes, Bernd пишет: > > > - Am 30. Jul 2020 um 9:28 schrieb Ulrich Windl > ulrich.wi...@rz.uni-regensburg.de: > > "Lentes, Bernd" schrieb am 29.07.2020 >> um >> 17:26 in Nachricht >> <1894379294.27456141.1596036406000.javamail.zim...@helmholtz-muenchen.de>: >>> Hi,

Re: [ClusterLabs] Antw: [EXT] Stonith failing

2020-08-16 Thread Andrei Borzenkov
16.08.2020 04:25, Reid Wahl пишет: > > >> - considering that I have both nodes with stonith against the other node, >> once the two nodes can communicate, how can I be sure the two nodes will >> not try to stonith each other? >> > > The simplest option is to add a delay attribute (e.g.,

Re: [ClusterLabs] Antw: [EXT] Re: resource management of standby node

2020-11-30 Thread Andrei Borzenkov
30.11.2020 17:05, Ulrich Windl пишет: >>>> Andrei Borzenkov schrieb am 30.11.2020 um 14:18 in > Nachricht > : >> On Mon, Nov 30, 2020 at 3:11 PM Ulrich Windl >> wrote: >>> >>> Hi! >>> >>> In SLES15 I'm surprised what a stan

Re: [ClusterLabs] resource management of standby node

2020-11-30 Thread Andrei Borzenkov
On Mon, Nov 30, 2020 at 3:11 PM Ulrich Windl wrote: > > Hi! > > In SLES15 I'm surprised what a standby node does: My guess was that a standby > node would stop all resources and then just "shut up", but it seems it still > tried to place resources and calls monitor operations. > Standby nodes

Re: [ClusterLabs] Q: LVM-activate: "WARNING: You are recommended to activate one LV at a time or use exclusive activation mode."

2020-11-30 Thread Andrei Borzenkov
30.11.2020 15:36, Ulrich Windl пишет: > Hi! > > I configured a shared LVM activation as per instructions (I hope) in SLES15 > SP2. However I get this warning: > LVM-activate(prm_testVG_activate)[57281]: WARNING: You are recommended to > activate one LV at a time or use exclusive activation

Re: [ClusterLabs] Antw: [EXT] Re: Preferred node for a service (not constrained)

2020-12-03 Thread Andrei Borzenkov
On Thu, Dec 3, 2020 at 11:11 AM Ulrich Windl wrote: > > >>> Strahil Nikolov schrieb am 02.12.2020 um 22:42 in > Nachricht <311137659.2419591.1606945369...@mail.yahoo.com>: > > Constraints' values are varying from: > > infinity which equals to score of 100 > > to: > > - infinity which equals
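
For readers following the score discussion quoted above: Pacemaker's documentation defines INFINITY as 1,000,000 and describes saturating arithmetic for scores (any value + INFINITY = INFINITY, any value - INFINITY = -INFINITY, INFINITY - INFINITY = -INFINITY). A minimal Python sketch of those rules follows. The function name add_scores is made up for illustration and is not part of any Pacemaker tool; the capping of ordinary sums at +/-INFINITY is my understanding of the behaviour, so treat it as an assumption.

```python
# Sketch of Pacemaker-style score arithmetic (illustrative only).
INFINITY = 1_000_000  # value documented for INFINITY in Pacemaker


def add_scores(a: int, b: int) -> int:
    """Add two scores; add_scores is a hypothetical helper name."""
    # -INFINITY wins over everything, including +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    # Otherwise +INFINITY absorbs any finite value.
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite sums are capped at +/-INFINITY (assumption, see lead-in).
    return max(-INFINITY, min(INFINITY, a + b))


if __name__ == "__main__":
    print(add_scores(100, 200))
    print(add_scores(INFINITY, -INFINITY))
```

The practical consequence is that a single -INFINITY constraint cannot be outweighed by any combination of positive preferences, which is why it acts as a hard ban rather than a strong hint.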

Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-17 Thread Andrei Borzenkov
ched off. You really need to test how ipmi behaves with your specific hardware to make sure it is not possible or to adjust stonith agent to handle delays. To reiterate: > > Da: Andrei Borzenkov > > It is possible that your IPMI/BMC/whatever implementation responds > with success bef

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-14 Thread Andrei Borzenkov
On Mon, Dec 14, 2020 at 2:40 PM Gabriele Bulfon wrote: > > I isolated the log when everything happens (when I disable the ha interface), > attached here. > And where are matching logs from the second node?

Re: [ClusterLabs] Best way to create a floating identity file

2020-12-15 Thread Andrei Borzenkov
15.12.2020 17:10, Tony Stocker пишет: > On Tue, Dec 15, 2020 at 9:02 AM Andrei Borzenkov wrote: >> >> On Tue, Dec 15, 2020 at 4:58 PM Tony Stocker wrote: >>> >> >> You could simply query whether a specific resource (group) is active >> on the nod

Re: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Andrei Borzenkov
17.12.2020 14:02, Ulrich Windl пишет: >>>> Andrei Borzenkov schrieb am 17.12.2020 um 09:50 in > Nachricht > : > > ... >> According to logs from xstha1, it started to activate resources only >> after stonith was confirmed >> >> Dec 16 15

Re: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Andrei Borzenkov
17.12.2020 21:30, Ken Gaillot пишет: > > This reminded me that some IPMI implementations return "success" for > commands before they've actually been completed. This is why > fence_ipmilan has a "power_wait" parameter that defaults to 2 seconds. > But on this case we also do not know whether

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Andrei Borzenkov
18.12.2020 10:09, Ulrich Windl пишет: >>>> Andrei Borzenkov schrieb am 18.12.2020 um 08:01 in > Nachricht : >> 17.12.2020 21:30, Ken Gaillot пишет: >>> >>> This reminded me that some IPMI implementations return "success" for >>> co

Re: [ClusterLabs] Q: warning: new_event_notification (4527-22416-14): Broken pipe (32)

2020-12-18 Thread Andrei Borzenkov
18.12.2020 12:00, Ulrich Windl пишет: > > Maybe a related question: Do STONITH resources have special rules, meaning > they don't wait for successful fencing? pacemaker resources in CIB do not perform fencing. They only register fencing devices with fenced which does actual job. In particular

Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure

2020-12-12 Thread Andrei Borzenkov
Sonicle S.r.l. : http://www.sonicle.com > Music: http://www.gabrielebulfon.com > eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets >   > > > > > -- > > Da: Andrei Borz

Re: [ClusterLabs] Antw: [EXT] Recoveing from node failure

2020-12-11 Thread Andrei Borzenkov
11.12.2020 18:37, Gabriele Bulfon пишет: > I found I can do this temporarily: >   > crm config property cib-bootstrap-options: no-quorum-policy=ignore >   All two node clusters I remember run with setting forever :) > then once node 2 is up again: >   > crm config property cib-bootstrap-options:

Re: [ClusterLabs] Running shell command on remote node via corosync messaging infrastructure

2020-12-18 Thread Andrei Borzenkov
18.12.2020 21:54, Ken Gaillot пишет: > On Fri, 2020-12-18 at 17:51 +, Animesh Pande wrote: >> Hello, >> >> Is there a tool that would allow for commands to be run on remote >> nodes in the cluster through the corosync messaging layer? I have a >> cluster configured with multiple corosync

Re: [ClusterLabs] Best way to create a floating identity file

2020-12-15 Thread Andrei Borzenkov
On Tue, Dec 15, 2020 at 4:58 PM Tony Stocker wrote: > > I'm trying to figure out the best way to do the following on our > 2-node clusters. > > Whichever node is the primary (all services run on a single node) I > want to create a file that contains an identity descriptor, e.g. >

Re: [ClusterLabs] Can't have 2 nodes as master with galera resource agent

2020-12-11 Thread Andrei Borzenkov
11.12.2020 16:13, Raphael Laguerre пишет: > Hello, > > I'm trying to setup a 2 nodes cluster with 2 galera instances. I use the > ocf:heartbeat:galera resource agent, however, after I create the resource, > only one node appears to be in master role, the other one can't be promoted > and stays

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Recoveing from node failure

2020-12-15 Thread Andrei Borzenkov
> Gabriele > Sonicle S.r.l. : http://www.sonicle.com > Music: http://www.gabrielebulfon.com > eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets > -- > From:

Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Andrei Borzenkov
16.12.2020 17:56, Gabriele Bulfon wrote: > Thanks, here are the logs, there is info about how it tried to start > resources on the nodes. Both logs are from the same node. > Keep in mind that node1 was already running the resources, and I simulated a > problem by turning down the ha

Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-16 Thread Andrei Borzenkov
16.12.2020 19:05, Gabriele Bulfon wrote: > Looking at the two logs, it looks like corosync decided that xst1 was offline, > while xst was still online. > I just issued an "ifconfig ha0 down" on xst1, so I expect both nodes cannot > see each other, while I see these same lines both on xst1 and xst2

Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

2020-12-17 Thread Andrei Borzenkov
On Thu, Dec 17, 2020 at 11:11 AM Gabriele Bulfon wrote: > > Yes, sorry took same bash by mistake...here are the correct logs. > > Yes, xstha1 has delay 10s so that I'm giving him precedence, xstha2 has delay > 1s and will be stonished earlier. > During the short time before xstha2 got powered

Re: [ClusterLabs] stop a node

2020-11-15 Thread Andrei Borzenkov
15.11.2020 20:00, Guy Przytula wrote: > a question would be : > > we have maintenance to perform on a node of the cluster > > to avoid that the cluster starts the resource that we stopped - we want > to disable a node temporarily - is this possible without deleting the node > Put node in

Re: [ClusterLabs] Adding a node to an active cluster

2020-10-21 Thread Andrei Borzenkov
On Wed, Oct 21, 2020 at 5:03 PM Jiaqi Tian1 wrote: > > Hi, > I'm trying to add a new node into an active pacemaker cluster with resources > up and running. > After steps: > 1. update corosync.conf files among all hosts in cluster including the new > node > 2. copy corosync auth file to the new

Re: [ClusterLabs] CCIB migration from Pacemaker 1.x to 2.x

2021-01-23 Thread Andrei Borzenkov
23.01.2021 19:10, Sharma, Jaikumar wrote: > Hi guys, > > I'm a newbie to high availability clusters, please excuse me - learning the tools > stack (corosync & pacemaker). > > In fact, our high availability solution is based on Debian 9.x (pacemaker 1.x > and corosync 2.x) - which worked as expected. >

Re: [ClusterLabs] Stopping all nodes causes servers to migrate

2021-01-25 Thread Andrei Borzenkov
On Mon, Jan 25, 2021 at 12:07 PM Jehan-Guillaume de Rorthais wrote: > As actions during a cluster shutdown cannot be handled in the same transition > for each nodes, I usually add a step to disable all resources using property > "stop-all-resources" before shutting down the cluster: > > pcs

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Cluster breaks after pcs unstandby node

2021-01-18 Thread Andrei Borzenkov
On Mon, Jan 18, 2021 at 12:00 PM Steffen Vinther Sørensen wrote: > > Hi, > > I have persistent journal, but 'journalctl -b -1' was empty in this > case, so it might not be optimally configured. And centralized logging > is on the todo list > > > btw. about the fencing, I have set '

Re: [ClusterLabs] Q: When do I need virtlockd?

2021-01-18 Thread Andrei Borzenkov
On Mon, Jan 18, 2021 at 11:55 AM Ulrich Windl wrote: > > So can someone explain, or direct me to some helpful docs? > Are you aware of https://libvirt.org/kbase/locking.html which links further to the virtlockd description?

Re: [ClusterLabs] Antw: [EXT] Re: Problem with systemd socket service (start fails when running already)

2021-01-31 Thread Andrei Borzenkov
On Mon, Feb 1, 2021 at 10:07 AM Ulrich Windl wrote: > > You are saying starting libvirtd does not require the ro and tls socket units > to be started? > So far I am not aware of any service that would *require* socket activation. Socket activation is an optimization that allows you to avoid

Re: [ClusterLabs] failed migration handled the wrong way

2021-02-01 Thread Andrei Borzenkov
On Mon, Feb 1, 2021 at 12:53 PM Ulrich Windl wrote: > > Hi! > > While fighting to get the wrong configuration, I broke libvirt live-migration > by not enabling the TLS socket. > > When testing to live-migrate a VM from h16 to h18, these are the essential > events: > Feb 01 10:30:10 h16

Re: [ClusterLabs] Antw: [EXT] Re: failed migration handled the wrong way

2021-02-01 Thread Andrei Borzenkov
On Mon, Feb 1, 2021 at 1:59 PM Ulrich Windl wrote: > > But the VM *wasn't* stopped on h16! > I am not sure what you mean here. It was not stopped during migration? Yes, pacemaker knew it and it tried to stop it explicitly when migration failed. It was not stopped when pacemaker tried to stop it?

Re: [ClusterLabs] Disable all resources in a group if one or more of them fail and are unable to reactivate

2021-01-27 Thread Andrei Borzenkov
27.01.2021 19:06, damiano giuliani wrote: > Hi all, I'm pretty new to clusters, and I'm struggling to configure a > bunch of resources and test how they fail over. My need is to start and > manage a group of resources as one (in order to achieve this a resource > group has been created), and if

Re: [ClusterLabs] Disable all resources in a group if one or more of them fail and are unable to reactivate

2021-01-28 Thread Andrei Borzenkov
27.01.2021 22:03, Ken Gaillot wrote: > > With a group, later members depend on earlier members. If an earlier > member can't run, then no members after it can run. > > However we can't make the dependency go in both directions. If an > earlier member can't run unless a later member is active,

Re: [ClusterLabs] Peer (slave) node deleting master's transient_attributes

2021-01-30 Thread Andrei Borzenkov
29.01.2021 20:37, Stuart Massey wrote: > Can someone help me with this? > Background: > > "node01" is failing, and has been placed in "maintenance" mode. It > occasionally loses connectivity. > > "node02" is able to run our resources > > Consider the following messages from pacemaker.log on
