Re: [ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

2019-02-25 Thread Andrei Borzenkov
25.02.2019 11:50, Samarth Jain writes: > Hi, > > > We have a bunch of resources running in master slave configuration with one > master and one slave instance running at any given time. > > What we observe is, that for any two given resources at a time, if say > resource Stateful_Test_1 is in

Re: [ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

2019-02-25 Thread Andrei Borzenkov
26.02.2019 1:08, Ken Gaillot writes: > On Mon, 2019-02-25 at 23:00 +0300, Andrei Borzenkov wrote: >> 25.02.2019 22:36, Andrei Borzenkov writes: >>> >>>> Could you please help me understand: >>>> 1. Why doesn't pacemaker process the failure of Stateful_T

Re: [ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

2019-02-25 Thread Andrei Borzenkov
25.02.2019 23:13, Ken Gaillot writes: > On Mon, 2019-02-25 at 14:20 +0530, Samarth Jain wrote: >> Hi, >> >> >> We have a bunch of resources running in master slave configuration >> with one master and one slave instance running at any given time. >> >> What we observe is, that for any two given

Re: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Andrei Borzenkov
20.02.2019 21:51, Eric Robinson writes: > > The following should show OK in a fixed font like Consolas, but the following > setup is supposed to be possible, and is even referenced in the ClusterLabs > documentation. > > > > > > +--+ > > | mysql001 +--+ > >

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Andrei Borzenkov
18.02.2019 18:53, Ken Gaillot writes: > On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote: >> 17.02.2019 0:33, Andrei Borzenkov writes: >>> 17.02.2019 0:03, Eric Robinson writes: >>>> Here are the relevant corosync logs. >>>> >>>> It

Re: [ClusterLabs] NFS4 share not working

2019-02-22 Thread Andrei Borzenkov
23.02.2019 2:57, solarflow99 writes: > I'm trying to have my NFS share exported via pacemaker and now it doesn't > seem to be working, it also kills off nfs-mountd. It looks like the rbd > device could have something to do with it, the nfsroot doesn't get > exported, but there's no indication why:

Re: [ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

2019-02-26 Thread Andrei Borzenkov
26.02.2019 18:05, Ken Gaillot writes: > On Tue, 2019-02-26 at 06:55 +0300, Andrei Borzenkov wrote: >> 26.02.2019 1:08, Ken Gaillot writes: >>> On Mon, 2019-02-25 at 23:00 +0300, Andrei Borzenkov wrote: >>>> 25.02.2019 22:36, Andrei Borzenkov writes: >>>>&g

Re: [ClusterLabs] Two mode cluster VMware drbd

2019-03-12 Thread Andrei Borzenkov
12.03.2019 18:10, Adam Budziński writes: > Hello, > > > > I’m planning to setup a two node (active-passive) HA cluster consisting of > pacemaker, corosync and DRBD. The two nodes will run on VMware VM’s and > connect to a single DB server (unfortunately for various reasons not > included in the

Re: [ClusterLabs] Interface confusion

2019-03-15 Thread Andrei Borzenkov
16.03.2019 1:16, Adam Budziński writes: > Hi Tomas, > > Ok but how then pacemaker or the fence agent knows which route to take to > reach the vCenter? They do not know or care at all. It is up to your underlying operating system and its routing tables. > Btw. Do I have to add the stonith

Re: [ClusterLabs] Interface confusion

2019-03-16 Thread Andrei Borzenkov
e stonith agent is not prohibited to run by (co-)location rules. My understanding is that this node is selected by DC in partition. > Thank you! > > Sat, 16.03.2019, 05:37 user Andrei Borzenkov > wrote: > >> 16.03.2019 1:16, Adam Budziński writes: >>> Hi Tomas, >>

Re: [ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?

2019-03-22 Thread Andrei Borzenkov
On Fri, Mar 22, 2019 at 1:08 PM Jan Pokorný wrote: > > Also a Friday's idea: > Perhaps we should crank up "how to ask" manual for this list Yest another one? http://www.catb.org/~esr/faqs/smart-questions.html ___ Manage your subscription:

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Andrei Borzenkov
t away from its current node. In this particular case it may be argued that pacemaker reaction is unjustified. Administrator explicitly set target state to "stop" (otherwise pacemaker would not attempt to stop it) so it is unclear why it tries to restart it on other node. >> -O

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Andrei Borzenkov
17.02.2019 0:03, Eric Robinson writes: > Here are the relevant corosync logs. > > It appears that the stop action for resource p_mysql_002 failed, and that > caused a cascading series of service changes. However, I don't understand > why, since no other resources are dependent on p_mysql_002. >

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-17 Thread Andrei Borzenkov
17.02.2019 0:33, Andrei Borzenkov writes: > 17.02.2019 0:03, Eric Robinson writes: >> Here are the relevant corosync logs. >> >> It appears that the stop action for resource p_mysql_002 failed, and that >> caused a cascading series of service changes. However, I don'

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-17 Thread Andrei Borzenkov
17.02.2019 0:44, Eric Robinson writes: > Thanks for the feedback, Andrei. > > I only want cluster failover to occur if the filesystem or drbd resources > fail, or if the cluster messaging layer detects a complete node failure. Is > there a way to tell PaceMaker not to trigger a cluster failover

Re: [ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-13 Thread Andrei Borzenkov
13.02.2019 15:50, Maciej S writes: > Can you describe at least one situation when it could happen? > I see situations where data on two masters can diverge but I can't find the > one where data gets corrupted. If diverged data in two databases that are supposed to be exact copy of each other is

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Andrei Borzenkov
19.02.2019 23:06, Eric Robinson writes: ... > Bottom line is, how do we configure the cluster in such a way that > there are no cascading circumstances when a MySQL resource fails? > Basically, if a MySQL resource fails, it fails. We'll deal with that > on an ad-hoc basis. I don't want the whole

Re: [ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS

2019-01-24 Thread Andrei Borzenkov
23.01.2019 17:20, Klaus Wenninger writes: > > And yes dynamic-configuration of two_node should be possible - > remember that I had to implement that communication with > corosync into sbd for clusters that are expanded node-by-node > using pcs. > 'corosync-cfgtool -R' to reload the config. >

Re: [ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS

2019-01-24 Thread Andrei Borzenkov
24.01.2019 18:01, Lentes, Bernd writes: > - On Jan 23, 2019, at 3:20 PM, Klaus Wenninger kwenn...@redhat.com wrote: >>> I have corosync-2.3.6-9.13.1.x86_64. >>> Where can i configure this value ? >> >> speaking of two_node & wait_for_all? >> That is configured in the quorum-section of

Re: [ClusterLabs] How to reduce SBD watchdog timeout?

2019-04-07 Thread Andrei Borzenkov
03.04.2019 13:04, Klaus Wenninger writes: > On 4/3/19 9:47 AM, Andrei Borzenkov wrote: >> On Tue, Apr 2, 2019 at 8:49 PM Digimer wrote: >>> It's worth noting that SBD fencing is "better than nothing", but slow. >>> IPMI and/or PDU fencing completes a lot fas

Re: [ClusterLabs] SBD as watchdog daemon

2019-04-14 Thread Andrei Borzenkov
12.04.2019 15:30, Олег Самойлов writes: > >> 11 Apr 2019, at 20:00, Klaus Wenninger >> wrote: >> >> On 4/11/19 5:27 PM, Олег Самойлов wrote: >>> Hi all. I am developing HA PostgreSQL cluster for 2 or 3 >>> datacenters. In case of DataCenter failure (blackout) the fencing >>> will not

Re: [ClusterLabs] Antw: Re: Antw: Re: Q: ocf:pacemaker:NodeUtilization monitor

2019-06-03 Thread Andrei Borzenkov
03.06.2019 9:09, Ulrich Windl writes: > 118 if [ -x $xentool ]; then > 119 $xentool info | awk >>> '/total_memory/{printf("%d\n",$3);exit(0)}' > 120 else > 121 ocf_log warn "Can only set hv_memory for Xen hypervisor" > 122 echo "0"

Re: [ClusterLabs] EXTERNAL: Re: Pacemaker not reacting as I would expect when two resources fail at the same time

2019-06-08 Thread Andrei Borzenkov
08.06.2019 5:12, Harvey Shepherd writes: > Thank you for your advice Ken. Sorry for the delayed reply - I was trying out > a few things and trying to capture extra info. The changes that you suggested > make sense, and I have incorporated them into my config. However, the > original issue

Re: [ClusterLabs] Antw: Re: Q: ocf:pacemaker:NodeUtilization monitor

2019-05-29 Thread Andrei Borzenkov
29.05.2019 11:12, Ulrich Windl writes: Jan Pokorný wrote on 28.05.2019 at 16:31 in > message > <20190528143145.ga29...@redhat.com>: >> On 27/05/19 08:28 +0200, Ulrich Windl wrote: >>> I configured ocf:pacemaker:NodeUtilization more or less for fun, and I >> realized that the cluster

Re: [ClusterLabs] How to correctly stop cluster with active stonith watchdog?

2019-05-12 Thread Andrei Borzenkov
30.04.2019 9:53, Digimer writes: > On 2019-04-30 12:07 a.m., Andrei Borzenkov wrote: >> As soon as majority of nodes are stopped, the remaining nodes are out of >> quorum and watchdog reboot kicks in. >> >> What is the correct procedure to ensure nodes are sto

Re: [ClusterLabs] Constant stop/start of resource in spite of interval=0

2019-05-18 Thread Andrei Borzenkov
18.05.2019 18:34, Kadlecsik József writes: > Hello, > > We have a resource agent which creates IP tunnels. In spite of the > configuration setting > > primitive tunnel-eduroam ocf:local:tunnel \ > params > op start timeout=120s interval=0 \ > op stop timeout=300s

Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-21 Thread Andrei Borzenkov
21.05.2019 0:46, Ken Gaillot writes: >> >>> From what's described here, the op-restart-digest is changing every >>> time, which means something is going wrong in the hash comparison >>> (since the definition is not really changing). >>> >>> The log that stands out to me is: >>> >>> trace May 18

Re: [ClusterLabs] How to correctly stop cluster with active stonith watchdog?

2019-04-30 Thread Andrei Borzenkov
about dynamic cluster expansion; the question is about normal static cluster with fixed number of nodes that needs to be shut down. >> 30 Apr 2019, at 7:07, Andrei Borzenkov wrote: >> >> As soon as majority of nodes are stopped, the remaining nodes are out of >> q

Re: [ClusterLabs] How to correctly stop cluster with active stonith watchdog?

2019-04-30 Thread Andrei Borzenkov
30.04.2019 19:34, Олег Самойлов writes: > >> No. I simply want reliable way to shutdown the whole cluster (for >> maintenance). > > Official way is `pcs cluster stop --all`. pcs is just one of multiple high level tools. I am interested in plumbing, not porcelain. > But it’s not always worked as

Re: [ClusterLabs] Timeout stopping corosync-qdevice service

2019-04-30 Thread Andrei Borzenkov
30.04.2019 9:51, Jan Friesse writes: > >> Now, corosync-qdevice gets SIGTERM as "signal to terminate", but it >> installs SIGTERM handler that does not exit and only closes some socket. >> May be this should trigger termination of main loop, but somehow it does >> not. > > Yep, this is exactly

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-27 Thread Andrei Borzenkov
27.04.2019 1:04, Danka Ivanović writes: > Hi, here is a complete cluster configuration: > > node 1: master > node 2: secondary > primitive AWSVIP awsvip \ > params secondary_private_ip=10.x.x.x api_delay=5 > primitive PGSQL pgsqlms \ > params pgdata="/var/lib/postgresql/9.5/main" >

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-29 Thread Andrei Borzenkov
29.04.2019 18:05, Ken Gaillot writes: >> >>> Why does not it check OCF_RESKEY_CRM_meta_notify? >> >> I was just not aware of this env variable. Sadly, it is not >> documented >> anywhere :( > > It's not a Pacemaker-created value like the other notify variables -- > all user-specified

[ClusterLabs] How to correctly stop cluster with active stonith watchdog?

2019-04-29 Thread Andrei Borzenkov
As soon as majority of nodes are stopped, the remaining nodes are out of quorum and watchdog reboot kicks in. What is the correct procedure to ensure nodes are stopped in clean way? Short of disabling stonith-watchdog-timeout before stopping cluster ...
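The workaround hinted at above (disabling stonith-watchdog-timeout before a full shutdown) can be sketched as follows. This is only an illustration, assuming pcs is the management tool in use; verify the property syntax against your Pacemaker version before relying on it.

```shell
# Sketch of a full-cluster shutdown with SBD watchdog active (assumption:
# pcs is available; adjust for crmsh or raw cibadmin as needed).

# Disable watchdog self-fencing so nodes that lose quorum during the
# staggered shutdown do not reset themselves:
pcs property set stonith-watchdog-timeout=0

# Stop cluster services on all nodes as close to simultaneously as possible:
pcs cluster stop --all

# After the cluster is brought back up, restore the previous timeout
# (example value only):
# pcs property set stonith-watchdog-timeout=10s
```

Note the thread itself questions whether `pcs cluster stop --all` is sufficiently atomic, so the property change is the part doing the real work here.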

Re: [ClusterLabs] Timeout stopping corosync-qdevice service

2019-04-29 Thread Andrei Borzenkov
29.04.2019 14:32, Jan Friesse writes: > Andrei, > >> I setup qdevice in openSUSE Tumbleweed and while it works as expected I > > Is it corosync-qdevice or corosync-qnetd daemon? > corosync-qdevice >> cannot stop it - it always results in timeout and service finally gets >> killed by systemd.

Re: [ClusterLabs] monitor timed out with unknown error

2019-05-06 Thread Andrei Borzenkov
On Mon, May 6, 2019 at 8:30 AM Arkadiy Kulev wrote: > > Andrei, > > I just went through the docs > (https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html) > and it says that the option "failure-timeout" is responsible for retrying a > failed

Re: [ClusterLabs] crm_mon output to html-file - is there a way to manipulate the html-file ?

2019-05-03 Thread Andrei Borzenkov
03.05.2019 20:18, Lentes, Bernd writes: > Hi, > > on my cluster nodes i established a systemd service which starts crm_mon > which writes cluster information into a html-file so i can see the state > of my cluster in a webbrowser. > crm_mon is started that way: > /usr/sbin/crm_mon -d -i 10 -h

[ClusterLabs] corosync-qdevice[3772]: Heuristics worker waitpid failed (10): No child processes

2019-05-04 Thread Andrei Borzenkov
While testing corosync-qdevice I repeatedly got the above message. The reason seems to be startup sequence in corosync-qdevice. Consider: ● corosync-qdevice.service - Corosync Qdevice daemon Loaded: loaded (/etc/systemd/system/corosync-qdevice.service; disabled; vendor preset: disabled)

Re: [ClusterLabs] How to correctly stop cluster with active stonith watchdog?

2019-05-04 Thread Andrei Borzenkov
30.04.2019 19:47, Олег Самойлов writes: > > >> 30 Apr 2019, at 19:38, Andrei Borzenkov >> wrote: >> >> 30.04.2019 19:34, Олег Самойлов writes: >>> >>>> No. I simply want reliable way to shutdown the whole cluster >>>> (for

Re: [ClusterLabs] monitor timed out with unknown error

2019-05-05 Thread Andrei Borzenkov
05.05.2019 16:14, Arkadiy Kulev writes: > Hello! > > I run pacemaker on 2 active/active hosts which balance the load of 2 public > IP addresses. > A few days ago we ran a very CPU/network intensive process on one of the 2 > hosts and Pacemaker failed. > > I've attached a screenshot of the

Re: [ClusterLabs] monitor timed out with unknown error

2019-05-05 Thread Andrei Borzenkov
> On Sun, May 5, 2019 at 11:05 PM Andrei Borzenkov > wrote: > >> 05.05.2019 18:43, Arkadiy Kulev writes: >>> Dear Andrei, >>> >>> I'm sorry for the screenshot, this is the only thing that I have left >> after >>> the crash. >>> >&

Re: [ClusterLabs] monitor timed out with unknown error

2019-05-05 Thread Andrei Borzenkov
rerequisite was successful stop of resource. > Sincerely, > Ark. > > e...@ethaniel.com > > > On Sun, May 5, 2019 at 9:46 PM Andrei Borzenkov wrote: > >> 05.05.2019 16:14, Arkadiy Kulev writes: >>> Hello! >>> >>> I run pacemaker on 2 active/activ

Re: [ClusterLabs] Problems with master/slave failovers

2019-07-03 Thread Andrei Borzenkov
On Wed, Jul 3, 2019 at 12:59 AM Ken Gaillot wrote: > > On Mon, 2019-07-01 at 23:30 +, Harvey Shepherd wrote: > > > The "transition summary" is just a resource-by-resource list, not > > > the > > > order things will be done. The "executing cluster transition" > > > section > > > is the order

Re: [ClusterLabs] Problems with master/slave failovers

2019-06-28 Thread Andrei Borzenkov
On Fri, Jun 28, 2019 at 7:24 AM Harvey Shepherd wrote: > > Hi All, > > > I'm running Pacemaker 2.0.2 on a two node cluster. It runs one master/slave > resource (I'll refer to it as the king resource) and about 20 other resources > which are a mixture of: > > > - resources that only run on the

Re: [ClusterLabs] [EXTERNAL] Re: "node is unclean" leads to gratuitous reboot

2019-07-11 Thread Andrei Borzenkov
On Thu, Jul 11, 2019 at 12:58 PM Lars Ellenberg wrote: > > On Wed, Jul 10, 2019 at 06:15:56PM +, Michael Powell wrote: > > Thanks to you and Andrei for your responses. In our particular > > situation, we want to be able to operate with either node in > > stand-alone mode, or with both nodes

Re: [ClusterLabs] "node is unclean" leads to gratuitous reboot

2019-07-09 Thread Andrei Borzenkov
On Tue, Jul 9, 2019 at 3:54 PM Michael Powell < michael.pow...@harmonicinc.com> wrote: > I have a two-node cluster with a problem. If I start Corosync/Pacemaker > on one node, and then delay startup on the 2nd node (which is otherwise > up and running), the 2nd node will be rebooted very soon

Re: [ClusterLabs] Problems with master/slave failovers

2019-06-29 Thread Andrei Borzenkov
28.06.2019 9:45, Andrei Borzenkov writes: > On Fri, Jun 28, 2019 at 7:24 AM Harvey Shepherd > wrote: >> >> Hi All, >> >> >> I'm running Pacemaker 2.0.2 on a two node cluster. It runs one master/slave >> resource (I'll refer to it as the king resour

Re: [ClusterLabs] shutdown of 2-Node cluster when power outage

2019-04-20 Thread Andrei Borzenkov
20.04.2019 22:29, Lentes, Bernd writes: > > > - On 18 Apr 2019 at 16:21, kgaillot kgail...@redhat.com wrote: > >> >> Simply stopping pacemaker and corosync by whatever mechanism your >> distribution uses (e.g. systemctl) should be sufficient. > > That works. But strangely is that after a

Re: [ClusterLabs] Antw: Interacting with Pacemaker from my code

2019-07-16 Thread Andrei Borzenkov
On Tue, Jul 16, 2019 at 11:01 AM Nishant Nakate wrote: > >> > >> > I will give you a quick overview of the system. There would be 3 nodes >> > configured in a cluster. One would act as a leader and others as >> > followers. Our system would be actively running on all the three nodes and >> >

Re: [ClusterLabs] Problems with master/slave failovers

2019-07-01 Thread Andrei Borzenkov
02.07.2019 2:30, Harvey Shepherd writes: >> The "transition summary" is just a resource-by-resource list, not the >> order things will be done. The "executing cluster transition" section >> is the order things are being done. > > Thanks Ken. I think that's where the problem is originating. If you

Re: [ClusterLabs] Problems with master/slave failovers

2019-06-29 Thread Andrei Borzenkov
29.06.2019 8:05, Harvey Shepherd writes: > There is an ordering constraint - everything must be started after the king > resource. But even if this constraint didn't exist I don't see that it should > logically make any difference due to all the non-clone resources being > colocated with the

[ClusterLabs] How to clean up failed fencing action?

2019-08-03 Thread Andrei Borzenkov
I'm using sbd watchdog and stonith-watchdog-timeout without explicit stonith agents (shared nothing cluster). How can I clean up failed fencing action? Current DC: ha1 (version 2.0.1+20190408.1b68da8e8-1.3-2.0.1+20190408.1b68da8e8) - partition with quorum Last updated: Sat Aug 3 19:10:12 2019
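For readers hitting the same question, a hedged sketch of inspecting and clearing fencing history follows. It assumes a recent Pacemaker 2.0.x `stonith_admin` with history support; check `stonith_admin --help` on your version, as option availability varies.

```shell
# Show recorded fencing actions, including failed ones, for all nodes
# ('*' means every node; assumption: history support is compiled in):
stonith_admin --history '*' --verbose

# Clear the fencing history for one node (or '*' for all):
stonith_admin --history ha1 --cleanup

# Note: crm_resource --cleanup clears resource operation history only,
# so it does not by itself remove failed fencing actions.
```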

Re: [ClusterLabs] Reusing resource set in multiple constraints

2019-08-03 Thread Andrei Borzenkov
29.07.2019 22:07, Ken Gaillot writes: > On Sat, 2019-07-27 at 11:04 +0300, Andrei Borzenkov wrote: >> Is it possible to have single definition of resource set that is >> later >> referenced in order and location constraints? All syntax in >> documentation or crmsh pre

Re: [ClusterLabs] Antw: Re: Gracefully stop nodes one by one with disk-less sbd

2019-08-12 Thread Andrei Borzenkov
Sent from iPhone 12 Aug 2019, at 9:48, Ulrich Windl wrote: >>>> Andrei Borzenkov wrote on 09.08.2019 at 18:40 in > message <217d10d8-022c-eaf6-28ae-a4f58b2f9...@gmail.com>: >> 09.08.2019 16:34, Yan Gao writes: >>> Hi, >>> >&

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-12 Thread Andrei Borzenkov
Sent from iPhone > 12 Aug 2019, at 8:46, Jan Friesse wrote: > > Олег Самойлов wrote: >>> 9 Aug 2019, at 9:25, Jan Friesse wrote: >>> Please do not set dpd_interval that high. dpd_interval on qnetd side is not >>> about how often the ping is sent. Could you please

Re: [ClusterLabs] [EXTERNAL] Users Digest, Vol 55, Issue 19

2019-08-12 Thread Andrei Borzenkov
terlabs.org > > When replying, please edit your Subject line so it is more specific than "Re: > Contents of Users digest..." > > > Today's Topics: > >1. why is node fenced ? (Lentes, Bernd) >2. Postgres HA - pacemaker RA do not support aut

Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-12 Thread Andrei Borzenkov
On Mon, Aug 12, 2019 at 4:12 PM Michael Powell < michael.pow...@harmonicinc.com> wrote: > At 07:44:49, the ss agent discovers that the master instance has failed on > node *mgraid…-0* as a result of a failed *ssadm* request in response to > an *ss_monitor()* operation. It issues a *crm_master -Q

Re: [ClusterLabs] Pacemaker - mounting md devices and run quotaon command

2019-08-20 Thread Andrei Borzenkov
On Tue, Aug 20, 2019 at 1:03 AM Del Monaco, Andrea wrote: > > Hi Users, > > > > As per title – do you know if there is some resource in pacemaker that allows > a filesystem (md array) to be mounted and then run the quotaon command on it Is not quota information persistent so it is enough to run

Re: [ClusterLabs] Thoughts on crm shell

2019-08-22 Thread Andrei Borzenkov
22.08.2019 12:49, Ulrich Windl writes: > Hi! > > It's been a while since I used crm shell, and now after having moved from > SLES11 to SLES12 (having to use it again), I realized a few things: > > 1) As the ptest command is crm_simulate now, shouldn't crm shell's ptest (in > configure) be

Re: [ClusterLabs] node name issues (Could not obtain a node name for corosync nodeid 739512332)

2019-08-22 Thread Andrei Borzenkov
22.08.2019 10:07, Ulrich Windl writes: > Hi! > > When starting pacemaker (1.1.19+20181105.ccd6b5b10-3.10.1) on a node that had > been down for a while, I noticed some unexpected messages about the node name: > > pacemakerd: notice: get_node_name: Could not obtain a node name for > corosync

Re: [ClusterLabs] Command to show location constraints?

2019-08-27 Thread Andrei Borzenkov
27.08.2019 18:24, Casey & Gina writes: > Hi, I'm looking for a way to show just location constraints, if they exist, > for a cluster. I'm looking for the same data shown in the output of `pcs > config` under the "Location Constraints:" header, but without all the rest, > so that I can write a
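For scripting purposes, both major shells can filter the configuration down to location constraints. The commands below are a sketch; exact output format differs between pcs 0.9 and 0.10, so test before parsing.

```shell
# pcs: list only location constraints (older pcs versions spell this
# "pcs constraint location show"):
pcs constraint location

# crmsh equivalent: show only location entries from the CIB configuration:
crm configure show type:location

# For machine parsing, the raw XML is often easier:
cibadmin --query --xpath "//rsc_location"
```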

Re: [ClusterLabs] New status reporting for starting/stopping resources in 1.1.19-8.el7

2019-08-30 Thread Andrei Borzenkov
31.08.2019 6:39, Chris Walker writes: > Hello, > The 1.1.19-8 EL7 version of Pacemaker contains a commit ‘Feature: crmd: > default record-pending to TRUE’ that is not in the ClusterLabs Github repo. commit b48ceeb041cee65a9b93b9b76235e475fa1a128f Author: Ken Gaillot Date: Mon Oct 16 09:45:18

Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-03 Thread Andrei Borzenkov
03.09.2019 11:09, Marco Marino writes: > Hi, I have a problem with fencing on a two node cluster. It seems that > randomly the cluster cannot complete monitor operation for fence devices. > In log I see: > crmd[8206]: error: Result of monitor operation for fence-node2 on > ld2.mydomain.it: Timed

Re: [ClusterLabs] IPAddr2 RA and CLUSTERIP local_node

2019-09-03 Thread Andrei Borzenkov
04.09.2019 2:03, Tomer Azran writes: > Hello, > > When using IPaddr2 RA in order to set a cloned IP address resource: > > pcs resource create vip1 ocf:heartbeat:IPaddr2 ip=10.0.0.100 iflabel=vip1 > cidr_netmask=24 flush_routes=true op monitor interval=30s > pcs resource clone vip1 clone-max=2
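For context, a fuller sketch of the load-sharing clone setup being discussed: IPaddr2 only switches into its CLUSTERIP mode when cloned with `globally-unique=true`, which is the usual missing piece. The exact flags below are illustrative; note also that the iptables CLUSTERIP target this relies on is deprecated in newer kernels.

```shell
# Sketch of a two-node load-sharing cloned IP (assumptions: pcs syntax,
# clusterip_hash value; verify against your resource-agents version).
pcs resource create vip1 ocf:heartbeat:IPaddr2 \
    ip=10.0.0.100 cidr_netmask=24 iflabel=vip1 flush_routes=true \
    clusterip_hash=sourceip-sourceport \
    op monitor interval=30s

# globally-unique=true is what makes IPaddr2 use CLUSTERIP so each clone
# instance answers a share of the traffic:
pcs resource clone vip1 clone-max=2 clone-node-max=2 globally-unique=true
```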

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-03 Thread Andrei Borzenkov
04.09.2019 0:27, wf...@niif.hu writes: > Jeevan Patnaik writes: > >> [16187] node1 corosync warning [MAIN ] Corosync main process was not >> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token >> timeout increase. >> [...] >> 2. How to fix this? We have not much load on the

Re: [ClusterLabs] IPaddr2 RA and multicast mac

2019-09-03 Thread Andrei Borzenkov
04.09.2019 1:27, Tomer Azran writes: > Hello, > > When using IPaddr2 RA in order to set a cloned IP address resource: > > pcs resource create vip1 ocf:heartbeat:IPaddr2 ip=10.0.0.100 iflabel=vip1 > cidr_netmask=24 flush_routes=true op monitor interval=30s > pcs resource clone vip1 clone-max=2

Re: [ClusterLabs] Antw: Re: node name issues (Could not obtain a node name for corosync nodeid 739512332)

2019-08-26 Thread Andrei Borzenkov
On Mon, Aug 26, 2019 at 9:59 AM Ulrich Windl wrote: > Also see my earlier message. If adding the node name to corosync conf is > highly recommended, I wonder why SUSE's SLES procedure does not set it... > If you mean ha-cluster-init/ha-cluster-join, it just invokes "crm cluster", so you may

Re: [ClusterLabs] Antw: Re: Antw: Re: pacemaker resources under systemd

2019-09-12 Thread Andrei Borzenkov
On Thu, Sep 12, 2019 at 3:45 PM Ulrich Windl wrote: > > >>> Andrei Borzenkov wrote on 12.09.2019 at 14:21 in > message > : > > On Thu, Sep 12, 2019 at 12:40 PM Ulrich Windl > > wrote: > >> > >> Hi! > >> > >> I just d

Re: [ClusterLabs] Antw: Re: pacemaker resources under systemd

2019-09-12 Thread Andrei Borzenkov
On Thu, Sep 12, 2019 at 12:40 PM Ulrich Windl wrote: > > Hi! > > I just discovered an unpleasant side-effect of this: > SLES has "zypper ps" to show processes that use obsoleted binaries. Now if any > resource binary was replaced, zypper suggests to restart pacemaker (which is > nonsense, of

Re: [ClusterLabs] Reusing resource set in multiple constraints

2019-07-28 Thread Andrei Borzenkov
27.07.2019 11:04, Andrei Borzenkov writes: > Is it possible to have single definition of resource set that is later > referenced in order and location constraints? All syntax in > documentation or crmsh presumes inline set definition in location or > order statement. > > In th

Re: [ClusterLabs] Query on HA

2019-08-05 Thread Andrei Borzenkov
There is no one-size-fits-all answer. You should enable and configure stonith in pacemaker (which is disabled, otherwise described situation would not happen). You may consider wait_for_all (or better two_node) options in corosync that would prevent pacemaker to start unless both nodes are up. On
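A minimal corosync.conf quorum section implementing the suggestion above might look like the fragment below (a sketch; see votequorum(5) for authoritative semantics). Note that `two_node: 1` implies `wait_for_all` unless the latter is explicitly set to 0.

```
quorum {
    provider: corosync_votequorum
    # two_node grants quorum to a single surviving node after both have
    # been seen at least once; it implicitly enables wait_for_all, so a
    # freshly booted node will not start services until its peer is up.
    two_node: 1
}
```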

Re: [ClusterLabs] Compile fence agent on Ubuntu failing

2019-08-07 Thread Andrei Borzenkov
07.08.2019 12:21, Oleg Ulyanov writes: > Hi all, > I’m facing a problem with fence_vmware_soap on Ubuntu 16.04. Being able to > resolve dependency missing by manually installing python packages, I still > not able to connect to my vcenter. Apparently it’s a problem with 4.0.22 > version and

Re: [ClusterLabs] corosync.service (and sbd.service) are not stopper on pacemaker shutdown when corosync-qdevice is used

2019-07-29 Thread Andrei Borzenkov
On Mon, Jul 29, 2019 at 9:52 AM Jan Friesse wrote: > > Andrei > > Andrei Borzenkov wrote: > > corosync.service sets StopWhenUnneeded=yes which normally stops it when > > This was the case only for very limited time (v 3.0.1) and it's removed > now (v 3.0.2) because i

[ClusterLabs] Node reset on shutdown by SBD watchdog with corosync-qdevice

2019-07-28 Thread Andrei Borzenkov
In two node cluster + qnetd I consistently see the node that is being shut down last being reset during shutdown. I.e. - shutdown the first node - OK - shutdown the second node - reset As far as I understand what happens is - during shutdown pacemaker.service is stopped first. In above

[ClusterLabs] corosync.service (and sbd.service) are not stopper on pacemaker shutdown when corosync-qdevice is used

2019-07-28 Thread Andrei Borzenkov
corosync.service sets StopWhenUnneeded=yes which normally stops it when pacemaker is shut down. Unfortunately, corosync-qdevice.service declares Requires=corosync.service and corosync-qdevice.service itself is *not* stopped when pacemaker.service is stopped. Which means corosync.service remains

[ClusterLabs] Reusing resource set in multiple constraints

2019-07-27 Thread Andrei Borzenkov
Is it possible to have single definition of resource set that is later referenced in order and location constraints? All syntax in documentation or crmsh presumes inline set definition in location or order statement. In this particular case there will be set of filesystems that need to be

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-09 Thread Andrei Borzenkov
On Fri, Aug 9, 2019 at 9:25 AM Jan Friesse wrote: > > Олег Самойлов wrote: > > Hello all. > > > > I have a test bed with several virtual machines to test pacemaker. I > > simulate random failure on one of the node. The cluster will be on several > > data centres, so there is not stonith

Re: [ClusterLabs] Gracefully stop nodes one by one with disk-less sbd

2019-08-09 Thread Andrei Borzenkov
09.08.2019 16:34, Yan Gao writes: > Hi, > > With disk-less sbd, it's fine to stop cluster service from the cluster > nodes all at the same time. > > But if to stop the nodes one by one, for example with a 3-node cluster, > after stopping the 2nd node, the only remaining node resets itself

Re: [ClusterLabs] Antw: Interacting with Pacemaker from my code

2019-07-16 Thread Andrei Borzenkov
On Tue, Jul 16, 2019 at 9:48 AM Nishant Nakate wrote: > > > On Tue, Jul 16, 2019 at 11:33 AM Ulrich Windl > wrote: >> >> >>> Nishant Nakate schrieb am 16.07.2019 um 05:37 >> >>> in >> Nachricht >> : >> > Hi All, >> > >> > I am new to this community and HA tools. Need some guidance on my

Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-25 Thread Andrei Borzenkov
On Thu, Jul 25, 2019 at 3:20 AM Ondrej wrote: > > Is there any plan on getting this also into 1.1 branch? > If yes, then I would be for just introducing the configuration option in > 1.1.x with default to 'stop'. > +1 for back porting it from someone who just recently hit this (puzzling)

Re: [ClusterLabs] Fence_sbd script in Fedora30?

2019-09-23 Thread Andrei Borzenkov
23.09.2019 23:23, Vitaly Zolotusky writes: > Hello, > I am trying to upgrade to Fedora 30. The platform is two node cluster with > pacemaker. > In Fedora 28 we were using old fence_sbd script from 2013: > > # This STONITH script drives the shared-storage stonith plugin. > # Copyright (C) 2013

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-09 Thread Andrei Borzenkov
09.07.2019 13:08, Danka Ivanović writes: > Hi I didn't manage to start master with postgres, even if I increased start > timeout. I checked executable paths and start options. > When cluster is running with manually started master and slave started over > pacemaker, everything works ok. Today we

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote: > > > > Jul 09 09:16:32 [2679] postgres1 lrmd:debug: > > > child_kill_helper: Kill pid 12735's group Jul 09 09:16:34 [2679] > > > postgres1 lrmd: warning: child_timeout_callback: > > > PGSQL_monitor_15000

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote: > > > P.S. crm_resource is called by the resource agent (pgsqlms). And it shows > > the result of original resource probing which makes it confusing. At least > > it explains where these log entries come from. > > Not sure to understand

Re: [ClusterLabs] fencing on iscsi device not working

2019-10-30 Thread Andrei Borzenkov
30.10.2019 15:46, RAM PRASAD TWISTED ILLUSIONS wrote: > Hi everyone, > > I am trying to set up a storage cluster with two nodes, both running debian > buster. The two nodes, called duke and miles, have a LUN residing on a SAN > box as their shared storage device between them. As you can see in

Re: [ClusterLabs] Antw: Re: fencing on iscsi device not working

2019-11-06 Thread Andrei Borzenkov
06.11.2019 18:55, Ken Gaillot wrote: > On Wed, 2019-11-06 at 08:04 +0100, Ulrich Windl wrote: > Ken Gaillot wrote on 05.11.2019 at > 16:05 in >> >> message >> : >>> Coincidentally, the documentation for the pcmk_host_check default >>> was >>> recently updated for the upcoming 2.0.3

Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource

2019-12-04 Thread Andrei Borzenkov
On Thu, Dec 5, 2019 at 1:04 AM Jan Pokorný wrote: > > On 04/12/19 21:19 +0100, Jan Pokorný wrote: > > OTOH, this enforced split of state transitions is perhaps what makes > > the transaction (comprising perhaps countless other interdependent > > resources) serializable and thus feasible at all

Re: [ClusterLabs] serious problem with iSCSILogicalUnit

2019-12-16 Thread Andrei Borzenkov
16.12.2019 18:26, Stefan K wrote: > I think I got it.. > > It looks like that (A) > order pcs_rsc_order_set_iscsi-server_haip iscsi-server:start > iscsi-lun00:start iscsi-lun01:start iscsi-lun02:start ha-ip:start > symmetrical=false It is different from the configuration you showed originally. >
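The set constraint quoted in this thread can be written with pcs roughly as follows. This is a sketch only, reusing the resource IDs from the quoted snippet; `symmetrical=false` makes the ordering apply to start-up without imposing a reverse order on stop:

```shell
# Sketch, reusing resource names from the quoted thread.
# symmetrical=false: enforce this order on start, leave the stop order free.
pcs constraint order set iscsi-server iscsi-lun00 iscsi-lun01 iscsi-lun02 ha-ip \
    setoptions symmetrical=false
```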

Re: [ClusterLabs] Support for 'score' in rsc_order is deprecated...use 'kind' instead...

2019-10-28 Thread Andrei Borzenkov
28.10.2019 20:00, Jean-Francois Malouin wrote: > Hi, > > Building a new pacemaker cluster using corosync 3.0 and pacemaker 2.0.1 on > Debian/Buster 10 > I get this error when trying to insert an order constraint in the CIB to first > promote drbd to primary > then start/scan LVM. It used to work
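The deprecation warning in the subject reflects a syntax change: newer Pacemaker expresses ordering strength with `kind` rather than a numeric `score`. A sketch with hypothetical resource names:

```shell
# Sketch with hypothetical resource names. Modern Pacemaker order
# constraints take kind=Mandatory|Optional|Serialize instead of a score.
pcs constraint order promote drbd-clone then start lvm-activate kind=Mandatory
```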

Re: [ClusterLabs] -INFINITY location constraint not honored?

2019-10-18 Thread Andrei Borzenkov
According to it, you have a symmetric cluster (and apparently made a typo trying to change it) On Fri, Oct 18, 2019 at 10:29 AM Raffaele Pantaleoni wrote: > > On 17/10/2019 18:08, Ken Gaillot wrote: > > This does sound odd, possibly a bug. Can you provide the output of "pcs >
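The distinction at issue in this thread: in the default symmetric (opt-out) cluster, every resource may run on every node unless explicitly excluded. A sketch with hypothetical resource and node names:

```shell
# Sketch, hypothetical names. In a symmetric (opt-out) cluster, keep a
# resource off a node with a -INFINITY location constraint:
pcs constraint location my-resource avoids storage3=INFINITY
# Alternatively, switch to an asymmetric (opt-in) cluster, where resources
# run only where a positive location constraint allows them:
pcs property set symmetric-cluster=false
```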

Re: [ClusterLabs] What happened to "crm resource migrate"?

2019-10-15 Thread Andrei Borzenkov
On Tue, Oct 15, 2019 at 11:58 AM Yan Gao wrote: > > > > Help for "move" still says: > > resource# help move > > Move a resource to another node > > > > Move a resource away from its current location. > Looks like an issue in the version of crmsh. > > Xin, could you please take a look? > >

Re: [ClusterLabs] reducing corosync-qnetd "response time"

2019-10-24 Thread Andrei Borzenkov
24.10.2019 16:54, Sherrard Burton wrote: > background: > we are upgrading a (very) old HA cluster running heartbeat, DRBD and NFS, > with no stonith, to a much more modern implementation. for the existing > cluster, as well as the new one, the disk space requirements make > running a full

Re: [ClusterLabs] active/passive resource config

2019-10-25 Thread Andrei Borzenkov
On Fri, Oct 25, 2019 at 9:03 AM jyd <471204...@qq.com> wrote: > > Hi: > I want to use pacemaker to manage a resource named A, I want A only > started on one node, > only when the node is down or A cannot be started on this node, the A resource > will be started on other nodes. > And config a
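What the poster describes is Pacemaker's default behavior for an uncloned primitive: it runs on exactly one node and is recovered elsewhere on failure. A sketch, with a hypothetical Dummy agent standing in for resource "A":

```shell
# Sketch with a hypothetical ocf:heartbeat:Dummy agent standing in for "A".
# A plain (uncloned) primitive is already active/passive: Pacemaker starts
# it on exactly one node and moves it if that node or the start fails.
pcs resource create A ocf:heartbeat:Dummy op monitor interval=30s
# Optional: prefer one node without pinning the resource to it.
pcs constraint location A prefers node1=50
```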

Re: [ClusterLabs] volume group won't start in a nested DRBD setup

2019-10-28 Thread Andrei Borzenkov
28.10.2019 22:44, Jean-Francois Malouin wrote: > Hi, > > Is there any new magic that I'm unaware of that needs to be added to a > pacemaker cluster using a DRBD nested setup? pacemaker 2.0.x and DRBD 8.4.10 > on > Debian/Buster on a 2-node cluster with stonith. > Eventually this will host a

Re: [ClusterLabs] Antw: Safe way to stop pacemaker on both nodes of a two node cluster

2019-10-23 Thread Andrei Borzenkov
21.10.2019 9:39, Ulrich Windl wrote: "Dileep V Nair" wrote on 20.10.2019 at 17:54 in > message > > m>: > >> Hi, >> >> I am confused about the best way to stop pacemaker on both nodes of a >> two node cluster. The options I know of are >> 1. Put the cluster in Maintenance Mode,

Re: [ClusterLabs] SLES12 SP4: update_cib_stonith_devices_v2 nonsense "Watchdog will be used via SBD if fencing is required"

2019-10-23 Thread Andrei Borzenkov
23.10.2019 13:35, Ulrich Windl wrote: > Hi! > > In SLES12 SP4 I'm kind of annoyed due to repeating messages "unpack_config: > Watchdog will be used via SBD if fencing is required". > > While examining another problem, I found this sequence: > * Some unrelated resource was moved (migrated) >

Re: [ClusterLabs] -INFINITY location constraint not honored?

2019-10-18 Thread Andrei Borzenkov
18.10.2019 12:43, Raffaele Pantaleoni wrote: > > On 18/10/2019 10:21, Andrei Borzenkov wrote: >> According to it, you have a symmetric cluster (and apparently made a typo >> trying to change it) >> >> > name="symmetric-cluster" value="

Re: [ClusterLabs] node avoidance still leads to "status=Not installed" error for monitor op

2019-11-30 Thread Andrei Borzenkov
29.11.2019 16:37, Dennis Jacobfeuerborn wrote: Hi, I'm currently trying to set up a drbd 8.4 resource in a 3-node pacemaker cluster. The idea is to have nodes storage1 and storage2 running with the drbd clones and only use the third node storage3 for quorum. The way I'm trying to do this: pcs

Re: [ClusterLabs] Concept of a Shared ipaddress/resource for generic applicatons

2019-11-30 Thread Andrei Borzenkov
29.11.2019 17:46, Jan Pokorný wrote: On 27/11/19 20:13 +, matt_murd...@amat.com wrote: I finally understand that there is an Apache Resource for Pacemaker that assigns a single virtual ipaddress that "floats" between two nodes as in webservers.

[ClusterLabs] SBD with shared device - loss of both interconnect and shared device?

2019-10-09 Thread Andrei Borzenkov
What happens if both the interconnect and the shared device are lost by a node? I assume the node will reboot, correct? Now assuming (two-node cluster) the second node can still access the shared device, it will fence (via SBD) and continue takeover, right? If both nodes lost the shared device, both nodes will reboot and
