Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-13 Thread Jan Pokorný
On 13/08/19 09:44 +0200, Ulrich Windl wrote: Harvey Shepherd wrote on 12.08.2019 at 23:38 in message: >> I've been experiencing exactly the same issue. Pacemaker prioritises restarting the failed resource over maintaining a master instance. In my case I used

Re: [ClusterLabs] Increasing fence timeout

2019-08-13 Thread Casey & Gina
Thank you, I reached the same conclusion after reading through the script. Another question - I am no longer seeing the error quoted below as I've increased shell_timeout to 30 seconds, but failovers are still happening. From the logs, it appears that the cluster simply loses communication
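
For reference, shell_timeout is a per-device fence-agent parameter; a minimal sketch of raising it with pcs, assuming a hypothetical stonith device named fence-vm (available parameter names depend on the fence agent in use):

    pcs stonith update fence-vm shell_timeout=30 power_timeout=60
    pcs stonith show fence-vm    # verify the new values took effect

If failovers continue even though fencing itself works, note that the corosync token timeout is a separate knob from the fence timeouts.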

Re: [ClusterLabs] Antw: Antw: Re: Q: "crmd[7281]: warning: new_event_notification (7281-97955-15): Broken pipe (32)" as response to resource cleanup

2019-08-13 Thread Ken Gaillot
On Tue, 2019-08-13 at 11:06 +0200, Ulrich Windl wrote: > Hi, an update: after setting a failure-timeout for the resource, the stale monitor failure was removed automatically at the next cluster recheck (it seems). > Still I wonder why a resource cleanup didn't do that (bug?). Possibly ...
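
A minimal sketch of the two approaches discussed in this thread, using a hypothetical resource name (failure-timeout lets old failures expire at the next cluster recheck; cleanup clears them immediately):

    # let recorded failures expire automatically after two minutes
    crm configure primitive p_app ocf:pacemaker:Dummy \
        op monitor interval=30s \
        meta failure-timeout=120s
    # manual alternative: clear the failure history right away
    crm resource cleanup p_app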

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-13 Thread Jan Friesse
Олег Самойлов wrote: On 13 Aug 2019, at 15:55, Jan Friesse wrote: There is going to be a slightly different solution (setting these timeouts based on the corosync token timeout) which I'm working on, but it's a huge amount of work and not super high prio (a workaround exists), so no ETA
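
For readers following along, the timeouts in question live in corosync.conf; an illustrative sketch with placeholder values (the point of the thread is that the qdevice timeouts should eventually be derived from the token timeout automatically, so do not read these numbers as recommendations):

    totem {
        token: 10000              # token timeout in ms
    }
    quorum {
        provider: corosync_votequorum
        device {
            model: net
            timeout: 20000        # quorum.device.timeout
            sync_timeout: 30000   # quorum.device.sync_timeout
            net {
                host: 10.0.0.3    # hypothetical qnetd host
                algorithm: ffsplit
            }
        }
    }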

Re: [ClusterLabs] Antw: Re: why is node fenced ?

2019-08-13 Thread Lentes, Bernd
- On Aug 13, 2019, at 3:14 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: > You said you booted the hosts sequentially. From the logs they were starting in parallel. No. last says: ha-idg-1: reboot system boot 4.12.14-95.29-de Fri Aug 9 17:42 - 15:56 (3+22:14)

Re: [ClusterLabs] why is node fenced ?

2019-08-13 Thread Lentes, Bernd
- On Aug 13, 2019, at 3:34 PM, Matthias Ferdinand m...@14v.de wrote: >> 17:26:35 crm node standby ha-idg1- > > if that is not a copy error (ha-idg1- vs. ha-idg-1), then ha-idg-1 > was not set to standby, and installing updates may have done some > meddling with corosync/pacemaker (like

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-13 Thread Олег Самойлов
> On 13 Aug 2019, at 15:55, Jan Friesse wrote: > There is going to be a slightly different solution (setting these timeouts based on the corosync token timeout) which I'm working on, but it's a huge amount of work and not super high prio (a workaround exists), so no ETA yet. Is it will

Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-13 Thread Michael Powell
ct, this was sufficient to > > promote the other instance to master, but in the current product that does not happen. Currently, the failed application is restarted, as expected, and is promoted to master, but this takes tens of seconds. > > Did you try to disable resource stickiness for this ms?
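
A hedged sketch of the suggestion quoted above, assuming a hypothetical master/slave resource named ms_app; the open question in the thread is whether stickiness is what keeps the surviving slave from being promoted immediately:

    crm resource meta ms_app set resource-stickiness 0
    # or cluster-wide via resource defaults
    crm configure rsc_defaults resource-stickiness=0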

Re: [ClusterLabs] why is node fenced ?

2019-08-13 Thread Matthias Ferdinand
On Mon, Aug 12, 2019 at 04:09:48PM -0400, users-requ...@clusterlabs.org wrote: > Date: Mon, 12 Aug 2019 18:09:24 +0200 (CEST) > From: "Lentes, Bernd" > To: Pacemaker ML > Subject: [ClusterLabs] why is node fenced ? > Message-ID: >

[ClusterLabs] Antw: Re: why is node fenced ?

2019-08-13 Thread Ulrich Windl
You said you booted the hosts sequentially. From the logs they were starting in parallel. >>> "Lentes, Bernd" wrote on 13.08.2019 at 13:53 in message <767205671.1953556.1565697218136.javamail.zim...@helmholtz-muenchen.de>: > - On Aug 12, 2019, at 7:47 PM, Chris Walker cwal...@cray.com

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-13 Thread Jan Friesse
Олег Самойлов wrote: On 12 Aug 2019, at 8:46, Jan Friesse wrote: Let me try to shed some light on this: - dpd_interval is the qnetd variable for how often qnetd walks through the list of all clients (qdevices) and checks the timestamp of the last sent message. If the diff between the current timestamp

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-13 Thread Олег Самойлов
> On 12 Aug 2019, at 8:46, Jan Friesse wrote: > Let me try to shed some light on this: > - dpd_interval is the qnetd variable for how often qnetd walks through the list of all clients (qdevices) and checks the timestamp of the last sent message. If the diff between the current timestamp and the last sent

Re: [ClusterLabs] why is node fenced ?

2019-08-13 Thread Lentes, Bernd
- On Aug 12, 2019, at 7:47 PM, Chris Walker cwal...@cray.com wrote: > When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2, for > example, > > Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd: info: > pcmk_quorum_notification: > Quorum retained | membership=1320 members=1 >
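
When comparing what each node saw of the membership at startup, the following one-shot checks are commonly used (a sketch, not taken from the original thread):

    corosync-quorumtool -s    # quorum state and current member list
    crm_mon -1                # one-shot cluster status, including node states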

[ClusterLabs] pcs 0.9.168 released

2019-08-13 Thread Tomas Jelinek
I am happy to announce the latest release of pcs, version 0.9.168. Source code is available at: https://github.com/ClusterLabs/pcs/archive/0.9.168.tar.gz or https://github.com/ClusterLabs/pcs/archive/0.9.168.zip Complete change log for this release: ## [0.9.168] - 2019-08-02 ### Added - It is

[ClusterLabs] Antw: Re: Antw: why is node fenced ?

2019-08-13 Thread Ulrich Windl
>>> "Lentes, Bernd" schrieb am 13.08.2019 um 10:54 in Nachricht <848962511.1856599.1565686469666.javamail.zim...@helmholtz-muenchen.de>: > > ‑ On Aug 13, 2019, at 9:00 AM, Ulrich Windl ulrich.wi...@rz.uni‑regensburg.de > wrote: > >> Personally I feel more save with updates when the whole

[ClusterLabs] Antw: Antw: Re: Q: "crmd[7281]: warning: new_event_notification (7281-97955-15): Broken pipe (32)" as response to resource cleanup

2019-08-13 Thread Ulrich Windl
Hi, an update: after setting a failure-timeout for the resource, the stale monitor failure was removed automatically at the next cluster recheck (it seems). Still I wonder why a resource cleanup didn't do that (bug?). Regards, Ulrich >>> "Ulrich Windl" wrote on 13.08.2019 at 10:07 in message

[ClusterLabs] Antw: Antw: why is node fenced ?

2019-08-13 Thread Ulrich Windl
>>> "Ulrich Windl" schrieb am 13.08.2019 um 09:00 in Nachricht <5d52600102a100032...@gwsmtp.uni-regensburg.de>: ... > Personally I feel more save with updates when the whole cluster node is Of course I meant "more safe"... Time for coffein it seems ;-) ...

Re: [ClusterLabs] Ganesha, after a system reboot, showmounts nothing - why?

2019-08-13 Thread lejeczek
On 07/08/2019 16:48, lejeczek wrote: > hi guys, after a reboot the Ganesha exports are not there. It suffices to do: $ systemctl restart nfs-ganesha - and all is good again. > Would you have any ideas why? > I'm on CentOS 7.6 with nfs-ganesha-gluster-2.7.6-1.el7.x86_64;
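
A hedged guess at what to check: if the unit is not enabled, or starts before its storage backend is up, a reboot would leave no exports even though a manual restart fixes it. The drop-in below is purely illustrative for a Gluster-backed setup:

    systemctl is-enabled nfs-ganesha
    journalctl -u nfs-ganesha -b       # messages from the current boot

    # /etc/systemd/system/nfs-ganesha.service.d/order.conf (hypothetical drop-in)
    [Unit]
    After=network-online.target glusterd.service
    Wants=network-online.target

    systemctl daemon-reload            # pick up the drop-in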

Re: [ClusterLabs] Antw: why is node fenced ?

2019-08-13 Thread Lentes, Bernd
- On Aug 13, 2019, at 9:00 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: > Personally I feel more save with updates when the whole cluster node is > offline, not standby. When you are going to boot anyway, it won't make much of > a difference. Also you don't have to remember
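
The two update strategies being compared, sketched with the node names from this thread:

    # option 1: keep the node in the cluster but move resources off it
    crm node standby ha-idg-2
    # ... install patches, reboot ...
    crm node online ha-idg-2

    # option 2 (what Ulrich describes as feeling safer): take the whole stack down
    systemctl stop pacemaker corosync
    # ... install patches, reboot; the stack comes back on boot if enabled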

[ClusterLabs] Antw: Re: Q: "crmd[7281]: warning: new_event_notification (7281-97955-15): Broken pipe (32)" as response to resource cleanup

2019-08-13 Thread Ulrich Windl
>>> Ken Gaillot wrote on 13.08.2019 at 01:03 in message: > On Mon, 2019-08-12 at 17:46 +0200, Ulrich Windl wrote: >> Hi! I just noticed that a "crm resource cleanup " caused some unexpected behavior and the syslog message: >> crmd[7281]: warning: new_event_notification

[ClusterLabs] Antw: Re: Master/slave failover does not work as expected

2019-08-13 Thread Ulrich Windl
>>> Harvey Shepherd wrote on 12.08.2019 at 23:38 in message: > I've been experiencing exactly the same issue. Pacemaker prioritises restarting the failed resource over maintaining a master instance. In my case I used crm_simulate to analyse the actions planned and taken by
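
For anyone wanting to repeat that kind of analysis, a minimal crm_simulate invocation against the live CIB looks roughly like this:

    crm_simulate --live-check --show-scores   # allocation scores for the current CIB
    crm_simulate --live-check --simulate      # also simulate the pending transition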

[ClusterLabs] Antw: Re: Restoring network connection breaks cluster services

2019-08-13 Thread Ulrich Windl
>>> Jan Pokorný wrote on 12.08.2019 at 22:30 in message <20190812203037.gm25...@redhat.com>: [...] > Is it OK for lower-level components to make autonomous decisions without at least informing the higher level wrt. what exactly is going on, as we could observe here? [...] Excuse me for

[ClusterLabs] Antw: why is node fenced ?

2019-08-13 Thread Ulrich Windl
>>> "Lentes, Bernd" schrieb am 12.08.2019 um 18:09 in Nachricht <546330844.1686419.1565626164456.javamail.zim...@helmholtz-muenchen.de>: > Hi, > > last Friday (9th of August) i had to install patches on my two-node cluster. > I put one of the nodes (ha-idg-2) into standby (crm node standby

Re: [ClusterLabs] Gracefully stop nodes one by one with disk-less sbd

2019-08-13 Thread Roger Zhou
On 8/12/19 9:24 PM, Klaus Wenninger wrote: [...] > If you shut down solely pacemaker one-by-one on all nodes and these shutdowns are considered graceful, then you are not gonna experience any reboots (e.g. 3-node cluster). While revisiting what you said, I then ran `systemctl stop pacemaker`
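
The sequence under discussion, sketched for a three-node cluster with disk-less SBD:

    # run on node1, wait until it has left cleanly, then repeat on node2, then node3
    systemctl stop pacemaker
    # on a remaining node, confirm the departure was seen as graceful
    crm_mon -1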

Re: [ClusterLabs] Restoring network connection breaks cluster services

2019-08-13 Thread Jan Friesse
Momcilo, On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger wrote: On 8/7/19 12:26 PM, Momcilo Medic wrote: We have a three-node cluster that is set up to stop resources on lost quorum. Failure (network going down) handling is done properly, but recovery doesn't seem to work. What do you mean by
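
The "stop resources on lost quorum" behaviour described here is typically configured through Pacemaker's no-quorum-policy; a sketch:

    crm configure property no-quorum-policy=stop   # stop all resources in a partition without quorum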