Re: [ClusterLabs] how to setup single node cluster

2021-04-07 Thread d tbsky
Reid Wahl 
> Disaster recovery is the main use case we had in mind. See the RHEL 8.2 
> release notes:
>   - 
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.2_release_notes/rhel-8-2-0-release#enhancement_high-availability-and-clusters
>
> I thought I also remembered some other use case involving MS SQL, but I can't 
> find anything about it so I might be remembering incorrectly.

Thanks a lot for the confirmation. According to the discussion above, I
think the setup procedure is similar for a single-node cluster. I should
still use corosync (although it seems to have nothing to sync with).
I will try that when I have time.
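
If I understand the discussion correctly, a minimal single-node setup on
RHEL 8 would look roughly like this (host name and cluster name are just
placeholders, and stonith is disabled here only because there is no fence
device to configure):

# pcs host auth node1.example.com
# pcs cluster setup mycluster node1.example.com
# pcs cluster start --all
# pcs cluster enable --all
# pcs property set stonith-enabled=false

after which resources can be added with "pcs resource create" as usual.
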
Thanks again for your kind help!
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] how to setup single node cluster

2021-04-07 Thread Reid Wahl
On Wed, Apr 7, 2021 at 11:27 PM d tbsky  wrote:

> Reid Wahl 
> > I don't think we do require fencing for single-node clusters. (Anyone at
> Red Hat, feel free to comment.) I vaguely recall an internal mailing list
> or IRC conversation where we discussed this months ago, but I can't find it
> now. I've also checked our support policies documentation, and it's not
> mentioned in the "cluster size" doc or the "fencing" doc.
>
> Since the cluster is 100% alive or 100% dead with a single node, I
> think fencing/quorum is not required. I am just curious what the use
> case is. Since Red Hat supports it, it must be useful in a real
> scenario.
>

Disaster recovery is the main use case we had in mind. See the RHEL 8.2
release notes:
  -
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.2_release_notes/rhel-8-2-0-release#enhancement_high-availability-and-clusters

I thought I also remembered some other use case involving MS SQL, but I
can't find anything about it so I might be remembering incorrectly.


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] how to setup single node cluster

2021-04-07 Thread d tbsky
Reid Wahl 
> I don't think we do require fencing for single-node clusters. (Anyone at Red 
> Hat, feel free to comment.) I vaguely recall an internal mailing list or IRC 
> conversation where we discussed this months ago, but I can't find it now. 
> I've also checked our support policies documentation, and it's not mentioned 
> in the "cluster size" doc or the "fencing" doc.

   Since the cluster is 100% alive or 100% dead with a single node, I
think fencing/quorum is not required. I am just curious what the use
case is. Since Red Hat supports it, it must be useful in a real
scenario.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] how to setup single node cluster

2021-04-07 Thread Klaus Wenninger

On 4/8/21 8:16 AM, Reid Wahl wrote:



On Wed, Apr 7, 2021 at 9:46 PM Strahil Nikolov wrote:


I always thought that the setup is the same, just the node count is
only one.

I guess you need pcs, corosync + pacemaker.
If RH is going to support it, they will require fencing. Most
probably sbd or ipmi are the best candidates.


I don't think we do require fencing for single-node clusters. (Anyone 
at Red Hat, feel free to comment.) I vaguely recall an internal 
mailing list or IRC conversation where we discussed this months ago, 
but I can't find it now. I've also checked our support policies 
documentation, and it's not mentioned in the "cluster size" doc or the 
"fencing" doc.


The closest thing I can find is the following, from the cluster size 
doc[1]:

~~~
RHEL 8.2 and later: Support for 1 or more nodes

  * Single node clusters do not support DLM and GFS2 filesystems (as
they require fencing).

~~~

To me that suggests that fencing isn't required in a single-node 
cluster. Maybe sbd could work (I haven't thought it through), but 
conventional power fencing (e.g., fence_ipmilan) wouldn't. That's 
because most conventional power fencing agents require sending a 
"power on" signal after the "power off" is complete.

And moreover, you have to be alive enough to kick off
conventional power fencing to self-fence ;-)
With sbd, the hardware watchdog should kick in.

Klaus


[1] https://access.redhat.com/articles/3069031 




Best Regards,
Strahil Nikolov

On Thu, Apr 8, 2021 at 6:52, d tbsky wrote:
Hi:
    I found that RHEL 8.2 supports single-node clusters now, but I didn't
find further documentation explaining the concept. RHEL 8.2 also supports
"disaster recovery clusters", so I think maybe a single-node disaster
recovery cluster is not a bad idea.

    I think corosync is still necessary for a single-node cluster, or
is there some other new style of configuration?

    Thanks for your help!
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users


ClusterLabs home: https://www.clusterlabs.org/






--
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] how to setup single node cluster

2021-04-07 Thread d tbsky
Ulrich Windl 
>
> >>> d tbsky  wrote on 08.04.2021 at 05:52 in message
> :
> > Hi:
> > I found that RHEL 8.2 supports single-node clusters now, but I didn't
> > find further documentation explaining the concept. RHEL 8.2 also supports
> > "disaster recovery clusters", so I think maybe a single-node disaster
> > recovery cluster is not a bad idea.
> >
> >    I think corosync is still necessary for a single-node cluster, or
> > is there some other new style of configuration?
>
> IMHO if you want a single-node cluster, and you are not planning to add more
> nodes, you'll be better off using a utility like monit to manage your 
> processes...

Sorry, I didn't mention pacemaker in my previous post. I want a single-node
pacemaker disaster recovery cluster which can be managed by the normal
pacemaker utilities like pcs.
Maybe there are other cases where a single-node pacemaker cluster is
useful; I just don't know them yet.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] how to setup single node cluster

2021-04-07 Thread Reid Wahl
On Wed, Apr 7, 2021 at 9:46 PM Strahil Nikolov 
wrote:

> I always thought that the setup is the same, just the node count is only
> one.
>
> I guess you need pcs, corosync + pacemaker.
> If RH is going to support it, they will require fencing. Most probably sbd
> or ipmi are the best candidates.
>

I don't think we do require fencing for single-node clusters. (Anyone at
Red Hat, feel free to comment.) I vaguely recall an internal mailing list
or IRC conversation where we discussed this months ago, but I can't find it
now. I've also checked our support policies documentation, and it's not
mentioned in the "cluster size" doc or the "fencing" doc.

The closest thing I can find is the following, from the cluster size doc[1]:
~~~
RHEL 8.2 and later: Support for 1 or more nodes

   - Single node clusters do not support DLM and GFS2 filesystems (as they
   require fencing).

~~~

To me that suggests that fencing isn't required in a single-node cluster.
Maybe sbd could work (I haven't thought it through), but conventional power
fencing (e.g., fence_ipmilan) wouldn't. That's because most conventional
power fencing agents require sending a "power on" signal after the "power
off" is complete.

[1] https://access.redhat.com/articles/3069031
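
If someone did want self-fencing on a single node, watchdog-only sbd might
be the thing to experiment with. A rough, untested sketch (the watchdog
device path and timeout value here are assumptions, not a recommendation):

# pcs cluster stop --all
# pcs stonith sbd enable --watchdog=/dev/watchdog
# pcs cluster start --all
# pcs property set stonith-watchdog-timeout=10s

That relies on the hardware watchdog rather than a "power on" signal, so it
sidesteps the problem described above for conventional power fencing agents.
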


> Best Regards,
> Strahil Nikolov
>
> On Thu, Apr 8, 2021 at 6:52, d tbsky
>  wrote:
> Hi:
> I found that RHEL 8.2 supports single-node clusters now, but I didn't
> find further documentation explaining the concept. RHEL 8.2 also supports
> "disaster recovery clusters", so I think maybe a single-node disaster
> recovery cluster is not a bad idea.
>
>   I think corosync is still necessary for a single-node cluster, or
> is there some other new style of configuration?
>
> Thanks for your help!
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>


-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] how to setup single node cluster

2021-04-07 Thread Ulrich Windl
>>> d tbsky  wrote on 08.04.2021 at 05:52 in message
:
> Hi:
> I found that RHEL 8.2 supports single-node clusters now, but I didn't
> find further documentation explaining the concept. RHEL 8.2 also supports
> "disaster recovery clusters", so I think maybe a single-node disaster
> recovery cluster is not a bad idea.
> 
>    I think corosync is still necessary for a single-node cluster, or
> is there some other new style of configuration?

IMHO if you want a single-node cluster, and you are not planning to add more
nodes, you'll be better off using a utility like monit to manage your 
processes...

> 
> Thanks for your help!
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] how to setup single node cluster

2021-04-07 Thread d tbsky
Hi:
    I found that RHEL 8.2 supports single-node clusters now, but I didn't
find further documentation explaining the concept. RHEL 8.2 also supports
"disaster recovery clusters", so I think maybe a single-node disaster
recovery cluster is not a bad idea.

   I think corosync is still necessary for a single-node cluster, or
is there some other new style of configuration?

    Thanks for your help!
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] "iscsi.service: Unit cannot be reloaded because it is inactive."

2021-04-07 Thread Jason Long
Hello,
Excuse me, when I rebooted my server, that problem appeared!
How do I look at the ocf:heartbeat:LVM and ocf:heartbeat:LVM-activate resource agents?

# pcs resource config
 Group: apache
  Resource: httpd_fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/mapper/vg_apache-lv_apache directory=/var/www 
fstype=ext4
   Operations: monitor interval=20s timeout=40s (httpd_fs-monitor-interval-20s)
               start interval=0s timeout=60s (httpd_fs-start-interval-0s)
               stop interval=0s timeout=60s (httpd_fs-stop-interval-0s)
  Resource: httpd_vip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=24 ip=192.168.56.100
   Operations: monitor interval=10s timeout=20s (httpd_vip-monitor-interval-10s)
               start interval=0s timeout=20s (httpd_vip-start-interval-0s)
               stop interval=0s timeout=20s (httpd_vip-stop-interval-0s)
  Resource: httpd_ser (class=ocf provider=heartbeat type=apache)
   Attributes: configfile=/etc/httpd/conf/httpd.conf 
statusurl=http://127.0.0.1/server-status
   Operations: monitor interval=10s timeout=20s (httpd_ser-monitor-interval-10s)
               start interval=0s timeout=40s (httpd_ser-start-interval-0s)
               stop interval=0s timeout=60s (httpd_ser-stop-interval-0s)








On Monday, April 5, 2021, 07:28:11 PM GMT+4:30, Ken Gaillot wrote:





On Sat, 2021-04-03 at 14:35 +, Jason Long wrote:
> Hello,
> I configured my clustering lab with three nodes. One of my nodes
> is iSCSI Shared Storage. Everything was OK until I restarted my iSCSI
> Shared Storage. On node1 I checked the status of my cluster:
> 
> # pcs status 
> Cluster name: mycluster
> Cluster Summary:
>  * Stack: corosync
>  * Current DC: node1 (version 2.0.5-10.fc33-ba59be7122) - partition
> with quorum
>  * Last updated: Sat Apr  3 18:45:49 2021
>  * Last change:  Mon Mar 29 19:36:35 2021 by root via cibadmin on
> node1
>  * 2 nodes configured
>  * 3 resource instances configured
> 
> 
> Node List:
>  * Online: [ node1 node2 ]
> 
> 
> Full List of Resources:
>  * Resource Group: apache:
>    * httpd_fs    (ocf::heartbeat:Filesystem):    Stopped
>    * httpd_vip    (ocf::heartbeat:IPaddr2):    Stopped
>    * httpd_ser    (ocf::heartbeat:apache):    Stopped
> 
> 
> Failed Resource Actions:
>  * httpd_fs_start_0 on node1 'not installed' (5): call=14,
> status='complete', exitreason='Couldn't find device
> [/dev/mapper/vg_apache-lv_apache]. Expected /dev/??? to exist', last-
> rc-change='2021-04-03 18:37:04 +04:30', queued=0ms, exec=502ms
>  * httpd_fs_start_0 on node2 'not installed' (5): call=14,
> status='complete', exitreason='Couldn't find device
> [/dev/mapper/vg_apache-lv_apache]. Expected /dev/??? to exist', last-
> rc-change='2021-04-03 18:37:05 +04:30', queued=0ms, exec=540ms

The above messages suggest that the web server file system requires the
vg_apache volume group and lv_apache logical volume to be active, but
they aren't. You may want to look at the ocf:heartbeat:LVM and
ocf:heartbeat:LVM-activate resource agents to bring these dependencies
into the cluster.

The ocf:heartbeat:iSCSILogicalUnit and ocf:heartbeat:iSCSITarget agents
may also be of interest.
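
For example, to see the agent's parameters and (roughly) how it could be
added ahead of the Filesystem resource in the existing group (the resource
name and the vg_access_mode value below are only guesses for a non-shared
volume group, so adjust to your setup):

# pcs resource describe ocf:heartbeat:LVM-activate
# pcs resource create httpd_lvm ocf:heartbeat:LVM-activate \
      vgname=vg_apache vg_access_mode=system_id \
      --group apache --before httpd_fs

That way the volume group is activated before the Filesystem resource tries
to mount /dev/mapper/vg_apache-lv_apache.
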


> 
> Daemon Status:
>  corosync: active/enabled
>  pacemaker: active/enabled
>  pcsd: active/enabled
> 
> 
> And checked the iSCSI Shared Storage. It showed me below error:
> 
> [root@node3 ~]# systemctl status iscsi.service 
> ● iscsi.service - Login and scanning of iSCSI devices
>      Loaded: loaded (/usr/lib/systemd/system/iscsi.service; enabled;
> vendor preset: enabled)
>      Active: inactive (dead)
>  Condition: start condition failed at Sat 2021-04-03 18:49:08 +0430;
> 2s ago
>              └─ ConditionDirectoryNotEmpty=/var/lib/iscsi/nodes was
> not met
>        Docs: man:iscsiadm(8)
>              man:iscsid(8)
> 
> 
> Apr 03 18:39:17 node3.localhost.localdomain systemd[1]: Condition
> check resulted in Login and scanning of iSCSI devices being skipped.
> Apr 03 18:39:17 node3.localhost.localdomain systemd[1]:
> iscsi.service: Unit cannot be reloaded because it is inactive.
> Apr 03 18:39:17 node3.localhost.localdomain systemd[1]:
> iscsi.service: Unit cannot be reloaded because it is inactive.
> Apr 03 18:49:08 node3.localhost.localdomain systemd[1]: Condition
> check resulted in Login and scanning of iSCSI devices being skipped.
> 
> 
> Why is "iscsi.service" inactive? I tried to restart it, but it
> couldn't start!
> How to solve it?
> 
> Thanks.

> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



Re: [ClusterLabs] "iscsi.service: Unit cannot be reloaded because it is inactive."

2021-04-07 Thread Jason Long
Thank you.
The problem was that I forgot to open port 3260/tcp on node1 and node2. I
opened that port on my nodes and the result is:

Full List of Resources:
    * Resource Group: apache:
    * httpd_fs    (ocf::heartbeat:Filesystem):     Started
    * httpd_vip    (ocf::heartbeat:IPaddr2):        Started
    * httpd_ser    (ocf::heartbeat:apache):        Started
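
For reference, assuming firewalld is in use, opening that port looks
roughly like this on each node that needs it:

# firewall-cmd --permanent --add-port=3260/tcp
# firewall-cmd --reload
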






On Monday, April 5, 2021, 07:28:11 PM GMT+4:30, Ken Gaillot wrote:





On Sat, 2021-04-03 at 14:35 +, Jason Long wrote:
> Hello,
> I configured my clustering lab with three nodes. One of my nodes
> is iSCSI Shared Storage. Everything was OK until I restarted my iSCSI
> Shared Storage. On node1 I checked the status of my cluster:
> 
> # pcs status 
> Cluster name: mycluster
> Cluster Summary:
>  * Stack: corosync
>  * Current DC: node1 (version 2.0.5-10.fc33-ba59be7122) - partition
> with quorum
>  * Last updated: Sat Apr  3 18:45:49 2021
>  * Last change:  Mon Mar 29 19:36:35 2021 by root via cibadmin on
> node1
>  * 2 nodes configured
>  * 3 resource instances configured
> 
> 
> Node List:
>  * Online: [ node1 node2 ]
> 
> 
> Full List of Resources:
>  * Resource Group: apache:
>    * httpd_fs    (ocf::heartbeat:Filesystem):    Stopped
>    * httpd_vip    (ocf::heartbeat:IPaddr2):    Stopped
>    * httpd_ser    (ocf::heartbeat:apache):    Stopped
> 
> 
> Failed Resource Actions:
>  * httpd_fs_start_0 on node1 'not installed' (5): call=14,
> status='complete', exitreason='Couldn't find device
> [/dev/mapper/vg_apache-lv_apache]. Expected /dev/??? to exist', last-
> rc-change='2021-04-03 18:37:04 +04:30', queued=0ms, exec=502ms
>  * httpd_fs_start_0 on node2 'not installed' (5): call=14,
> status='complete', exitreason='Couldn't find device
> [/dev/mapper/vg_apache-lv_apache]. Expected /dev/??? to exist', last-
> rc-change='2021-04-03 18:37:05 +04:30', queued=0ms, exec=540ms

The above messages suggest that the web server file system requires the
vg_apache volume group and lv_apache logical volume to be active, but
they aren't. You may want to look at the ocf:heartbeat:LVM and
ocf:heartbeat:LVM-activate resource agents to bring these dependencies
into the cluster.

The ocf:heartbeat:iSCSILogicalUnit and ocf:heartbeat:iSCSITarget agents
may also be of interest.


> 
> Daemon Status:
>  corosync: active/enabled
>  pacemaker: active/enabled
>  pcsd: active/enabled
> 
> 
> And checked the iSCSI Shared Storage. It showed me below error:
> 
> [root@node3 ~]# systemctl status iscsi.service 
> ● iscsi.service - Login and scanning of iSCSI devices
>      Loaded: loaded (/usr/lib/systemd/system/iscsi.service; enabled;
> vendor preset: enabled)
>      Active: inactive (dead)
>  Condition: start condition failed at Sat 2021-04-03 18:49:08 +0430;
> 2s ago
>              └─ ConditionDirectoryNotEmpty=/var/lib/iscsi/nodes was
> not met
>        Docs: man:iscsiadm(8)
>              man:iscsid(8)
> 
> 
> Apr 03 18:39:17 node3.localhost.localdomain systemd[1]: Condition
> check resulted in Login and scanning of iSCSI devices being skipped.
> Apr 03 18:39:17 node3.localhost.localdomain systemd[1]:
> iscsi.service: Unit cannot be reloaded because it is inactive.
> Apr 03 18:39:17 node3.localhost.localdomain systemd[1]:
> iscsi.service: Unit cannot be reloaded because it is inactive.
> Apr 03 18:49:08 node3.localhost.localdomain systemd[1]: Condition
> check resulted in Login and scanning of iSCSI devices being skipped.
> 
> 
> Why is "iscsi.service" inactive? I tried to restart it, but it
> couldn't start!
> How to solve it?
> 
> Thanks.

> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: cluster-recheck-interval and failure-timeout

2021-04-07 Thread Antony Stone
On Wednesday 07 April 2021 at 10:40:54, Ulrich Windl wrote:

> >>> Ken Gaillot  wrote on 06.04.2021 at 15:58
> > On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:

> >> Sorry I don't get it: If you have a timestamp for each failure-
> >> timeout, what's so hard to put all the fail counts that are older than
> >> failure-timeout on a list, and then reset that list to zero?
> > 
> > That's exactly the issue -- we don't have a timestamp for each failure.
> > Only the most recent failed operation, and the total fail count (per
> > resource and operation), are stored in the CIB status.
> > 
> > We could store all failures in the CIB, but that would be a significant
> > project, and we'd need new options to keep the current behavior as the
> > default.
> 
> I still don't quite get it: Some failing operation increases the
> fail-count, and the time stamp for the failing operation is recorded
> (crm_mon can display it). So solving this problem (saving the last time
> for each fail count) doesn't look so hard to do.

For the avoidance of doubt, I (who started this thread) have solved my problem 
by following the advice from Reid Wahl - I was putting the "failure-timeout" 
parameter into the incorrect section of my resource definition.  Moving it to
the "meta" section has resolved my problem.
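
In case it helps anyone else, with pcs that amounts to something like the
following (the resource name is just an example; with crmsh it is the "meta"
keyword inside the primitive definition):

# pcs resource meta my_resource failure-timeout=180 migration-threshold=3
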

The way it works now makes complete sense to me:

1. A failure happens, and gets corrected.

2. Provided no further failure of that resource occurs within the failure-
timeout setting, the failure gets forgotten about.

3. If a further failure of the resource does occur within failure-timeout, the 
original timestamp is discarded, the failure count is incremented, and the 
timestamp of the new failure is used to check whether there's another failure 
within failure-timeout of *that*

4. If no further failure occurs within failure-timeout of the most recent 
failure timestamp, all previous failures are forgotten.

5. If enough failures occur within failure-timeout *of each other* then the 
failure count gets incremented to the point where the resource gets moved to 
another node.

Regards,


Antony.

-- 
"It wouldn't be a good idea to talk about him behind his back in front of 
him."

 - murble

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: Re: Antw: [EXT] Re: cluster-recheck-interval and failure-timeout

2021-04-07 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 06.04.2021 at 15:58 in
message
:
> On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:
>> > > > Ken Gaillot  wrote on 31.03.2021 at
>> > > > 15:48 in
>> 
>> message
>> <7dfc7c46442db17d9645854081f1269261518f84.ca...@redhat.com>:
>> > On Wed, 2021‑03‑31 at 14:32 +0200, Antony Stone wrote:
>> > > Hi.
>> > > 
>> > > I'm trying to understand what looks to me like incorrect
>> > > behaviour
>> > > between 
>> > > cluster‑recheck‑interval and failure‑timeout, under pacemaker
>> > > 2.0.1
>> > > 
>> > > I have three machines in a corosync (3.0.1 if it matters)
>> > > cluster,
>> > > managing 12 
>> > > resources in a single group.
>> > > 
>> > > I'm following documentation from:
>> > > 
>> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>> > > 
>> > > and
>> > > 
>> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
>> > > 
>> > > I have set a cluster property:
>> > > 
>> > >  cluster‑recheck‑interval=60s
>> > > 
>> > > I have set a resource property:
>> > > 
>> > >  failure‑timeout=180
>> > > 
>> > > The docs say failure‑timeout is "How many seconds to wait before
>> > > acting as if 
>> > > the failure had not occurred, and potentially allowing the
>> > > resource
>> > > back to 
>> > > the node on which it failed."
>> > > 
>> > > I think this should mean that if the resource fails and gets
>> > > restarted, the 
>> > > fact that it failed will be "forgotten" after 180 seconds (or
>> > > maybe a
>> > > little 
>> > > longer, depending on exactly when the next cluster recheck is
>> > > done).
>> > > 
>> > > However what I'm seeing is that if the resource fails and gets
>> > > restarted, and 
>> > > this then happens an hour later, it's still counted as two
>> > > failures.  If it 
>> > 
>> > That is exactly correct.
>> > 
>> > > fails and gets restarted another hour after that, it's recorded
>> > > as
>> > > three 
>> > > failures and (because I have "migration‑threshold=3") it gets
>> > > moved
>> > > to another 
>> > > node (and therefore all the other resources in group are moved as
>> > > well).
>> > > 
>> > > So, what am I misunderstanding about "failure‑timeout", and what
>> > > configuration 
>> > > setting do I need to use to tell pacemaker that "provided the
>> > > resource hasn't 
>> > > failed within the past X seconds, forget the fact that it failed
>> > > more
>> > > than X 
>> > > seconds ago"?
>> > 
>> > Unfortunately, there is no way. failure‑timeout expires *all*
>> > failures
>> > once the *most recent* is that old. It's a bit counter‑intuitive
>> > but
>> > currently, Pacemaker only remembers a resource's most recent
>> > failure
>> > and the total count of failures, and changing that would be a big
>> > project.
>> 
>> Hi!
>> 
>> Sorry I don't get it: If you have a timestamp for each failure-
>> timeout, what's
>> so hard to put all the fail counts that are older than failure-
>> timeout on a
>> list, and then reset that list to zero?
> 
> That's exactly the issue -- we don't have a timestamp for each failure.
> Only the most recent failed operation, and the total fail count (per
> resource and operation), are stored in the CIB status.
> 
> We could store all failures in the CIB, but that would be a significant
> project, and we'd need new options to keep the current behavior as the
> default.

Hi!

I still don't quite get it: Some failing operation increases the fail-count,
and the time stamp for the failing operation is recorded (crm_mon can display
it). So solving this problem (saving the last time for each fail count) doesn't
look so hard to do.
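
For reference, the current fail count and the time of the most recent
failure can be inspected with something like:

# crm_mon -1 --failcounts
# crm_failcount --query --resource my_rsc --node node1

(my_rsc and node1 being placeholders for the resource and node in question).
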

Regards,
Ulrich


> 
>> I mean: That would be what everyone expects.
>> What is implemented instead is like FIFO scheduling: As long as there
>> is a new
>> entry at the head of the queue, the jobs at the tail will never be
>> executed.
>> 
>> Regards,
>> Ulrich
>> 
>> > 
>> > 
>> > > Thanks,
>> > > 
>> > > 
>> > > Antony.
>> > > 
>> > 
>> > ‑‑ 
>> > Ken Gaillot 
>> > 
>> > ___
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users 
>> > 
>> > ClusterLabs home: https://www.clusterlabs.org/ 
>> 
>> 
>> 
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
> -- 
> Ken Gaillot 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/