Hi!

Maybe a configuration problem:

Operation 'monitor' [3482074] using fence_node2 could not be executed: Timed Out

In general I think there are too many errors. Could it be that a node is to be fenced, but fencing fails?

Do you use on-fail=block?

crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked

I would look at the regular syslog, too.

Kind regards,
Ulrich Windl

From: Users <users-boun...@clusterlabs.org> On Behalf Of Larry G. Mills via Users
Sent: Tuesday, May 13, 2025 12:05 AM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Cc: Larry G. Mills <lgmi...@fnal.gov>
Subject: [EXT] [ClusterLabs] Cluster (sometimes) hangs during shutdown - EL9

Hello all,

I have a fairly simple two-node cluster that supports three resources: promotable Postgres, fencing, and a virtual IP. This cluster is running on AlmaLinux 9.5 (RHEL9 variant).

In recent months, I have noticed that the cluster will occasionally hang when shutting down. I use "pcs" to manage the cluster, so the shutdown command used is "pcs cluster stop --all". During the last hang, I observed that all the resources appeared to be shut down except the virtual IP: the VIP remained in the "Started" state, and the cluster remained running on the node where the VIP was running. I was eventually able to stop the cluster by issuing "pcs cluster stop --all --request-timeout=1".

I have been using this same cluster configuration (across multiple OS releases) for years, and have never experienced a shutdown hang before. Unfortunately, I cannot reliably reproduce the scenario, but it has definitely happened on multiple occasions.

Some config information:

Linux node1.my.org 5.14.0-503.38.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 18 08:52:10 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux

corosync.x86_64    3.1.8-2.el9
pacemaker.x86_64   2.1.8-3.el9
pcs.x86_64         0.11.8-1.el9_5.1.alma.1

Cluster constraints:

Location Constraints:
  resource 'fence_node1' avoids node 'node1.my.orig' with score INFINITY
  resource 'fence_node2' avoids node 'node2.my.org' with score INFINITY
Colocation Constraints:
  Started resource 'pgsql-ha-vip' with Promoted resource 'pgsql-clone'
    score=INFINITY
Order Constraints:
  promote resource 'pgsql-clone' then start resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory
  demote resource 'pgsql-clone' then stop resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory

Although I'm not super adept at parsing the pacemaker logs, the following error messages looked problematic:

May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (log_list_item) notice: Actions: Stop pgsql-ha-vip ( node1.my.org ) due to node availability (blocked)
May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (pcmk__create_graph) crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)

A sanitized pacemaker log of the hang event is attached - 5/8/2025 @14:59.

Is this a latent configuration problem that's just now showing up, or a problem with the pacemaker versions currently in EL9?

Any thoughts appreciated,

Larry Mills
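
For anyone who wants to sift a long pacemaker log for messages like the two quoted above, a small filter along these lines might help. It is only a sketch: it assumes every line follows the layout seen in those excerpts (timestamp, host, daemon[pid], (function), severity, message), and the names `parse_line` and `suspicious`, the severity set, and the keyword list are illustrative choices, not anything provided by Pacemaker.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed line layout, copied from the excerpts in this thread, e.g.
# "May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (func) crit: text"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<host>\S+) "
    r"(?P<daemon>[\w-]+)\[(?P<pid>\d+)\] "
    r"\((?P<func>[^)]*)\) "
    r"(?P<severity>\w+): "
    r"(?P<message>.*)$"
)


class Entry(NamedTuple):
    stamp: str
    host: str
    daemon: str
    func: str
    severity: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Return the parsed fields, or None if the line does not fit the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Entry(m["stamp"], m["host"], m["daemon"],
                 m["func"], m["severity"], m["message"])


def suspicious(
    lines: Iterable[str],
    severities: frozenset = frozenset({"crit", "error", "warning"}),
    keywords: tuple = ("blocked", "Timed Out"),
) -> Iterator[Entry]:
    """Yield entries with a high severity or a keyword in the message.

    The default severities and keywords are guesses based on this thread.
    """
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that is not in the assumed layout
        if entry.severity in severities or any(k in entry.message for k in keywords):
            yield entry
```

Passing an open file object for the attached log to `suspicious()` would yield both lines quoted above: the first because its message contains "blocked", the second because its severity is "crit".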