Hi!

Maybe a configuration problem:
Operation 'monitor' [3482074] using fence_node2 could not be executed: Timed Out
In general, I think the log shows too many errors. Could it be that a node is 
supposed to be fenced, but the fencing fails?


Do you use on-fail=block?
crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked

I would look at the regular syslog, too.

Kind regards,
Ulrich Windl

From: Users <users-boun...@clusterlabs.org> On Behalf Of Larry G. Mills via Users
Sent: Tuesday, May 13, 2025 12:05 AM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Cc: Larry G. Mills <lgmi...@fnal.gov>
Subject: [EXT] [ClusterLabs] Cluster (sometimes) hangs during shutdown - EL9

Hello all,

I have a fairly simple two-node cluster that supports three resources - 
promotable Postgres, fencing, and virtual IP.  This cluster is running on 
AlmaLinux 9.5 (RHEL9 variant).  In recent months, I have noticed that the 
cluster will occasionally hang when shutting down.  I use "pcs" to manage the 
cluster, so the shutdown command used is "pcs cluster stop --all".

During the last hang, I observed that all the resources appeared to be shut 
down except the virtual IP - the VIP remained in the "Started" state, and the 
cluster remained running on the node where the VIP was running.  I eventually 
was able to stop the cluster by issuing a "pcs cluster stop --all 
--request-timeout=1".

I have been using this same cluster configuration (across multiple OS releases) 
for years, and have never experienced a shutdown hang before.  Unfortunately, I 
cannot reliably reproduce the scenario, but it has definitely happened on 
multiple occasions.


Some config information:

Linux node1.my.org 5.14.0-503.38.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 18 08:52:10 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
corosync.x86_64     3.1.8-2.el9
pacemaker.x86_64    2.1.8-3.el9
pcs.x86_64          0.11.8-1.el9_5.1.alma.1


Cluster constraints:

Location Constraints:
  resource 'fence_node1' avoids node 'node1.my.org' with score INFINITY
  resource 'fence_node2' avoids node 'node2.my.org' with score INFINITY
Colocation Constraints:
  Started resource 'pgsql-ha-vip' with Promoted resource 'pgsql-clone'
    score=INFINITY
Order Constraints:
  promote resource 'pgsql-clone' then start resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory
  demote resource 'pgsql-clone' then stop resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory


Although I'm not super adept at parsing the pacemaker logs, the following error 
messages looked problematic:

May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (log_list_item) notice: Actions: Stop       pgsql-ha-vip     ( node1.my.org )  due to node availability (blocked)
May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (pcmk__create_graph) crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)
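
To find every occurrence of this message in a longer pacemaker log, a small script along the following lines might help. It is only a sketch: the function name find_blocked_shutdowns and the regular expression are illustrative assumptions based on the one "crit" line quoted above, not anything provided by pacemaker, and other versions may word the message differently.

```python
import re

# Pattern modelled on the single "crit" line quoted above (an assumption;
# other pacemaker versions may phrase this message differently).
BLOCKED_RE = re.compile(
    r"crit: Cannot shut down (?P<node>\S+) because of (?P<resource>\S+): blocked"
)


def find_blocked_shutdowns(log_text):
    """Return (node, resource) pairs for each blocked-shutdown message."""
    # Mail clients wrap long log lines, so collapse all whitespace runs
    # into single spaces before matching.
    flat = " ".join(log_text.split())
    return [(m.group("node"), m.group("resource"))
            for m in BLOCKED_RE.finditer(flat)]


# The wrapped log excerpt exactly as it appeared in the message.
sample = """May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000]
(pcmk__create_graph)        crit: Cannot shut down node1.my.org because of
pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)"""

print(find_blocked_shutdowns(sample))
```

Running it over the full log text (read from the log file rather than the sample string) would show whether the VIP is the only resource that ever blocks a shutdown.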


A sanitized pacemaker log of the hang event is attached - 5/8/2025 @14:59.

Is this a latent configuration problem that's just now showing up, or a problem 
with the pacemaker versions currently in EL9?

Any thoughts appreciated,

Larry Mills