I have seen an issue where the following sequence of events happens:

1. pacemaker-attrd becomes unresponsive to IPC for some reason.
2. pacemakerd detects this, kills it, and restarts it.
3. pacemaker-controld attempts to connect to pacemaker-attrd before it has finished starting, gets ECONNREFUSED, treats this as a fatal error, and exits (a small standalone demo of this condition follows the list).
4. pacemakerd reaps the exit status of controld and starts shutting down.
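To show the condition in step 3 in isolation, here is a minimal standalone demo. It uses plain POSIX sockets rather than pacemaker or libqb code, the socket path is made up, and error checks are omitted for brevity; it only illustrates the generic situation of a client connecting while the daemon's listener is gone and not yet back, which fails immediately with ECONNREFUSED.

/*
 * Demo only, not pacemaker code: connecting to a socket whose listener
 * has gone away fails with ECONNREFUSED -- the same transient condition
 * a client hits in the window while a daemon is being restarted.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/tmp/demo-attrd.sock";   /* made-up path */
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int listener, client;

    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);

    /* Stand in for the old daemon: create the socket, then "die". */
    listener = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(listener, (struct sockaddr *) &addr, sizeof(addr));
    listen(listener, 1);
    close(listener);            /* daemon killed; socket file left behind */

    /* Stand in for the client connecting before the restart completes. */
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (connect(client, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        /* Prints "connect failed: Connection refused" */
        printf("connect failed: %s\n", strerror(errno));
    }
    close(client);
    unlink(path);
    return 0;
}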
Fencing was not configured, so this resulted in losing automation on the host; but either way, this seems like something that pacemaker should handle. While reproducing the issue by forcing attrd to be unresponsive with SIGSTOP, I have also observed it failing in a slightly different way:

1. pacemaker-attrd becomes unresponsive to IPC for some reason.
2. pacemaker-controld connects to attrd and sends it a request.
3. pacemakerd kills attrd and restarts it.
4. pacemaker-controld times out waiting for a response, receives ENOTCONN, treats that as a fatal error, and exits.
5. pacemakerd reaps the exit status of controld and starts shutting down.

I saw this in pacemaker 2.1.6, but from reading the code I believe pacemaker 3.0 will behave identically if the timing lines up.

I have a patch for 2.1.6 that adds these two failures to the existing retry loop in connect_and_send_attrd_request(). Looking at the possible errors from the code paths, I think it's likely only worth retrying for the error codes I mentioned. The issue I have is that the refactoring done since 2.1.6 has pushed the retry loops for EAGAIN and EALREADY down the stack, and just adding retries to pcmk__connect_ipc() wouldn't be a good idea, since the daemons call it to check for existing instances of themselves and decidedly don't want to retry on ECONNREFUSED. Similarly, I'm not sure about the implications of adding retries to pcmk__send_ipc_request().

I'd like to get a fix for this upstream: even though the consequences are much less of a problem when fencing is enabled, this is still something pacemaker is supposed to be able to handle without going down. Is the proper fix to add retries on ENOTCONN and ECONNREFUSED to every call site of connect_and_send_attrd_request() in one of the pacemaker daemons? To make that concrete, I've included a rough sketch of what I mean below, after the patch. I am willing to work on a fix myself, but I'm wondering what it should look like to get accepted.

The patch I have against 2.1.6 is attached below. Ideas for improvements to the fix for that version are also very welcome.

Thomas Jones
Software Developer
He/Him
IBM
From b8641166bd42e33e92a9de4ebc3953e4d34e863e Mon Sep 17 00:00:00 2001
From: Thomas Jones <thomas.jo...@ibm.com>
Date: Thu, 17 Apr 2025 17:51:39 -0400
Subject: [PATCH 1/5] fix possible fatal error if we get connection errors
 while attrd is being restarted by retrying.

---
 lib/common/ipc_attrd.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/common/ipc_attrd.c b/lib/common/ipc_attrd.c
index 7c40aa7d5c..0d861de617 100644
--- a/lib/common/ipc_attrd.c
+++ b/lib/common/ipc_attrd.c
@@ -177,16 +177,21 @@ connect_and_send_attrd_request(pcmk_ipc_api_t *api, xmlNode *request)
     while (max > 0) {
         crm_info("Connecting to cluster... %d retries remaining", max);
         rc = pcmk_connect_ipc(api, pcmk_ipc_dispatch_sync);
-
         if (rc == pcmk_rc_ok) {
             rc = pcmk__send_ipc_request(api, request);
+            if (rc != pcmk_rc_ok) {
+                crm_err("Request to attrd failed: %s", pcmk_rc_str(rc));
+            }
+            if (rc != ENOTCONN) {
+                break;
+            }
+        } else if (rc != EAGAIN && rc != EALREADY && rc != ECONNREFUSED) {
+            crm_err("Could not connect to attrd: %s", pcmk_rc_str(rc));
             break;
-        } else if (rc == EAGAIN || rc == EALREADY) {
+        }
+        if (max > 0) { // avoid sleeping if we won't retry
             sleep(5 - max);
             max--;
-        } else {
-            crm_err("Could not connect to attrd: %s", pcmk_rc_str(rc));
-            break;
         }
     }
 
-- 
2.43.5
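To make the call-site question above concrete, here is roughly the shape of retry I have in mind. This is only a sketch: none of these names exist in pacemaker, send_fn() stands in for whatever actually connects and sends the request, and the error codes are the plain errno values used by the 2.1.6 loop.

/* Sketch of a caller-side retry helper; all names here are made up. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Is this an error that should clear up once the daemon finishes restarting? */
static bool
rc_is_transient(int rc)
{
    switch (rc) {
        case EAGAIN:
        case EALREADY:
        case ECONNREFUSED:  /* daemon not yet accepting connections */
        case ENOTCONN:      /* connection dropped mid-request */
            return true;
        default:
            return false;
    }
}

/* Call send_fn(data), retrying transient failures with a short back-off. */
static int
send_with_retry(int (*send_fn)(void *data), void *data, int max_tries)
{
    int rc = 0;

    for (int tries = 0; tries < max_tries; tries++) {
        rc = send_fn(data);
        if (!rc_is_transient(rc)) {
            break;              /* success, or an error not worth retrying */
        }
        if (tries < max_tries - 1) {
            sleep(1U << tries); /* 1s, 2s, 4s, ... while the daemon restarts */
        }
    }
    return rc;
}

/* Stand-in for a real request, so the sketch compiles and runs on its own. */
static int
fake_request(void *data)
{
    int *calls = data;

    /* Fail twice with the error seen in the bug, then succeed. */
    return ((*calls)++ < 2) ? ECONNREFUSED : 0;
}

int
main(void)
{
    int calls = 0;
    int rc = send_with_retry(fake_request, &calls, 5);

    printf("final rc=%d after %d attempts\n", rc, calls);
    return rc;
}

The idea is just that the transient-vs-fatal classification lives in one helper and each caller decides how many attempts it is willing to make, rather than pushing retries down into pcmk__connect_ipc() or pcmk__send_ipc_request() themselves.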