I have seen an issue where the following sequence of events happens:

1. pacemaker-attrd becomes unresponsive to IPC for some reason.
2. pacemakerd detects this, kills it, and restarts it.
3. pacemaker-controld attempts to connect to pacemaker-attrd before it has finished starting, gets ECONNREFUSED, treats this as a fatal error, and exits (a small standalone demo of this condition follows the list).
4. pacemakerd reaps the exit status of controld and starts shutting down.
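To show the condition in step 3 in isolation, here is a minimal standalone demo. It uses plain POSIX sockets rather than pacemaker or libqb code, the socket path is made up, and error checks are omitted for brevity; it only illustrates the generic situation of a client connecting while the daemon's listener is gone and not yet back, which fails immediately with ECONNREFUSED.

/*
 * Demo only, not pacemaker code: connecting to a socket whose listener
 * has gone away fails with ECONNREFUSED -- the same transient condition
 * a client hits in the window while a daemon is being restarted.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/tmp/demo-attrd.sock";   /* made-up path */
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int listener, client;

    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);

    /* Stand in for the old daemon: create the socket, then "die". */
    listener = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(listener, (struct sockaddr *) &addr, sizeof(addr));
    listen(listener, 1);
    close(listener);            /* daemon killed; socket file left behind */

    /* Stand in for the client connecting before the restart completes. */
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (connect(client, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        /* Prints "connect failed: Connection refused" */
        printf("connect failed: %s\n", strerror(errno));
    }
    close(client);
    unlink(path);
    return 0;
}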
Fencing was not configured, so this resulted in losing automation on the host; but either way, this seems like something that pacemaker should handle. While reproducing the issue by forcing attrd to be unresponsive with SIGSTOP, I have also observed it failing in a slightly different way:

1. pacemaker-attrd becomes unresponsive to IPC for some reason.
2. pacemaker-controld connects to attrd and sends it a request.
3. pacemakerd kills attrd and restarts it.
4. pacemaker-controld times out waiting for a response, receives ENOTCONN, treats that as a fatal error, and exits.
5. pacemakerd reaps the exit status of controld and starts shutting down.

I saw this in pacemaker 2.1.6, but from reading the code I believe pacemaker 3.0 will behave identically if the timing lines up.

I have a patch for 2.1.6 that adds these two failures to the existing retry loop in connect_and_send_attrd_request(). Looking at the possible errors from the code paths, I think it's likely only worth retrying for the error codes I mentioned. The issue I have is that the refactoring done since 2.1.6 has pushed the retry loops for EAGAIN and EALREADY down the stack, and just adding retries to pcmk__connect_ipc() wouldn't be a good idea, since the daemons call it to check for existing instances of themselves and decidedly don't want to retry on ECONNREFUSED. Similarly, I'm not sure about the implications of adding retries to pcmk__send_ipc_request().

I'd like to get a fix for this upstream: even though the consequences are much less of a problem when fencing is enabled, this is still something pacemaker is supposed to be able to handle without going down. Is the proper fix to add retries on ENOTCONN and ECONNREFUSED to every call site of connect_and_send_attrd_request() in one of the pacemaker daemons? To make that concrete, I've included a rough sketch of what I mean below, after the patch. I am willing to work on a fix myself, but I'm wondering what it should look like to get accepted.

The patch I have against 2.1.6 is attached below. Ideas for improvements to the fix for that version are also very welcome.

Thomas Jones
Software Developer
He/Him
IBM
From b8641166bd42e33e92a9de4ebc3953e4d34e863e Mon Sep 17 00:00:00 2001
From: Thomas Jones <thomas.jo...@ibm.com>
Date: Thu, 17 Apr 2025 17:51:39 -0400
Subject: [PATCH 1/5] fix possible fatal error if we get connection errors
 while attrd is being restarted by retrying.

---
 lib/common/ipc_attrd.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/common/ipc_attrd.c b/lib/common/ipc_attrd.c
index 7c40aa7d5c..0d861de617 100644
--- a/lib/common/ipc_attrd.c
+++ b/lib/common/ipc_attrd.c
@@ -177,16 +177,21 @@ connect_and_send_attrd_request(pcmk_ipc_api_t *api, xmlNode *request)
     while (max > 0) {
         crm_info("Connecting to cluster... %d retries remaining", max);
         rc = pcmk_connect_ipc(api, pcmk_ipc_dispatch_sync);
-
         if (rc == pcmk_rc_ok) {
             rc = pcmk__send_ipc_request(api, request);
+            if (rc != pcmk_rc_ok) {
+                crm_err("Request to attrd failed: %s", pcmk_rc_str(rc));
+            }
+            if (rc != ENOTCONN) {
+                break;
+            }
+        } else if (rc != EAGAIN && rc != EALREADY && rc != ECONNREFUSED) {
+            crm_err("Could not connect to attrd: %s", pcmk_rc_str(rc));
             break;
-        } else if (rc == EAGAIN || rc == EALREADY) {
+        }
+        if (max > 0) { // avoid sleeping if we won't retry
             sleep(5 - max);
             max--;
-        } else {
-            crm_err("Could not connect to attrd: %s", pcmk_rc_str(rc));
-            break;
         }
     }
 
-- 
2.43.5
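To make the call-site question above concrete, here is roughly the shape of retry I have in mind. This is only a sketch: none of these names exist in pacemaker, send_fn() stands in for whatever actually connects and sends the request, and the error codes are the plain errno values used by the 2.1.6 loop.

/* Sketch of a caller-side retry helper; all names here are made up. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Is this an error that should clear up once the daemon finishes restarting? */
static bool
rc_is_transient(int rc)
{
    switch (rc) {
        case EAGAIN:
        case EALREADY:
        case ECONNREFUSED:  /* daemon not yet accepting connections */
        case ENOTCONN:      /* connection dropped mid-request */
            return true;
        default:
            return false;
    }
}

/* Call send_fn(data), retrying transient failures with a short back-off. */
static int
send_with_retry(int (*send_fn)(void *data), void *data, int max_tries)
{
    int rc = 0;

    for (int tries = 0; tries < max_tries; tries++) {
        rc = send_fn(data);
        if (!rc_is_transient(rc)) {
            break;              /* success, or an error not worth retrying */
        }
        if (tries < max_tries - 1) {
            sleep(1U << tries); /* 1s, 2s, 4s, ... while the daemon restarts */
        }
    }
    return rc;
}

/* Stand-in for a real request, so the sketch compiles and runs on its own. */
static int
fake_request(void *data)
{
    int *calls = data;

    /* Fail twice with the error seen in the bug, then succeed. */
    return ((*calls)++ < 2) ? ECONNREFUSED : 0;
}

int
main(void)
{
    int calls = 0;
    int rc = send_with_retry(fake_request, &calls, 5);

    printf("final rc=%d after %d attempts\n", rc, calls);
    return rc;
}

The idea is just that the transient-vs-fatal classification lives in one helper and each caller decides how many attempts it is willing to make, rather than pushing retries down into pcmk__connect_ipc() or pcmk__send_ipc_request() themselves.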