Hi, I'm resending this patch in a separate thread because I think this part of the cluster formation problems I'm seeing has been overlooked. The patch attached is one way of addressing the problem, but I'm open to alternatives.
Basically the problem is that if the cluster experiences formation problems, then CPG can sometimes choose a downlist that includes the local node. When it processes the node leave event for itself it sets its cpd state to CPD_STATE_UNJOINED and clears the cpd->group_name. This means CPG events are no longer sent to the CPG client, because the cpd->group_name no longer matches.

This patch avoids the problem by only clearing the group_name if cpg_leave() is called and not when processing a downlist leave event. I'm not 100% sure about the case where the CPG client exits unexpectedly (in which case the reason is also CONFCHG_CPG_REASON_PROCDOWN), but I figure the cpd info gets cleaned up immediately on the local node if this happens.

Regards,
Tim

---
 services/cpg.c | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/services/cpg.c b/services/cpg.c
index 8e71dcf..c66037b 100644
--- a/services/cpg.c
+++ b/services/cpg.c
@@ -683,7 +683,8 @@ static int notify_lib_joinlist(
     }
     if (left_list_entries) {
         if (left_list[0].pid == cpd->pid &&
-            left_list[0].nodeid == api->totem_nodeid_get()) {
+            left_list[0].nodeid == api->totem_nodeid_get() &&
+            left_list[0].reason == CONFCHG_CPG_REASON_LEAVE) {
             cpd->pid = 0;
             memset (&cpd->group_name, 0, sizeof(cpd->group_name));
From: Tim Beale <tim.be...@alliedtelesis.co.nz>

A CPG client can sometimes lock up if the local node is in the downlist

In a 10-node cluster where all nodes are booting up and starting corosync at the same time, corosync sometimes detects a node as leaving and rejoining the cluster. Occasionally the downlist that gets picked contains the local node. When the local node sends leave events for the downlist (including itself), it sets its cpd state to CPD_STATE_UNJOINED and clears the cpd->group_name. This means it no longer sends CPG events to the CPG client.
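
To make the intent of the extra condition easier to review, here is a minimal standalone sketch of the check in isolation. It is not the corosync code: the struct layouts, the `sketch_` names and the numeric reason values are all hypothetical stand-ins (for cpd, the left list entries, CONFCHG_CPG_REASON_LEAVE and CONFCHG_CPG_REASON_PROCDOWN), and the local node id is passed in as a parameter instead of coming from api->totem_nodeid_get().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for CONFCHG_CPG_REASON_LEAVE / _PROCDOWN.
 * The numeric values are illustrative only. */
enum sketch_reason {
    SKETCH_REASON_LEAVE = 1,
    SKETCH_REASON_PROCDOWN = 2
};

/* Simplified left-list entry: only the fields the check reads. */
struct sketch_left_entry {
    uint32_t nodeid;
    uint32_t pid;
    enum sketch_reason reason;
};

/* Simplified per-client state; the real cpd holds much more. */
struct sketch_cpd {
    uint32_t pid;
    char group_name[32];
};

/*
 * Clear the client's group only when the first left-list entry is this
 * client on this node AND the reason is an explicit leave.
 * Returns 1 if the state was cleared, 0 if it was left untouched.
 */
static int sketch_handle_left(struct sketch_cpd *cpd,
                              const struct sketch_left_entry *left_list,
                              size_t left_list_entries,
                              uint32_t local_nodeid)
{
    assert(cpd != NULL);

    if (left_list_entries == 0) {
        return 0;
    }
    assert(left_list != NULL);

    if (left_list[0].pid == cpd->pid &&
        left_list[0].nodeid == local_nodeid &&
        left_list[0].reason == SKETCH_REASON_LEAVE) {
        cpd->pid = 0;
        memset(cpd->group_name, 0, sizeof(cpd->group_name));
        return 1;
    }
    return 0;
}
```

With this model, an entry for the local client whose reason is the PROCDOWN stand-in leaves pid and group_name intact, so the client would keep matching its group. Only an entry with the LEAVE stand-in clears them.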
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais