Hi,

I'm resending this patch in a separate thread because I think this aspect of the
cluster formation problems I'm seeing has been overlooked. The attached patch is
one way of addressing the problem, but I'm open to alternatives.

Basically, the problem is that if the cluster experiences formation problems,
CPG can sometimes choose a downlist that includes the local node. When the local
node processes the leave event for itself, it sets its cpd state to
CPD_STATE_UNJOINED and clears cpd->group_name. From that point on, CPG events
are no longer sent to the CPG client, because cpd->group_name no longer matches
the group the events are for.
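
To make the failure mode concrete, here's a small standalone sketch of the
gating logic. This is not the real corosync code — the struct layout and names
below are simplified for illustration — but the compare-then-deliver shape
follows notify_lib_joinlist() (the real code compares with mar_name_compare()):

#include <stdio.h>
#include <string.h>

struct group_name { unsigned int length; char value[128]; };

/* Simplified stand-in for the per-client cpd state in services/cpg.c */
struct cpg_pd {
        unsigned int pid;
        struct group_name group_name;
};

/* Delivery is gated on the client's recorded group matching the
 * group the event is for. */
static int should_deliver(const struct cpg_pd *cpd,
                          const struct group_name *group)
{
        return cpd->group_name.length == group->length &&
               memcmp(cpd->group_name.value, group->value,
                      group->length) == 0;
}

int main(void)
{
        struct group_name group = { 3, "foo" };
        struct cpg_pd cpd = { 1234, { 3, "foo" } };

        printf("before self-leave: deliver=%d\n",
               should_deliver(&cpd, &group));

        /* What happens when the local node processes its own downlist
         * entry: the join state is torn down even though the client
         * never called cpg_leave(). */
        cpd.pid = 0;
        memset(&cpd.group_name, 0, sizeof(cpd.group_name));

        printf("after self-leave:  deliver=%d\n",
               should_deliver(&cpd, &group));
        return 0;
}

After the memset, the compare can never succeed again, so the client is
silently cut off from all further CPG events.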

This patch avoids the problem by only clearing the group_name when cpg_leave()
is called, and not when processing a downlist leave event. I'm not 100% sure
about the case where the CPG client exits unexpectedly (in which case the reason
is CONFCHG_CPG_REASON_PROCDOWN, which likewise won't match the new check), but I
figure the cpd info gets cleaned up immediately on the local node if that
happens anyway.
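
In plain C, the intent of the one-line change is roughly the following. This is
a paraphrase, not the exact code, and the reason constants other than the two
mentioned in this mail are quoted from memory of services/cpg.c, so treat them
as illustrative:

#include <string.h>

/* Reason codes carried in a left_list entry. */
enum confchg_cpg_reason {
        CONFCHG_CPG_REASON_JOIN     = 1,
        CONFCHG_CPG_REASON_LEAVE    = 2,
        CONFCHG_CPG_REASON_NODEDOWN = 3,
        CONFCHG_CPG_REASON_NODEUP   = 4,
        CONFCHG_CPG_REASON_PROCDOWN = 5,
};

struct left_entry { unsigned int pid, nodeid, reason; };   /* simplified */
struct cpg_pd { unsigned int pid; char group_name[128]; }; /* simplified */

/* After the patch: tear down the client's join state only when the
 * client itself called cpg_leave(); a downlist entry for the local
 * node, or an unexpected client exit (PROCDOWN), no longer does it. */
static void handle_self_leave(struct cpg_pd *cpd,
                              const struct left_entry *left,
                              unsigned int local_nodeid)
{
        if (left->pid == cpd->pid &&
            left->nodeid == local_nodeid &&
            left->reason == CONFCHG_CPG_REASON_LEAVE) {
                cpd->pid = 0;
                memset(cpd->group_name, 0, sizeof(cpd->group_name));
        }
}

With this in place, only an explicit cpg_leave() unjoins the client, which is
what the one-line diff in the patch below does.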

Regards,
Tim

From: Tim Beale <tim.be...@alliedtelesis.co.nz>

A CPG client can sometimes lock up if the local node is in the downlist

In a 10-node cluster where all the nodes boot up and start corosync at the
same time, corosync sometimes detects a node as leaving and rejoining the
cluster during this process.

Occasionally the downlist that gets picked contains the local node. When the
local node sends leave events for the downlist (including one for itself), it
sets its cpd state to CPD_STATE_UNJOINED and clears cpd->group_name. This
means it no longer sends CPG events to the CPG client.

---

 services/cpg.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/services/cpg.c b/services/cpg.c
index 8e71dcf..c66037b 100644
--- a/services/cpg.c
+++ b/services/cpg.c
@@ -683,7 +683,8 @@ static int notify_lib_joinlist(
 				}
 				if (left_list_entries) {
 					if (left_list[0].pid == cpd->pid &&
-						left_list[0].nodeid == api->totem_nodeid_get()) {
+						left_list[0].nodeid == api->totem_nodeid_get() &&
+						left_list[0].reason == CONFCHG_CPG_REASON_LEAVE) {
 
 						cpd->pid = 0;
 						memset (&cpd->group_name, 0, sizeof(cpd->group_name));