Hi Steve,

Thanks for your help. I tried out your patch, but the problem still
occurs. It looks to me like the problem is with the ring-IDs used when
forming the transitional memb-list, rather than with the memb-list
itself. The ring-ID of the nodes still in Recovery is older than that
of the rest of the nodes, which have already shifted to Operational.

Attached is my attempt at fixing the problem. The idea is that,
immediately after shifting to Operational, a node delays processing
any Memb-Join until the token has rotated around the ring once.
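
In case it helps to see the intent without wading through the diff,
here's a toy model of the behaviour I'm after (just an illustration,
not corosync code; the struct and function names below are made up,
though the operational_firstpass flag matches the attached patch):

/* Toy model: a node ignores Memb-Join messages from unknown senders
 * until the token has completed one full rotation after the node
 * entered Operational, so no peer can still be sitting in Recovery
 * with the old ring ID. */
#include <stdio.h>

struct node {
	int operational_firstpass;	/* set on entering Operational */
};

static void operational_enter(struct node *n)
{
	n->operational_firstpass = 1;
}

static void orf_token_received(struct node *n)
{
	/* the token has come back around to us while Operational */
	n->operational_firstpass = 0;
}

static void memb_join_received(struct node *n, int sender_known)
{
	if (n->operational_firstpass && !sender_known) {
		printf("ignoring join: token has not rotated yet\n");
		return;
	}
	printf("processing join\n");
}

int main(void)
{
	struct node n;

	operational_enter(&n);
	memb_join_received(&n, 0);	/* ignored */
	orf_token_received(&n);
	memb_join_received(&n, 0);	/* processed */
	return 0;
}

The attached patch does the equivalent inside
message_handler_memb_join(), and uses the known_memb_reforming_ring
check so that already-known members can still reform the ring after
token loss.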

It doesn't quite work either, though. The nodes are still re-entering
gather before all of them have left Recovery; this time it's due to
processing a Merge-Detect message. One node has only just started up
and set itself as the rep, and it sends out a Merge-Detect, which
triggers the other nodes to enter gather and reform the ring.

Let me know if you have any other advice.

Thanks,
Tim

On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake <sd...@redhat.com> wrote:
> On 08/03/2011 10:32 PM, Tim Beale wrote:
>> Hi,
>>
>> It looks to me that, given the way the transition from Recovery to
>> Operational works, we can't guarantee that all nodes in the ring have
>> entered Operational before a node processes another Memb-Join message
>> from a new node. E.g. we can't guarantee the token has rotated right
>> the way around the ring.
>>
>> When this happens, the nodes still in Recovery will be using the
>> older ring ID. So they won't get added to the transitional
>> membership, and CLM will report leave events for these nodes. (Plus
>> there might be other side-effects, like the FAILED TO RECEIVE problem
>> - I haven't quite worked out why that's happening.)
>>
>
> Thanks for the pointer here - patch on ml.
>
>> We are currently using CLM to check the health of a node, i.e. so we
>> can detect if it locks up. My questions are:
>> i) Are there config settings we could change to improve this, like
>> increasing the 'join' timeout? (See the example snippet below for the
>> sort of setting I mean.)
>> ii) Should I try to make a code change to fix the problem? E.g. delay
>> processing the Memb-Join message if the node has only just entered
>> Operational.
>> iii) Should we not be using CLM like this? I.e. should we just learn
>> to live with CLM/CPG sometimes reporting nodes as leaving when
>> they're perfectly healthy.
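>>
>> For (i), the setting I have in mind is the totem join timeout in
>> corosync.conf. Something like this (the value is only an
>> illustration, not a tested recommendation):
>>
>>     totem {
>>             # ...existing totem settings...
>>             # 'join' is the time in ms to wait for join messages in
>>             # the membership protocol; the default is 50
>>             join: 100
>>     }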
>>
>> Thanks for your help.
>> Tim
>>
>
> Tim, please try the patch I have recently posted:
> [PATCH] Set my_new_memb_list in recovery enter
>
> First and foremost, let me know if it resolves your 10-node startup
> case, which fails 10% of the time. Then let me know if it treats other
> symptoms.
>
> Regards
> -steve
>
>
>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <tlbe...@gmail.com> wrote:
>>> Hi,
>>>
>>> We're booting up a 10-node cluster (with all nodes starting corosync
>>> at roughly the same time) and approx 1 in 10 times we see some
>>> problems:
>>> a) CLM is reporting nodes as leaving and then immediately rejoining
>>> (not sure if this is valid behaviour?)
>>> b) Probably an unrelated oddity, but we're getting flow control
>>> enabled on a client daemon using CLM that's only sending one request
>>> (saClmClusterTrack()).
>>> c) A node is hitting the FAILED TO RECEIVE case
>>> d) After c) there seems to be a lot of churn as the cluster tries to
>>> reform
>>> e) During the processing of node leave events, the CPG client can
>>> sometimes get broken so it no longer processes *any* CPG events
>>>
>>> Corosync debug is attached (I commented out some of the noisier
>>> debug around message delivery). We don't really know enough about
>>> corosync to tell what exactly is incorrect behaviour and what should
>>> be fixed. But here's what we've noticed:
>>> 1) Node-4 joins soon after node-1. When this happens all nodes
>>> except node-12 have entered operational state (see node-12.txt line
>>> 235). It looks like maybe node-12 hasn't received enough rotations
>>> of the token to enter operational yet. Node-12's resulting
>>> transitional config consists of just itself. All nodes then report
>>> node-1 and node-12 as leaving and immediately rejoining.
>>> 2) After this config change, node-3 eventually hits the FAILED TO
>>> RECEIVE case (node-3.txt line 380). At this point node-1 and node-12
>>> have an ARU matching the high_seq_received; all other nodes have an
>>> ARU of zero.
>>> 3) Node-3 entering gather seems to result in a lot of config change
>>> churn across the cluster.
>>> 4) While processing the config changes on node-3, the CPG downlist
>>> it uses contains itself. When node-3 sends leave events for the
>>> nodes in the downlist (including itself), it sets its own cpd state
>>> to CPD_STATE_UNJOINED and clears the cpd->group_name. This means it
>>> no longer sends any CPG events to the CPG client.
>>>
>>> We tried cherry-picking this commit to fix the problem (#4) with
>>> the CPG client:
>>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>>> It helped a bit, but didn't fix it completely. We've made an interim
>>> change (attached) to avoid this problem.
>>>
>>> We're using corosync v1.3.1 on an embedded linux system (with a
>>> low-spec CPU). Corosync is running over a basic ethernet interface
>>> (no hubs/routers/etc).
>>>
>>> Any help would be appreciated. Let me know if there's any other
>>> debug I can provide.
>>>
>>> Thanks,
>>> Tim
>>>
>
>
diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 14625b1..c4dd571 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -505,6 +505,8 @@ struct totemsrp_instance {
 	totemsrp_stats_t stats;
 
 	uint32_t orf_token_discard;
+
+	uint32_t operational_firstpass;
 	
 	void * token_recv_event_handle;
 	void * token_sent_event_handle;
@@ -671,6 +673,8 @@ static void totemsrp_instance_initialize (struct totemsrp_instance *instance)
 
 	instance->orf_token_discard = 0;
 
+	instance->operational_firstpass = 0;
+
 	instance->commit_token = (struct memb_commit_token *)instance->commit_token_storage;
 }
 
@@ -1825,6 +1829,11 @@ static void memb_state_operational_enter (struct totemsrp_instance *instance)
 	instance->stats.operational_entered++;
 	instance->my_received_flg = 1;
 
+	/* Avoid processing Memb-Join msgs on the first token rotation after
+	 * entering Operational to ensure all nodes in the current ring have shifted
+	 * from Recovery to Operational. */
+	instance->operational_firstpass = 1;
+
 	reset_pause_timeout (instance);
 
 	/*
@@ -3609,6 +3618,7 @@ static int message_handler_orf_token (
 		break;
 
 	case MEMB_STATE_OPERATIONAL:
+		instance->operational_firstpass = 0;
 		messages_free (instance, token->aru);
 		/*
 		 * Do NOT add break, this case should also execute code in gather case.
@@ -4329,6 +4339,7 @@ static int message_handler_memb_join (
 {
 	const struct memb_join *memb_join;
 	struct memb_join *memb_join_convert = alloca (msg_len);
+	uint32_t known_memb_reforming_ring = 0;
 
 	if (endian_conversion_needed) {
 		memb_join = memb_join_convert;
@@ -4349,9 +4360,31 @@ static int message_handler_memb_join (
 	if (instance->token_ring_id_seq < memb_join->ring_seq) {
 		instance->token_ring_id_seq = memb_join->ring_seq;
 	}
+
+	/* Ignore Memb-Joins from new/unknown nodes (and retransmits) while we're
+	 * still in the process of forming a new ring. However, if the sender is
+	 * already known to us, it may try to reform the ring after token loss */
+	if (memb_set_subset (&memb_join->system_from,
+		1,
+		instance->my_new_memb_list,
+		instance->my_new_memb_entries) &&
+
+		memb_join->ring_seq >= instance->my_ring_id.seq) {
+
+		known_memb_reforming_ring = 1;
+	}
+
 	switch (instance->memb_state) {
 		case MEMB_STATE_OPERATIONAL:
-			memb_join_process (instance, memb_join);
+			/* ignore new joins for the first pass of the token after entering
+			 * Operational. This is to ensure no nodes are still in Recovery */
+			if (!instance->operational_firstpass ||
+				known_memb_reforming_ring) {
+				memb_join_process (instance, memb_join);
+			} else {
+				log_printf (instance->totemsrp_log_level_debug,
+					"### Ignoring Memb-Join on first-pass of operational token\n");
+			}
 			break;
 
 		case MEMB_STATE_GATHER:
@@ -4359,12 +4392,7 @@ static int message_handler_memb_join (
 			break;
 
 		case MEMB_STATE_COMMIT:
-			if (memb_set_subset (&memb_join->system_from,
-				1,
-				instance->my_new_memb_list,
-				instance->my_new_memb_entries) &&
-
-				memb_join->ring_seq >= instance->my_ring_id.seq) {
+			if (known_memb_reforming_ring) {
 
 				memb_join_process (instance, memb_join);
 				memb_state_gather_enter (instance, 13);
@@ -4372,12 +4400,7 @@ static int message_handler_memb_join (
 			break;
 
 		case MEMB_STATE_RECOVERY:
-			if (memb_set_subset (&memb_join->system_from,
-				1,
-				instance->my_new_memb_list,
-				instance->my_new_memb_entries) &&
-
-				memb_join->ring_seq >= instance->my_ring_id.seq) {
+			if (known_memb_reforming_ring) {
 
 				memb_join_process (instance, memb_join);
 				memb_recovery_state_token_loss (instance);
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
