On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote:
> Hi All,
>
> We gave test that assumed remote cluster environment.
> And we tested packet lost.

You may be interested in this patch, which I have had lying around for ages.

It may be incomplete for one corner case:

On a seriously misconfigured and overloaded system, I have seen reports
of a single send_local_status() call (that is basically one single
send_cluster_msg()) taking longer to execute than deadtime
(without even returning to the mainloop!).

This corner case should be handled with a watchdog.
But without a watchdog, and without stonith, the CCM was confused,
because one node saw a leave then re-join after partition event,
while the other node did not even notice it had left and rejoined
the membership... and pacemaker ended up being DC on both :-/

So I guess send_local_status() could do with an explicit call to
check_for_timeouts(), but that may need recursion protection.

I should really polish and push my queue some day soon...

Cheers,

diff --git a/heartbeat/hb_rexmit.c b/heartbeat/hb_rexmit.c
--- a/heartbeat/hb_rexmit.c
+++ b/heartbeat/hb_rexmit.c
@@ -168,6 +168,7 @@ send_rexmit_request( gpointer data)
 	if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
 	    STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
 		/* no point requesting rexmit from a dead node. */
+		g_hash_table_remove(rexmit_hash_table, ri);
 		return FALSE;
 	}
@@ -243,7 +244,7 @@ schedule_rexmit_request(struct node_info
 	ri->seq = seq;
 	ri->node = node;
-	sourceid = Gmain_timeout_add_full(G_PRIORITY_HIGH - 1, delay,
+	sourceid = Gmain_timeout_add_full(PRI_REXMIT, delay,
 			send_rexmit_request, ri, NULL);
 	G_main_setall_id(sourceid, "retransmit request", config->heartbeat_ms/2, 10);
diff --git a/heartbeat/heartbeat.c b/heartbeat/heartbeat.c
--- a/heartbeat/heartbeat.c
+++ b/heartbeat/heartbeat.c
@@ -1585,7 +1585,7 @@ master_control_process(void)
 	send_local_status();
-	if (G_main_add_input(G_PRIORITY_HIGH, FALSE,
+	if (G_main_add_input(PRI_POLL, FALSE,
 		&polled_input_SourceFuncs) ==NULL){
 		cl_log(LOG_ERR, "master_control_process: G_main_add_input failed");
 	}
diff --git a/include/hb_api_core.h b/include/hb_api_core.h
--- a/include/hb_api_core.h
+++ b/include/hb_api_core.h
@@ -40,6 +40,12 @@
 #define PRI_READPKT		(PRI_SENDPKT+1)
 #define PRI_FIFOMSG		(PRI_READPKT+1)

+/* PRI_POLL is where the timeout checks on deadtime happen.
+ * Better be sure rexmit requests for lost packets
+ * from a now dead node do not preempt detecting it as being dead. */
+#define PRI_POLL		(G_PRIORITY_HIGH)
+#define PRI_REXMIT		PRI_POLL
+
 #define PRI_CHECKSIGS		(G_PRIORITY_DEFAULT)
 #define PRI_FREEMSG		(PRI_CHECKSIGS+1)
 #define PRI_CLIENTMSG		(PRI_FREEMSG+1)

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
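
For illustration only, here is a minimal sketch of the recursion protection mentioned above for an explicit check_for_timeouts() call inside send_local_status(). It is not part of the patch and not heartbeat's actual code. The function bodies, the check_for_timeouts_guarded() wrapper and the two counters are hypothetical stand-ins. The sketch also assumes that the timeout handling can itself end up sending a status update, which is what would make a guard necessary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters, only here to make the behaviour observable. */
static int timeout_checks_run = 0;
static int nested_calls_skipped = 0;

static void send_local_status(void);

/*
 * Hypothetical wrapper around the deadtime check.
 * A static flag makes a nested call return immediately instead of
 * re-entering the timeout handling.
 */
static void check_for_timeouts_guarded(void)
{
	static bool in_check = false;

	if (in_check) {
		/* Already inside the check further up the call stack. */
		nested_calls_skipped++;
		return;
	}

	in_check = true;
	timeout_checks_run++;

	/*
	 * Assumption for this sketch: handling a timeout may trigger
	 * another status update, which calls back into this function.
	 */
	send_local_status();

	/* The guard must still be held when we get back here. */
	assert(in_check);
	in_check = false;
}

/* Stand-in for the real function; the actual message send is omitted. */
static void send_local_status(void)
{
	/* ... the real send_cluster_msg() call would go here ... */
	check_for_timeouts_guarded();
}
```

With this guard, one top-level send_local_status() runs the timeout check exactly once, and the nested call made from inside the check is skipped. A static flag like this only makes sense for a single-threaded mainloop; it gives no protection across threads or signal handlers.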