On Sat, Apr 28, 2012 at 12:11 AM, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote:
>> Hi All,
>>
>> We ran a test that assumed a remote cluster environment.
>> And we tested packet loss.
>
> You may be interested in this patch I have had lying around for ages.
>
> It may be incomplete for one corner case:
> On a seriously misconfigured and overloaded system,
> I have seen reports of a single send_local_status()
> (that is basically one single send_cluster_msg())
> which took longer to execute than deadtime
> (without even returning to the mainloop!).
>
> This corner case should be handled with a watchdog.
> But without a watchdog, and without stonith,
> the CCM was confused, because one node saw a
> leave then re-join after partition event, while the other node did not
> even notice it had left and rejoined the membership...
> and pacemaker ended up being DC on both :-/

A side effect of the ccm being "really confused" I assume?

>
> So I guess send_local_status() could do with an explicit call to
> check_for_timeouts(), but that may need recursion protection.
>
>
> I should really polish and push my queue some day soon...
>
> Cheers,
>
>
> diff --git a/heartbeat/hb_rexmit.c b/heartbeat/hb_rexmit.c
> --- a/heartbeat/hb_rexmit.c
> +++ b/heartbeat/hb_rexmit.c
> @@ -168,6 +168,7 @@ send_rexmit_request( gpointer data)
>         if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
>             STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
>                 /* no point requesting rexmit from a dead node. */
> +               g_hash_table_remove(rexmit_hash_table, ri);
>                 return FALSE;
>         }
>
> @@ -243,7 +244,7 @@ schedule_rexmit_request(struct node_info
>         ri->seq = seq;
>         ri->node = node;
>
> -       sourceid = Gmain_timeout_add_full(G_PRIORITY_HIGH - 1, delay,
> +       sourceid = Gmain_timeout_add_full(PRI_REXMIT, delay,
>                         send_rexmit_request, ri, NULL);
>         G_main_setall_id(sourceid, "retransmit request",
>                         config->heartbeat_ms/2, 10);
>
> diff --git a/heartbeat/heartbeat.c b/heartbeat/heartbeat.c
> --- a/heartbeat/heartbeat.c
> +++ b/heartbeat/heartbeat.c
> @@ -1585,7 +1585,7 @@ master_control_process(void)
>
>         send_local_status();
>
> -       if (G_main_add_input(G_PRIORITY_HIGH, FALSE,
> +       if (G_main_add_input(PRI_POLL, FALSE,
>                         &polled_input_SourceFuncs) ==NULL){
>                 cl_log(LOG_ERR, "master_control_process: G_main_add_input failed");
>         }
> diff --git a/include/hb_api_core.h b/include/hb_api_core.h
> --- a/include/hb_api_core.h
> +++ b/include/hb_api_core.h
> @@ -40,6 +40,12 @@
>  #define PRI_READPKT (PRI_SENDPKT+1)
>  #define PRI_FIFOMSG (PRI_READPKT+1)
>
> +/* PRI_POLL is where the timeout checks on deadtime happen.
> + * Better be sure rexmit requests for lost packets
> + * from a now dead node do not preempt detecting it as being dead. */
> +#define PRI_POLL (G_PRIORITY_HIGH)
> +#define PRI_REXMIT PRI_POLL
> +
>  #define PRI_CHECKSIGS (G_PRIORITY_DEFAULT)
>  #define PRI_FREEMSG (PRI_CHECKSIGS+1)
>  #define PRI_CLIENTMSG (PRI_FREEMSG+1)
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
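
For what it is worth, the "recursion protection" mentioned in the quoted text could be as simple as a re-entrancy flag. The sketch below is only an illustration under assumptions, not the heartbeat code: send_local_status_stub() and check_for_timeouts_guarded() are hypothetical stand-ins for send_local_status() and check_for_timeouts(), and the counters exist only to make the behaviour observable. It assumes the timeout check can itself end up sending a status message, which is the path that would recurse.

```c
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real heartbeat functions. */
static int in_timeout_check = 0;  /* set while a timeout check is running */
static int checks_run = 0;        /* timeout checks that actually executed */
static int checks_skipped = 0;    /* nested calls refused by the guard */

static void send_local_status_stub(void);

/* Run the timeout check unless one is already in progress. */
static void check_for_timeouts_guarded(void)
{
    if (in_timeout_check) {
        /* Already inside a check further up the call stack: do nothing. */
        checks_skipped++;
        return;
    }
    in_timeout_check = 1;
    checks_run++;
    /* Assumption for this sketch: handling a timeout sends our status,
     * which calls back into the timeout check. */
    send_local_status_stub();
    in_timeout_check = 0;
}

/* Stand-in for the status send that wants an explicit timeout check. */
static void send_local_status_stub(void)
{
    /* ... the (possibly slow) cluster message send would happen here ... */
    check_for_timeouts_guarded();
}
```

Calling send_local_status_stub() once runs a single timeout check; the nested call made from inside that check is refused by the flag, and the flag is cleared again on the way out. A plain static flag like this is only adequate if all callers run in one thread of one process, which is an assumption here.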