Hi Lars, Thank you for comments.
> > > > And when it passes more than a full day.... > > > > * node1 > > 32126 ? SLs 79:52 0 182 71189 24328 0.1 heartbeat: master > > control process > > > > * node2 > > 31928 ? SLs 77:01 0 182 70869 24008 0.1 heartbeat: master > > control process > > Oh, I see. > > This is a "design choice" (maybe not even intentional) of the Gmain_* > wrappers used throughout the heartbeat code. > > The "real" glib g_timeout_add_full(), and most other similar functions, > internally do > id = g_source_attach(source, ...); > g_source_unref(source); > return id; > > Thus in g_main_dispatch, the > need_destroy = ! dispatch (...) > if (need_destroy) > g_source_destroy_internal() > > in fact ends up destroying it, > if dispatch() returns FALSE, > as documented: > The function is called repeatedly until it returns FALSE, at > which point the timeout is automatically destroyed and the > function will not be called again. > > Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full. > It does not g_source_unref(), so we keep the extra reference around > until someone explicitly calls Gmain_timeout_remove(). > > Talk about principle of least surprise :( > > Changing this behaviour to match glib's, i.e. unref'ing after > g_source_attach, would seem like the "correct" thing to do, > but is likely to break other pieces of code in subtle ways, > so it may not be the "right" thing to do at this point. Thank you for detailed explanation. If you found the method that is appropriate than the correction that I suggested, I approve of it. > I'm going to take your patch more or less as is. > If it does not show up soon, prod me again. > All right. Many Thanks! Hideo Yamauchi. > Thank you for tracking this down. > > > Best Regards, > > Hideo Yamauchi. > > > > > > --- On Tue, 2012/5/1, renayama19661...@ybb.ne.jp > > <renayama19661...@ybb.ne.jp> wrote: > > > > > Hi Lars, > > > > > > We confirmed that this problem occurred with v1 mode of Heartbeat. > > > * The problem happens with the v2 mode in the same way. > > > > > > We confirmed a problem in the next procedure. > > > > > > Step 1) Put a special device extinguishing a communication packet of > > > Heartbeat in the network. > > > > > > Step 2) Between nodes, the retransmission of the message is carried out > > > repeatedly. > > > > > > Step 3) Then the memory of the master process increases little by little. > > > > > > > > > -------- As a result of the ps command of the master process ---------- > > > * node1 > > > (start) > > > 32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: > > > master control process > > > (One hour later) > > > 32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: > > > master control process > > > (Two hour later) > > > 32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: > > > master control process > > > (Four hours later) > > > 32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: > > > master control process > > > > > > * node2 > > > (start) > > > 31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: > > > master control process > > > (One hour later) > > > 31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: > > > master control process > > > (Two hour later) > > > 31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: > > > master control process > > > (Four hours later) > > > 31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: > > > master control process > > > > > > > > > The state of the memory leak seems to vary according to a node with the > > > quantity of the retransmission. > > > > > > The increase of this memory disappears by applying my patch. > > > > > > And the similar correspondence seems to be necessary in > > > send_reqnodes_msg(), but this is like little leak. > > > > > > Best Regards, > > > Hideo Yamauchi. > > > > > > > > > --- On Sat, 2012/4/28, renayama19661...@ybb.ne.jp > > > <renayama19661...@ybb.ne.jp> wrote: > > > > > > > Hi Lars, > > > > > > > > Thank you for comments. > > > > > > > > > Have you actually been able to measure that memory leak you observed, > > > > > and you can confirm this patch will fix it? > > > > > > > > > > Because I don't think this patch has any effect. > > > > > > > > Yes. > > > > I really measured leak. > > > > I can show a result next week. > > > > #Japan is a holiday until Tuesday. > > > > > > > > > > > > > > send_rexmit_request() is only used as paramter to > > > > > Gmain_timeout_add_full, and it returns FALSE always, > > > > > which should cause the respective sourceid to be auto-removed. > > > > > > > > It seems to be necessary to release gsource somehow or other. > > > > The similar liberation seems to be carried out in lrmd. > > > > > > > > Best Regards, > > > > Hideo Yamauchi. > > > > > > > > > > > > --- On Fri, 2012/4/27, Lars Ellenberg <lars.ellenb...@linbit.com> wrote: > > > > > > > > > On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp > > > > > wrote: > > > > > > Hi All, > > > > > > > > > > > > We gave test that assumed remote cluster environment. > > > > > > And we tested packet lost. > > > > > > > > > > > > The retransmission timer of Heartbeat causes memory leak. > > > > > > > > > > > > I donate a patch. > > > > > > Please confirm the contents of the patch. > > > > > > And please reflect a patch in a repository of Heartbeat. > > > > > > > > > > Have you actually been able to measure that memory leak you observed, > > > > > and you can confirm this patch will fix it? > > > > > > > > > > Because I don't think this patch has any effect. > > > > > > > > > > send_rexmit_request() is only used as paramter to > > > > > Gmain_timeout_add_full, and it returns FALSE always, > > > > > which should cause the respective sourceid to be auto-removed. > > > > > > > > > > > > > > > > diff -r 106ca984041b heartbeat/hb_rexmit.c > > > > > > --- a/heartbeat/hb_rexmit.c Thu Apr 26 19:28:26 2012 +0900 > > > > > > +++ b/heartbeat/hb_rexmit.c Thu Apr 26 19:31:44 2012 +0900 > > > > > > @@ -164,6 +164,8 @@ > > > > > > seqno_t seq = (seqno_t) ri->seq; > > > > > > struct node_info* node = ri->node; > > > > > > struct ha_msg* hmsg; > > > > > > + unsigned long sourceid; > > > > > > + gpointer value; > > > > > > > > > > > > if (STRNCMP_CONST(node->status, UPSTATUS) != 0 && > > > > > > STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) { > > > > > > @@ -196,11 +198,17 @@ > > > > > > > > > > > > node->track.last_rexmit_req = time_longclock(); > > > > > > > > > > > > - if (!g_hash_table_remove(rexmit_hash_table, ri)){ > > > > > > - cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table" > > > > > > - "for seq/node(%ld %s)", > > > > > > - __FUNCTION__, ri->seq, ri->node->nodename); > > > > > > - return FALSE; > > > > > > + value = g_hash_table_lookup(rexmit_hash_table, ri); > > > > > > + if ( value != NULL) { > > > > > > + sourceid = (unsigned long) value; > > > > > > + Gmain_timeout_remove(sourceid); > > > > > > + > > > > > > + if (!g_hash_table_remove(rexmit_hash_table, ri)){ > > > > > > + cl_log(LOG_ERR, "%s: entry not found in > > > > > > rexmit_hash_table" > > > > > > + "for seq/node(%ld %s)", > > > > > > + __FUNCTION__, ri->seq, ri->node->nodename); > > > > > > + return FALSE; > > > > > > + } > > > > > > } > > > > > > > > > > > > schedule_rexmit_request(node, seq, max_rexmit_delay); > > -- > : Lars Ellenberg > : LINBIT | Your Way to High Availability > : DRBD/HA support and consulting http://www.linbit.com > > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. > _______________________________________________________ > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev > Home Page: http://linux-ha.org/ > _______________________________________________________ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/