This is very interesting. My apologies for missing this memory leak :-(. The code logs memory usage periodically exactly to help notice such a thing.
In my new open source project [http://assimmon.org], I am death on memory leaks. But I can assure you that back when that code was written, it was not at all clear who deleted what memory when - when it came to the glib. I'm not sure if valgrind was out back then, but I certainly didn't know about it. I confess that even on this new project I had a heck of a time making all the glib objects go away. I finally got them cleaned up - but it took weeks of running under valgrind before I worked out when to do what to make it throw the objects away - but not crash due to a bad reference. By the way, I suspect Lars' suggestion would work fine. I would certainly explain what the "better" patch is in the comments when you apply this one. On 05/02/2012 04:57 PM, renayama19661...@ybb.ne.jp wrote: > Hi Lars, > > Thank you for comments. > >>> And when it passes more than a full day.... >>> >>> * node1 >>> 32126 ? SLs 79:52 0 182 71189 24328 0.1 heartbeat: master >>> control process >>> >>> * node2 >>> 31928 ? SLs 77:01 0 182 70869 24008 0.1 heartbeat: master >>> control process >> Oh, I see. >> >> This is a "design choice" (maybe not even intentional) of the Gmain_* >> wrappers used throughout the heartbeat code. >> >> The "real" glib g_timeout_add_full(), and most other similar functions, >> internally do >> id = g_source_attach(source, ...); >> g_source_unref(source); >> return id; >> >> Thus in g_main_dispatch, the >> need_destroy = ! dispatch (...) >> if (need_destroy) >> g_source_destroy_internal() >> >> in fact ends up destroying it, >> if dispatch() returns FALSE, >> as documented: >> The function is called repeatedly until it returns FALSE, at >> which point the timeout is automatically destroyed and the >> function will not be called again. >> >> Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full. >> It does not g_source_unref(), so we keep the extra reference around >> until someone explicitly calls Gmain_timeout_remove(). >> >> Talk about principle of least surprise :( >> >> Changing this behaviour to match glib's, i.e. unref'ing after >> g_source_attach, would seem like the "correct" thing to do, >> but is likely to break other pieces of code in subtle ways, >> so it may not be the "right" thing to do at this point. > Thank you for detailed explanation. > If you found the method that is appropriate than the correction that I > suggested, I approve of it. > >> I'm going to take your patch more or less as is. >> If it does not show up soon, prod me again. >> > All right. > > Many Thanks! > Hideo Yamauchi. > > >> Thank you for tracking this down. >> >>> Best Regards, >>> Hideo Yamauchi. >>> >>> >>> --- On Tue, 2012/5/1, >>> renayama19661...@ybb.ne.jp<renayama19661...@ybb.ne.jp> wrote: >>> >>>> Hi Lars, >>>> >>>> We confirmed that this problem occurred with v1 mode of Heartbeat. >>>> * The problem happens with the v2 mode in the same way. >>>> >>>> We confirmed a problem in the next procedure. >>>> >>>> Step 1) Put a special device extinguishing a communication packet of >>>> Heartbeat in the network. >>>> >>>> Step 2) Between nodes, the retransmission of the message is carried out >>>> repeatedly. >>>> >>>> Step 3) Then the memory of the master process increases little by little. >>>> >>>> >>>> -------- As a result of the ps command of the master process ---------- >>>> * node1 >>>> (start) >>>> 32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master >>>> control process >>>> (One hour later) >>>> 32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: master >>>> control process >>>> (Two hour later) >>>> 32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: master >>>> control process >>>> (Four hours later) >>>> 32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: master >>>> control process >>>> >>>> * node2 >>>> (start) >>>> 31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master >>>> control process >>>> (One hour later) >>>> 31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: master >>>> control process >>>> (Two hour later) >>>> 31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: master >>>> control process >>>> (Four hours later) >>>> 31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: master >>>> control process >>>> >>>> >>>> The state of the memory leak seems to vary according to a node with the >>>> quantity of the retransmission. >>>> >>>> The increase of this memory disappears by applying my patch. >>>> >>>> And the similar correspondence seems to be necessary in >>>> send_reqnodes_msg(), but this is like little leak. >>>> >>>> Best Regards, >>>> Hideo Yamauchi. >>>> >>>> >>>> --- On Sat, 2012/4/28, >>>> renayama19661...@ybb.ne.jp<renayama19661...@ybb.ne.jp> wrote: >>>> >>>>> Hi Lars, >>>>> >>>>> Thank you for comments. >>>>> >>>>>> Have you actually been able to measure that memory leak you observed, >>>>>> and you can confirm this patch will fix it? >>>>>> >>>>>> Because I don't think this patch has any effect. >>>>> Yes. >>>>> I really measured leak. >>>>> I can show a result next week. >>>>> #Japan is a holiday until Tuesday. >>>>> >>>>>> send_rexmit_request() is only used as paramter to >>>>>> Gmain_timeout_add_full, and it returns FALSE always, >>>>>> which should cause the respective sourceid to be auto-removed. >>>>> It seems to be necessary to release gsource somehow or other. >>>>> The similar liberation seems to be carried out in lrmd. >>>>> >>>>> Best Regards, >>>>> Hideo Yamauchi. >>>>> >>>>> >>>>> --- On Fri, 2012/4/27, Lars Ellenberg<lars.ellenb...@linbit.com> wrote: >>>>> >>>>>> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp >>>>>> wrote: >>>>>>> Hi All, >>>>>>> >>>>>>> We gave test that assumed remote cluster environment. >>>>>>> And we tested packet lost. >>>>>>> >>>>>>> The retransmission timer of Heartbeat causes memory leak. >>>>>>> >>>>>>> I donate a patch. >>>>>>> Please confirm the contents of the patch. >>>>>>> And please reflect a patch in a repository of Heartbeat. >>>>>> Have you actually been able to measure that memory leak you observed, >>>>>> and you can confirm this patch will fix it? >>>>>> >>>>>> Because I don't think this patch has any effect. >>>>>> >>>>>> send_rexmit_request() is only used as paramter to >>>>>> Gmain_timeout_add_full, and it returns FALSE always, >>>>>> which should cause the respective sourceid to be auto-removed. >>>>>> >>>>>> >>>>>>> diff -r 106ca984041b heartbeat/hb_rexmit.c >>>>>>> --- a/heartbeat/hb_rexmit.c Thu Apr 26 19:28:26 2012 +0900 >>>>>>> +++ b/heartbeat/hb_rexmit.c Thu Apr 26 19:31:44 2012 +0900 >>>>>>> @@ -164,6 +164,8 @@ >>>>>>> seqno_t seq = (seqno_t) ri->seq; >>>>>>> struct node_info* node = ri->node; >>>>>>> struct ha_msg* hmsg; >>>>>>> + unsigned long sourceid; >>>>>>> + gpointer value; >>>>>>> >>>>>>> if (STRNCMP_CONST(node->status, UPSTATUS) != 0&& >>>>>>> STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) { >>>>>>> @@ -196,11 +198,17 @@ >>>>>>> >>>>>>> node->track.last_rexmit_req = time_longclock(); >>>>>>> >>>>>>> - if (!g_hash_table_remove(rexmit_hash_table, ri)){ >>>>>>> - cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table" >>>>>>> - "for seq/node(%ld %s)", >>>>>>> - __FUNCTION__, ri->seq, ri->node->nodename); >>>>>>> - return FALSE; >>>>>>> + value = g_hash_table_lookup(rexmit_hash_table, ri); >>>>>>> + if ( value != NULL) { >>>>>>> + sourceid = (unsigned long) value; >>>>>>> + Gmain_timeout_remove(sourceid); >>>>>>> + >>>>>>> + if (!g_hash_table_remove(rexmit_hash_table, ri)){ >>>>>>> + cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table" >>>>>>> + "for seq/node(%ld %s)", >>>>>>> + __FUNCTION__, ri->seq, ri->node->nodename); >>>>>>> + return FALSE; >>>>>>> + } >>>>>>> } >>>>>>> >>>>>>> schedule_rexmit_request(node, seq, max_rexmit_delay); >> -- >> : Lars Ellenberg >> : LINBIT | Your Way to High Availability >> : DRBD/HA support and consulting http://www.linbit.com >> >> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. >> _______________________________________________________ >> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org >> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev >> Home Page: http://linux-ha.org/ >> > _______________________________________________________ > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev > Home Page: http://linux-ha.org/ -- Alan Robertson<al...@unix.sh> - @OSSAlanR "Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce _______________________________________________________ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/