This is very interesting.  My apologies for missing this memory leak 
:-(.  The code logs memory usage periodically exactly to help notice 
such a thing.

In my new open source project [http://assimmon.org], I am death on 
memory leaks.  But I can assure you that back when that code was 
written, it was not at all clear who deleted what memory when - when it 
came to the glib.  I'm not sure if valgrind was out back then, but I 
certainly didn't know about it.

I confess that even on this new project I had a heck of a time making 
all the glib objects go away.  I finally got them cleaned up - but it 
took weeks of running under valgrind before I worked out when to do what 
to make it throw the objects away - but not crash due to a bad reference.

By the way, I suspect Lars' suggestion would work fine.  I would 
certainly explain what the "better" patch is in the comments when you 
apply this one.


On 05/02/2012 04:57 PM, renayama19661...@ybb.ne.jp wrote:
> Hi Lars,
>
> Thank you for comments.
>
>>> And when it passes more than a full day....
>>>
>>> * node1
>>> 32126 ?        SLs   79:52      0   182 71189 24328  0.1 heartbeat: master 
>>> control process                       
>>>
>>> * node2
>>> 31928 ?        SLs   77:01      0   182 70869 24008  0.1 heartbeat: master 
>>> control process
>> Oh, I see.
>>
>> This is a "design choice" (maybe not even intentional) of the Gmain_*
>> wrappers used throughout the heartbeat code.
>>
>> The "real" glib g_timeout_add_full(), and most other similar functions,
>> internally do
>>   id = g_source_attach(source, ...);
>>   g_source_unref(source);
>>   return id;
>>
>> Thus in g_main_dispatch, the
>>   need_destroy = ! dispatch (...)
>>   if (need_destroy)
>>       g_source_destroy_internal()
>>
>> in fact ends up destroying it,
>> if dispatch() returns FALSE,
>> as documented:
>>      The function is called repeatedly until it returns FALSE, at
>>      which point the timeout is automatically destroyed and the
>>      function will not be called again.
>>
>> Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full.
>> It does not g_source_unref(), so we keep the extra reference around
>> until someone explicitly calls Gmain_timeout_remove().
>>
>> Talk about principle of least surprise :(
>>
>> Changing this behaviour to match glib's, i.e. unref'ing after
>> g_source_attach, would seem like the "correct" thing to do,
>> but is likely to break other pieces of code in subtle ways,
>> so it may not be the "right" thing to do at this point.
> Thank you for detailed explanation.
> If you found the method that is appropriate than the correction that I 
> suggested, I approve of it.
>
>> I'm going to take your patch more or less as is.
>> If it does not show up soon, prod me again.
>>
> All right.
>
> Many Thanks!
> Hideo Yamauchi.
>
>
>> Thank you for tracking this down.
>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>> --- On Tue, 2012/5/1, 
>>> renayama19661...@ybb.ne.jp<renayama19661...@ybb.ne.jp>  wrote:
>>>
>>>> Hi Lars,
>>>>
>>>> We confirmed that this problem occurred with v1 mode of Heartbeat.
>>>>    * The problem happens with the v2 mode in the same way.
>>>>
>>>> We confirmed a problem in the next procedure.
>>>>
>>>> Step 1) Put a special device extinguishing a communication packet of 
>>>> Heartbeat in the network.
>>>>
>>>> Step 2) Between nodes, the retransmission of the message is carried out 
>>>> repeatedly.
>>>>
>>>> Step 3) Then the memory of the master process increases little by little.
>>>>
>>>>
>>>> -------- As a result of the ps command of the master process ----------
>>>> * node1
>>>> (start)
>>>> 32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master 
>>>> control process
>>>> (One hour later)
>>>> 32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master 
>>>> control process
>>>> (Two hour later)
>>>> 32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master 
>>>> control process
>>>> (Four hours later)
>>>> 32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master 
>>>> control process
>>>>
>>>> * node2
>>>> (start)
>>>> 31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master 
>>>> control process
>>>> (One hour later)
>>>> 31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master 
>>>> control process
>>>> (Two hour later)
>>>> 31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master 
>>>> control process
>>>> (Four hours later)
>>>> 31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master 
>>>> control process
>>>>
>>>>
>>>> The state of the memory leak seems to vary according to a node with the 
>>>> quantity of the retransmission.
>>>>
>>>> The increase of this memory disappears by applying my patch.
>>>>
>>>> And the similar correspondence seems to be necessary in 
>>>> send_reqnodes_msg(), but this is like little leak.
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>> --- On Sat, 2012/4/28, 
>>>> renayama19661...@ybb.ne.jp<renayama19661...@ybb.ne.jp>  wrote:
>>>>
>>>>> Hi Lars,
>>>>>
>>>>> Thank you for comments.
>>>>>
>>>>>> Have you actually been able to measure that memory leak you observed,
>>>>>> and you can confirm this patch will fix it?
>>>>>>
>>>>>> Because I don't think this patch has any effect.
>>>>> Yes.
>>>>> I really measured leak.
>>>>> I can show a result next week.
>>>>> #Japan is a holiday until Tuesday.
>>>>>
>>>>>> send_rexmit_request() is only used as paramter to
>>>>>> Gmain_timeout_add_full, and it returns FALSE always,
>>>>>> which should cause the respective sourceid to be auto-removed.
>>>>> It seems to be necessary to release gsource somehow or other.
>>>>> The similar liberation seems to be carried out in lrmd.
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>>>>
>>>>> --- On Fri, 2012/4/27, Lars Ellenberg<lars.ellenb...@linbit.com>  wrote:
>>>>>
>>>>>> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp 
>>>>>> wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We gave test that assumed remote cluster environment.
>>>>>>> And we tested packet lost.
>>>>>>>
>>>>>>> The retransmission timer of Heartbeat causes memory leak.
>>>>>>>
>>>>>>> I donate a patch.
>>>>>>> Please confirm the contents of the patch.
>>>>>>> And please reflect a patch in a repository of Heartbeat.
>>>>>> Have you actually been able to measure that memory leak you observed,
>>>>>> and you can confirm this patch will fix it?
>>>>>>
>>>>>> Because I don't think this patch has any effect.
>>>>>>
>>>>>> send_rexmit_request() is only used as paramter to
>>>>>> Gmain_timeout_add_full, and it returns FALSE always,
>>>>>> which should cause the respective sourceid to be auto-removed.
>>>>>>
>>>>>>
>>>>>>> diff -r 106ca984041b heartbeat/hb_rexmit.c
>>>>>>> --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
>>>>>>> +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
>>>>>>> @@ -164,6 +164,8 @@
>>>>>>>        seqno_t seq = (seqno_t) ri->seq;
>>>>>>>        struct node_info* node = ri->node;
>>>>>>>        struct ha_msg*    hmsg;
>>>>>>> +    unsigned long           sourceid;
>>>>>>> +    gpointer value;
>>>>>>>   
>>>>>>>        if (STRNCMP_CONST(node->status, UPSTATUS) != 0&&
>>>>>>>            STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
>>>>>>> @@ -196,11 +198,17 @@
>>>>>>>       
>>>>>>>        node->track.last_rexmit_req = time_longclock();   
>>>>>>>       
>>>>>>> -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>>>>> -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>>>>> -               "for seq/node(%ld %s)",                
>>>>>>> -               __FUNCTION__, ri->seq, ri->node->nodename);
>>>>>>> -        return FALSE;
>>>>>>> +    value = g_hash_table_lookup(rexmit_hash_table, ri);
>>>>>>> +    if ( value != NULL) {
>>>>>>> +        sourceid = (unsigned long) value;
>>>>>>> +        Gmain_timeout_remove(sourceid);
>>>>>>> +
>>>>>>> +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>>>>> +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>>>>> +                       "for seq/node(%ld %s)",                
>>>>>>> +                       __FUNCTION__, ri->seq, ri->node->nodename);
>>>>>>> +            return FALSE;
>>>>>>> +        }
>>>>>>>        }
>>>>>>>       
>>>>>>>        schedule_rexmit_request(node, seq, max_rexmit_delay);
>> -- 
>> : Lars Ellenberg
>> : LINBIT | Your Way to High Availability
>> : DRBD/HA support and consulting http://www.linbit.com
>>
>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>> _______________________________________________________
>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>> Home Page: http://linux-ha.org/
>>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/


-- 
     Alan Robertson<al...@unix.sh>  - @OSSAlanR

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Reply via email to