Hi Lars,

We confirmed that this problem occurs in the v1 mode of Heartbeat.
 * The problem occurs in the v2 mode in the same way.

We reproduced the problem with the following procedure.

Step 1) Place a special device in the network that drops the communication 
packets of Heartbeat.

Step 2) The nodes then retransmit messages to each other repeatedly.

Step 3) The memory of the master control process then grows little by little.


-------- ps output for the master control process ----------
* node1
(start)
32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
(One hour later)
32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master control process
(Two hours later)
32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master control process
(Four hours later)
32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master control process

* node2
(start)
31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
(One hour later)
31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master control process
(Two hours later)
31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master control process
(Four hours later)
31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master control process


The size of the memory leak seems to vary from node to node, depending on 
the number of retransmissions.
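
For comparing the two nodes, the ps samples can be turned into an average growth rate. The following is only a small sketch in plain C. It assumes that the eighth column of the ps output is the resident set size (RSS) in KiB, which the output itself does not state. The names rss_sample and rss_growth_per_hour are made up for this illustration and are not part of Heartbeat.

```c
#include <assert.h>
#include <stddef.h>

/* One ps sample: hours elapsed since the first sample, and RSS in KiB.
 * Assumption: the eighth ps column in the output above is RSS in KiB. */
struct rss_sample {
    double hours;
    long   rss_kib;
};

/* Average RSS growth in KiB per hour between the first and last sample. */
double rss_growth_per_hour(const struct rss_sample *s, size_t n)
{
    assert(s != NULL && n >= 2);
    assert(s[n - 1].hours > s[0].hours);
    return (double)(s[n - 1].rss_kib - s[0].rss_kib)
           / (s[n - 1].hours - s[0].hours);
}

/* Samples copied from the ps output above. */
const struct rss_sample node1_samples[] = {
    {0.0, 7128}, {1.0, 7868}, {2.0, 8456}, {4.0, 9812},
};
const struct rss_sample node2_samples[] = {
    {0.0, 7128}, {1.0, 7620}, {2.0, 8492}, {4.0, 9828},
};
```

Under that assumption, rss_growth_per_hour(node1_samples, 4) gives 671 KiB per hour and rss_growth_per_hour(node2_samples, 4) gives 675 KiB per hour over the four hours measured.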

The memory growth disappears once my patch is applied.

A similar fix seems to be necessary in send_reqnodes_msg(), but the leak 
there appears to be small.
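
One way an explicit release could matter even when the timer callback always returns FALSE can be shown with a toy reference-counting model. This is only a sketch in plain C and is not Heartbeat or GLib code. All toy_* names are hypothetical. Whether Gmain_timeout_add_full() really keeps an extra reference like this is an assumption and is not confirmed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a reference-counted timer source (hypothetical). */
struct toy_source {
    int  refcount;   /* the source is freed when this drops to zero */
    bool attached;   /* true while the toy main loop holds a reference */
};

int toy_live_sources = 0;   /* number of sources not yet freed */

struct toy_source *toy_source_new(void)
{
    struct toy_source *src = malloc(sizeof(*src));
    assert(src != NULL);
    src->refcount = 1;       /* reference owned by the creator */
    src->attached = false;
    toy_live_sources++;
    return src;
}

void toy_source_unref(struct toy_source *src)
{
    if (--src->refcount == 0) {
        toy_live_sources--;
        free(src);
    }
}

void toy_source_attach(struct toy_source *src)
{
    src->refcount++;         /* the loop takes its own reference */
    src->attached = true;
}

/* Dispatch once: a callback that returns false detaches the source,
 * which drops only the reference held by the loop. */
void toy_source_dispatch(struct toy_source *src, bool (*cb)(void))
{
    if (src->attached && !cb()) {
        src->attached = false;
        toy_source_unref(src);
    }
}

bool toy_callback_returns_false(void)
{
    return false;
}
```

In this model, a source that is created, attached, and dispatched once is detached automatically, but it stays allocated until the creator also calls toy_source_unref(). If the real code behaves in a comparable way, each retransmission timer would leave one small allocation behind unless it is released explicitly.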

Best Regards,
Hideo Yamauchi.


--- On Sat, 2012/4/28, [email protected] <[email protected]> wrote:

> Hi Lars,
> 
> Thank you for your comments.
> 
> > Have you actually been able to measure that memory leak you observed,
> > and you can confirm this patch will fix it?
> > 
> > Because I don't think this patch has any effect.
> 
> Yes.
> I actually measured the leak.
> I can show the results next week.
> # Japan is on holiday until Tuesday.
> 
> > 
> > send_rexmit_request() is only used as a parameter to
> > Gmain_timeout_add_full, and it returns FALSE always,
> > which should cause the respective sourceid to be auto-removed.
> 
> It seems to be necessary to release the GSource explicitly in some way.
> A similar release seems to be carried out in lrmd.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> --- On Fri, 2012/4/27, Lars Ellenberg <[email protected]> wrote:
> 
> > On Thu, Apr 26, 2012 at 10:56:30AM +0900, [email protected] wrote:
> > > Hi All,
> > > 
> > > We ran a test that assumed a remote cluster environment,
> > > and we tested packet loss.
> > > 
> > > The retransmission timer of Heartbeat causes a memory leak.
> > > 
> > > I am contributing a patch.
> > > Please review the contents of the patch,
> > > and please apply it to the Heartbeat repository.
> > 
> > Have you actually been able to measure that memory leak you observed,
> > and you can confirm this patch will fix it?
> > 
> > Because I don't think this patch has any effect.
> > 
> > send_rexmit_request() is only used as a parameter to
> > Gmain_timeout_add_full, and it returns FALSE always,
> > which should cause the respective sourceid to be auto-removed.
> > 
> > 
> > > diff -r 106ca984041b heartbeat/hb_rexmit.c
> > > --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
> > > +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
> > > @@ -164,6 +164,8 @@
> > >      seqno_t seq = (seqno_t) ri->seq;
> > >      struct node_info* node = ri->node;
> > >      struct ha_msg*    hmsg;
> > > +    unsigned long           sourceid;
> > > +    gpointer value;
> > >  
> > >      if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
> > >          STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> > > @@ -196,11 +198,17 @@
> > >      
> > >      node->track.last_rexmit_req = time_longclock();    
> > >      
> > > -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > -               "for seq/node(%ld %s)",                
> > > -               __FUNCTION__, ri->seq, ri->node->nodename);
> > > -        return FALSE;
> > > +    value = g_hash_table_lookup(rexmit_hash_table, ri);
> > > +    if ( value != NULL) {
> > > +        sourceid = (unsigned long) value;
> > > +        Gmain_timeout_remove(sourceid);
> > > +
> > > +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > +                       "for seq/node(%ld %s)",                
> > > +                       __FUNCTION__, ri->seq, ri->node->nodename);
> > > +            return FALSE;
> > > +        }
> > >      }
> > >      
> > >      schedule_rexmit_request(node, seq, max_rexmit_delay);
> > 
> > 
> > -- 
> > : Lars Ellenberg
> > : LINBIT | Your Way to High Availability
> > : DRBD/HA support and consulting http://www.linbit.com
> > 
> > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > _______________________________________________________
> > Linux-HA-Dev: [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > Home Page: http://linux-ha.org/
> > 
