---------- Forwarded message ----------
From: Escobio, Roger <[EMAIL PROTECTED]>
Date: Thu, Sep 11, 2008 at 6:49 AM
Subject: RE: [Ganglia-general] Anyone experience petabyte peaks in
network metric in ganglia 3.x.y ?
To: Martin Knoblauch <[EMAIL PROTECTED]>, Bernard Li <[EMAIL PROTECTED]>



> -----Original Message-----
> From: Martin Knoblauch [mailto:[EMAIL PROTECTED]
> Sent: September 11, 2008 6:58 AM
> To: Escobio, Roger [CMB-IT]; Bernard Li
> Cc: ganglia-developers@lists.sourceforge.net
> Subject: Re: [Ganglia-general] Anyone experience petabyte
> peaks in network metric in ganglia 3.x.y ?
>
> ----- Original Message ----
>
> > From: "Escobio, Roger " <[EMAIL PROTECTED]>
> > To: Bernard Li <[EMAIL PROTECTED]>
> > Cc: [EMAIL PROTECTED]
> > Sent: Wednesday, September 10, 2008 9:19:47 PM
> > Subject: Re: [Ganglia-general] Anyone experience petabyte
> peaks in network metric in ganglia 3.x.y ?
> >
> >
> >
> > > -----Original Message-----
> > > From: Bernard Li [mailto:[EMAIL PROTECTED]
> > > Sent: September 10, 2008 3:05 PM
> > > To: Escobio, Roger [CMB-IT]
> > > Cc: [EMAIL PROTECTED]
> > > Subject: Re: [Ganglia-general] Anyone experience petabyte
> > > peaks in network metric in ganglia 3.x.y ?
> > >
> > > Hi Roger:
> > >
> > > On Wed, Sep 10, 2008 at 11:52 AM, Martin Knoblauch
> > > wrote:
> > >
> > > >> I created a patch again linux/metrics.c (3.1.1
> version) to add the
> > > >> counterdiff function found in *bsd/metrics.c
> > > >> Are you interested in it? Just let me know and I'll send
> > > it to the list
> > > >>
> > > >
> > > >  Yes please. I am definitely like to have a look at your patch.
> > >
> > > In case the patch is too large to be sent to the mailing-list, you
> > > could also file a bug and upload the patch via
> bugzilla.ganglia.info.
> >
> > Well, is not big
> > As I said, it is just a copy paste from *bsd code , so not
> big deal :-)
> >
> > But at least compile and not coredump gmond in our linux :-)
> >
> Roger, [changed Mailing List to ganglia-developers]
>
>  just to understand things right, your patch is only a code
> cleanup and you still need the "#ifdef REMOVE_BOGUS_SPIKES"
> to get rid of the spikes. Correct?

Yes, probably :-(
We do have peaks (normal work/load) so we can't go with that solution
We need to do something that avoid this un-sense data
Of course, if it is fixed in the root of the problem (kernel driver
perhaps) will be better
But, still, I think ganglia could do some sanity checks like compare the
counterdiff with a max value the user set in the conf file (metric
module conf)
If I had free time maybe I can work on that, but the random appearance
of those petabytes peaks make this problem difficult to troubleshoot.

>
>  Some comments on the patch:

Once again, I just copy paste from another ganglia file metrics.c file
(freebsd I think) and made some cosmetic changes so,
You should be right about your comments because I do not understand the
ganglia code yet (very-very new to it)

>
> >--- libmetrics/linux/metrics.c-ori      2008-09-09
> 18:54:40.000000000 +0000
> >+++ libmetrics/linux/metrics.c  2008-09-09 19:09:44.000000000 +0000
> >@@ -222,40 +222,20 @@
> >                     if ( !ns ) return;
> >
> >                     rbi = strtoul( p, &p ,10);
> >-                    if ( rbi >= ns->rbi ) {
> >-                       l_bytes_in += rbi - ns->rbi;
> >-                    } else {
> >-                       debug_msg("update_ifdata(%s) -
> Overflow in rbi: %lu -> %lu",caller,ns->rbi,rbi);
> >-                       l_bytes_in += ULONG_MAX - ns->rbi + rbi;
> >-                    }
> >+                    l_bytes_in =
> counterdiff(rbi,ns->rbi,ULONG_MAX, 0);
>
>  Shouldn't that be "+= counterdiff/..."? l_bytes_in is
> cummulated over all NICs.

You right, I can changed to += or follow what you have in *bsd
In BSD you cumulate when make the rate (line 1223 of freebsd/metrics.c
for example)
But definitely I need to add the "+" at some point :-)
I will do it where you recommend because is the most un-disturbing place
to do it :-)


> >                     ns->rpo = rpo;
> >                  }
> >               p = index (p, '\n') + 1;    // skips a line
> >@@ -1305,3 +1285,40 @@
> >    val.f = most_full;
> >    return val;
> > }
> >+
> >+static unsigned long
> >+counterdiff(unsigned long oldval, unsigned long newval,
> unsigned long maxval, unsigned long maxdiff)
> >+{
> >+       unsigned long diff;
> >+
> >+       if (maxdiff == 0)
> >+               maxdiff = maxval;
> >+
> >+       /* Paranoia */
> >+       if (oldval > maxval || newval > maxval)
> >+               return 0;
>
> Really cannot happen with maxval being ULONG_MAX. Even the
> paranoid should feel safe here :-)
>
> >+
> >+       /*
> >+        * Tackle the easy case.  Don't worry about maxdiff
> here because
> >+        * we're SOL if it happens (i.e. assuming a reset just makes
> >+        * matters worse).This
> >+        */
> >+       if (oldval <= newval)
> >+               return (newval - oldval);
> >+
> >+       /*
> >+        * Now the tricky part.  If we assume counters never
> get reset,
> >+        * this is easy.  Unfortunaly, they do get reset on some
> >+        * systems, so we need to try and deal with that.
> Our huristic
> >+        * is that if out difference is greater then maxdiff
> and newval
> >+        * is less or equal to maxdiff, then we've probably
> been reset
> >+        * rather then actually wrapping.  Obviously, you need to be
> >+        * careful to poll often enough that you won't
> exceed maxdiff or
> >+        * you will get undersized numbers when you do wrap.
> >+        */
> >+       diff = maxval - oldval + newval;
> >+       if (diff > maxdiff && newval <= maxdiff)
> >+               return newval;
>
> "diff > maxdiff" cannot happen, as maxdiff is ULONG_MAX
>
> >+
> >+       return diff;
> >+}

Actually, I didn't add any checkout to the counterdiff function that
wasn't there before :-)
(in the bsd code ) or are those checks valid in BSD only?
But I think this is the best place to place any sanity check we might
want to add, right?

Actually, now that I saw the BSD code more close, I fail to "copy&paste
and addapt" the line where it is defined the counterdiff function

BSD:
static uint64_t counterdiff(uint64_t, uint64_t, uint64_t, uint64_t);

Linux:
static unsigned long counterdiff(unsigned long, unsigned long, unsigned
long, unsigned long);

Is that correct?
Thanks
Roger
PD: I am not subscribed to ganglia-developers so, please, can you
forward this email to there? Thanks
PD: I am sending a new patch to you in this email, thanks

Attachment: ganglia-libmetrics_linux_metrics.patch
Description: Binary data

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers

Reply via email to