---------- Forwarded message ---------- From: Escobio, Roger <[EMAIL PROTECTED]> Date: Thu, Sep 11, 2008 at 6:49 AM Subject: RE: [Ganglia-general] Anyone experience petabyte peaks in network metric in ganglia 3.x.y ? To: Martin Knoblauch <[EMAIL PROTECTED]>, Bernard Li <[EMAIL PROTECTED]>
> -----Original Message----- > From: Martin Knoblauch [mailto:[EMAIL PROTECTED] > Sent: September 11, 2008 6:58 AM > To: Escobio, Roger [CMB-IT]; Bernard Li > Cc: ganglia-developers@lists.sourceforge.net > Subject: Re: [Ganglia-general] Anyone experience petabyte > peaks in network metric in ganglia 3.x.y ? > > ----- Original Message ---- > > > From: "Escobio, Roger " <[EMAIL PROTECTED]> > > To: Bernard Li <[EMAIL PROTECTED]> > > Cc: [EMAIL PROTECTED] > > Sent: Wednesday, September 10, 2008 9:19:47 PM > > Subject: Re: [Ganglia-general] Anyone experience petabyte > peaks in network metric in ganglia 3.x.y ? > > > > > > > > > -----Original Message----- > > > From: Bernard Li [mailto:[EMAIL PROTECTED] > > > Sent: September 10, 2008 3:05 PM > > > To: Escobio, Roger [CMB-IT] > > > Cc: [EMAIL PROTECTED] > > > Subject: Re: [Ganglia-general] Anyone experience petabyte > > > peaks in network metric in ganglia 3.x.y ? > > > > > > Hi Roger: > > > > > > On Wed, Sep 10, 2008 at 11:52 AM, Martin Knoblauch > > > wrote: > > > > > > >> I created a patch again linux/metrics.c (3.1.1 > version) to add the > > > >> counterdiff function found in *bsd/metrics.c > > > >> Are you interested in it? Just let me know and I'll send > > > it to the list > > > >> > > > > > > > > Yes please. I am definitely like to have a look at your patch. > > > > > > In case the patch is too large to be sent to the mailing-list, you > > > could also file a bug and upload the patch via > bugzilla.ganglia.info. > > > > Well, is not big > > As I said, it is just a copy paste from *bsd code , so not > big deal :-) > > > > But at least compile and not coredump gmond in our linux :-) > > > Roger, [changed Mailing List to ganglia-developers] > > just to understand things right, your patch is only a code > cleanup and you still need the "#ifdef REMOVE_BOGUS_SPIKES" > to get rid of the spikes. Correct? Yes, probably :-( We do have peaks (normal work/load) so we can't go with that solution We need to do something that avoid this un-sense data Of course, if it is fixed in the root of the problem (kernel driver perhaps) will be better But, still, I think ganglia could do some sanity checks like compare the counterdiff with a max value the user set in the conf file (metric module conf) If I had free time maybe I can work on that, but the random appearance of those petabytes peaks make this problem difficult to troubleshoot. > > Some comments on the patch: Once again, I just copy paste from another ganglia file metrics.c file (freebsd I think) and made some cosmetic changes so, You should be right about your comments because I do not understand the ganglia code yet (very-very new to it) > > >--- libmetrics/linux/metrics.c-ori 2008-09-09 > 18:54:40.000000000 +0000 > >+++ libmetrics/linux/metrics.c 2008-09-09 19:09:44.000000000 +0000 > >@@ -222,40 +222,20 @@ > > if ( !ns ) return; > > > > rbi = strtoul( p, &p ,10); > >- if ( rbi >= ns->rbi ) { > >- l_bytes_in += rbi - ns->rbi; > >- } else { > >- debug_msg("update_ifdata(%s) - > Overflow in rbi: %lu -> %lu",caller,ns->rbi,rbi); > >- l_bytes_in += ULONG_MAX - ns->rbi + rbi; > >- } > >+ l_bytes_in = > counterdiff(rbi,ns->rbi,ULONG_MAX, 0); > > Shouldn't that be "+= counterdiff/..."? l_bytes_in is > cummulated over all NICs. You right, I can changed to += or follow what you have in *bsd In BSD you cumulate when make the rate (line 1223 of freebsd/metrics.c for example) But definitely I need to add the "+" at some point :-) I will do it where you recommend because is the most un-disturbing place to do it :-) > > ns->rpo = rpo; > > } > > p = index (p, '\n') + 1; // skips a line > >@@ -1305,3 +1285,40 @@ > > val.f = most_full; > > return val; > > } > >+ > >+static unsigned long > >+counterdiff(unsigned long oldval, unsigned long newval, > unsigned long maxval, unsigned long maxdiff) > >+{ > >+ unsigned long diff; > >+ > >+ if (maxdiff == 0) > >+ maxdiff = maxval; > >+ > >+ /* Paranoia */ > >+ if (oldval > maxval || newval > maxval) > >+ return 0; > > Really cannot happen with maxval being ULONG_MAX. Even the > paranoid should feel safe here :-) > > >+ > >+ /* > >+ * Tackle the easy case. Don't worry about maxdiff > here because > >+ * we're SOL if it happens (i.e. assuming a reset just makes > >+ * matters worse).This > >+ */ > >+ if (oldval <= newval) > >+ return (newval - oldval); > >+ > >+ /* > >+ * Now the tricky part. If we assume counters never > get reset, > >+ * this is easy. Unfortunaly, they do get reset on some > >+ * systems, so we need to try and deal with that. > Our huristic > >+ * is that if out difference is greater then maxdiff > and newval > >+ * is less or equal to maxdiff, then we've probably > been reset > >+ * rather then actually wrapping. Obviously, you need to be > >+ * careful to poll often enough that you won't > exceed maxdiff or > >+ * you will get undersized numbers when you do wrap. > >+ */ > >+ diff = maxval - oldval + newval; > >+ if (diff > maxdiff && newval <= maxdiff) > >+ return newval; > > "diff > maxdiff" cannot happen, as maxdiff is ULONG_MAX > > >+ > >+ return diff; > >+} Actually, I didn't add any checkout to the counterdiff function that wasn't there before :-) (in the bsd code ) or are those checks valid in BSD only? But I think this is the best place to place any sanity check we might want to add, right? Actually, now that I saw the BSD code more close, I fail to "copy&paste and addapt" the line where it is defined the counterdiff function BSD: static uint64_t counterdiff(uint64_t, uint64_t, uint64_t, uint64_t); Linux: static unsigned long counterdiff(unsigned long, unsigned long, unsigned long, unsigned long); Is that correct? Thanks Roger PD: I am not subscribed to ganglia-developers so, please, can you forward this email to there? Thanks PD: I am sending a new patch to you in this email, thanks
ganglia-libmetrics_linux_metrics.patch
Description: Binary data
------------------------------------------------------------------------- This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________ Ganglia-developers mailing list Ganglia-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-developers