Hi Michael,

 I guess the POWER5 extensions would be good candidates for dynamic
loading into the gmond stream. In any case, I see no reason not to keep
them in the core code, even if they are not enabled by default.

 One thing I like better about the current code is its "combined"
functions for retrieving related metrics (get all CPU and network
stats) at one point in time. They reduce syscall overhead and keep
metrics together (important for CPU usage).
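To illustrate the "combined" approach, here is a minimal sketch of one shared snapshot that is refreshed once per collection cycle and read by several metric functions. Everything in it is hypothetical: `snapshot_t`, `read_all_counters()` (a stub standing in for one platform query, e.g. a libperfstat call on AIX), and the cycle counter are illustrative names, not existing gmond code.

```c
#include <assert.h>

/* Hypothetical raw counters; stands in for whatever one platform query
 * returns in a single call. */
typedef struct {
    unsigned long long cpu_user, cpu_sys, cpu_idle;
    unsigned long long net_ibytes, net_obytes;
} snapshot_t;

static int collector_calls = 0;   /* counts simulated system queries */

/* Hypothetical stub for one expensive system query; fixed values here. */
static void read_all_counters(snapshot_t *s)
{
    collector_calls++;
    s->cpu_user = 100; s->cpu_sys = 50; s->cpu_idle = 850;
    s->net_ibytes = 4096; s->net_obytes = 2048;
}

static snapshot_t cache;
static unsigned cycle = 0;                   /* current collection cycle */
static unsigned cached_cycle = (unsigned)-1; /* cycle the cache belongs to */

/* Call once at the start of each collection cycle. */
static void begin_cycle(void) { cycle++; }

/* Refresh the shared snapshot at most once per cycle, so every metric
 * read in the same cycle sees counters taken at the same instant. */
static const snapshot_t *get_snapshot(void)
{
    if (cached_cycle != cycle) {
        read_all_counters(&cache);
        cached_cycle = cycle;
    }
    return &cache;
}

/* Share of total CPU ticks. Real CPU metrics would difference two
 * snapshots; cumulative values are used here for brevity. */
static double pct_of_total(unsigned long long part)
{
    const snapshot_t *s = get_snapshot();
    unsigned long long total = s->cpu_user + s->cpu_sys + s->cpu_idle;
    assert(total > 0);
    return 100.0 * (double)part / (double)total;
}

static double cpu_user_pct(void) { return pct_of_total(get_snapshot()->cpu_user); }
static double cpu_idle_pct(void) { return pct_of_total(get_snapshot()->cpu_idle); }
```

Reading `cpu_user_pct()` and `cpu_idle_pct()` in the same cycle costs one query instead of two, and both values come from the same moment, which is what keeps related CPU metrics consistent with each other.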

Cheers
Martin
--- Michael Perzl <[EMAIL PROTECTED]> wrote:

> Hi Martin,
> 
> if possible, I would like to take my version (after some review) :-) as
> the basis, as it already contains all the new POWER5 stuff.
> 
> My understanding is that, as it would require some changes to protocol.x,
> my changes won't have a chance to get into the core Ganglia source code
> until version 3.1 comes along.
> 
> This code and everything else (RPMs) can be found on my website 
> http://www.perzl.org/ganglia/.
> 
> This stuff is actually already in use at quite a few customer sites
> (running on AIX 4.3.3, 5.1, 5.2 and 5.3), so I would like to keep the
> POWER5 stuff in if possible. Actually, an AIX gmond implementation
> without the POWER5 stuff, based on my implementation, could be done
> very easily (just strip off the POWER5 add-ons).
> 
> Regards,
> Michael
> 
> Martin Knoblauch wrote:
> > Michael, Andreas,
> >
> >  any chance that you could consolidate the two versions of the AIX
> > metrics that seem to be around? It seems you are the ones who have
> > worked most recently on the AIX implementation.
> >
> > Cheers
> > Martin
> >
> > --- Michael Perzl <[EMAIL PROTECTED]> wrote:
> >
> >   
> >> Andreas,
> >>
> >> thank you for taking the blame but you are off the hook here.  ;-)
> >>
> >> If I understood David correctly, he is using my AIX Ganglia RPM
> >> packages with POWER5 extensions. In these, most if not all of the
> >> code that collects the metrics under AIX has been changed. Everything
> >> is documented on my homepage (http://www.perzl.org/ganglia/) though.
> >> So everything that goes wrong here is entirely my fault :-[
> >>
> >> After some investigating and some discussions with Nigel, I have
> >> arrived at the following conclusions regarding the bytes_in/bytes_out
> >> problem:
> >> - libperfstat (the library on AIX that obtains all the system
> >>   performance data) uses u_longlong_t data types (these are
> >>   definitely 64 bits wide).
> >> - The AIX kernel internally, though, is probably not using 64-bit
> >>   data types; more likely it uses unsigned 32-bit counters in order
> >>   not to break compatibility (my personal opinion).
> >> - The consequence is that integer overflow (wrap-around) occurs far
> >>   more readily with 32-bit data types than with 64-bit data types
> >>   (with 64 bits, we probably won't live long enough to see it happen).
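The effect described in the last point can be reproduced in a few lines. This sketch assumes, as above, that the kernel keeps a 32-bit counter that is handed back zero-extended in a 64-bit variable; `uint64_t` stands in for `u_longlong_t`, and `sample_counter32()` / `naive_delta()` are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Simulate a 32-bit kernel counter returned zero-extended in a 64-bit
 * variable: only the low 32 bits of the true running total survive. */
static uint64_t sample_counter32(uint64_t true_total)
{
    return (uint64_t)(uint32_t)true_total;
}

/* Plain 64-bit difference of two samples: wraps modulo 2^64 when the
 * newer sample is smaller than the older one. */
static uint64_t naive_delta(uint64_t last, uint64_t cur)
{
    return cur - last;
}
```

Sampling just below 2^32 and again after 1000 more bytes gives `last = 4294967000` and `cur = 704`. The naive difference is then about 1.8 x 10^19 instead of 1000, so even over a sampling interval of tens of seconds the reported rate is far beyond petabytes per second.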
> >>
> >> Please take a look at my implementation of the bytes_in metric (the
> >> bytes_out implementation is analogous):
> >>
> >> 01  g_val_t
> >> 02  bytes_in_func( void )
> >> 03  {
> >> 04     g_val_t val;
> >> 05     perfstat_netinterface_total_t n;
> >> 06     static u_longlong_t last_bytes_in = 0, bytes_in;
> >> 07     static double last_time = 0.0;
> >> 08     double now, delta_t;
> >> 09     struct timeval timeValue;
> >> 10     struct timezone timeZone;
> >> 11
> >> 12     gettimeofday( &timeValue, &timeZone );
> >> 13
> >> 14     now = (double) (timeValue.tv_sec - boottime) + (timeValue.tv_usec / 1000000.0);
> >> 15
> >> 16     if (perfstat_netinterface_total( NULL, &n, sizeof( perfstat_netinterface_total_t ), 1 ) == -1)
> >> 17        val.f = 0.0;
> >> 18     else
> >> 19     {
> >> 20        bytes_in = n.ibytes;
> >> 21
> >> 22        delta_t = now - last_time;
> >> 23
> >> 24        if ( delta_t )
> >> 25           val.f = (double) (bytes_in - last_bytes_in) / delta_t;
> >> 26        else
> >> 27           val.f = 0.0;
> >> 28
> >> 29        last_bytes_in = bytes_in;
> >> 30     }
> >> 31
> >> 32     last_time = now;
> >> 33
> >> 34     return( val );
> >> 35  }
> >>
> >> In my opinion the overflow occurs in line #25 when "bytes_in <
> >> last_bytes_in". In my naivety I had assumed that, since both are of
> >> type u_longlong_t, an integer overflow could never happen.
> >>
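> >> So to handle the overflow, a check for "bytes_in < last_bytes_in"
> >> must be introduced, something like:
> >>
> >> u_longlong_t d;
> >> d = bytes_in - last_bytes_in;     /* wraps modulo 2^64 if bytes_in < last_bytes_in */
> >> if (bytes_in < last_bytes_in)     /* counter wrapped since the last sample */
> >>     d += 4294967296ULL;           /* add 2^32, assuming a 32-bit kernel counter */
> >>
> >> and line #25 would essentially become
> >> 25           val.f = (double) d / delta_t;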
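A wrap-aware version of the rate calculation could look like the sketch below. It assumes the counter width is known (32 bits if the kernel counter really is 32-bit, 64 if not); `counter_delta()` and `counter_rate()` are hypothetical helpers, not existing libmetrics functions, and `uint64_t` stands in for `u_longlong_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Difference between two samples of a monotonically increasing counter
 * that is `width` bits wide (1..64), tolerating at most one wrap between
 * the samples. */
static uint64_t counter_delta(uint64_t last, uint64_t cur, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = (width == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);
    /* Unsigned subtraction is modulo 2^64; masking reduces it modulo
     * 2^width, which yields the true delta across one wrap. */
    return (cur - last) & mask;
}

/* Rate in units per second; 0.0 if no time has elapsed. */
static double counter_rate(uint64_t last, uint64_t cur, unsigned width,
                           double delta_t)
{
    if (delta_t <= 0.0)
        return 0.0;
    return (double)counter_delta(last, cur, width) / delta_t;
}
```

For a 32-bit counter, masking gives the same result as adding 2^32 whenever bytes_in < last_bytes_in. Like that check, it cannot detect more than one wrap per sampling interval, and it cannot tell a wrap from a counter reset (for example after a reboot).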
> >>
> >> Comments ?
> >>
> >> Regards,
> >> Michael
> >>
> >> PS: David, the reason why you don't see it happen with pkts_in and
> >> pkts_out is probably that no overflow has occurred so far, but at
> >> some point it will happen there too.
> >>
> >> PPS: David, if this is a solution (I want some comments on it first,
> >> though), I would build new RPMs with the then hopefully correct code.
> >>
> >> Andreas Schoenfeld wrote:
> >>> Hi David and Martin,
> >>>
> >>> I suppose the network code is still the code I wrote, so there are two
> >>> problems I know of:
> >>> 1. yes, there is a problem with overflows
> >>> 2. the network traffic shown is the sum of all network interfaces,
> >>>    including local loopback devices (lo0...).
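Excluding loopback traffic means summing per-interface counters instead of using a single total. On AIX, per-interface data is available from libperfstat's `perfstat_netinterface()`, but the record type below, `iface_sample_t`, is a hypothetical stand-in. The "name starts with lo" rule is also an assumption; an interface-type field, where available, would be more robust.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one per-interface record. */
typedef struct {
    char name[32];
    unsigned long long ibytes;
    unsigned long long obytes;
} iface_sample_t;

/* Treat interfaces named lo0, lo1, ... as loopback (name-based guess). */
static int is_loopback(const iface_sample_t *s)
{
    return strncmp(s->name, "lo", 2) == 0;
}

/* Sum received bytes over all non-loopback interfaces. */
static unsigned long long total_ibytes_excluding_lo(const iface_sample_t *v,
                                                    size_t n)
{
    assert(v != NULL || n == 0);
    unsigned long long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (!is_loopback(&v[i]))
            sum += v[i].ibytes;
    return sum;
}
```

One design caveat: if the per-interface counters are 32-bit, summing first and then differencing can hide a wrap in a single interface's counter. Computing a wrap-aware delta per interface and summing the deltas avoids that.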
> >>>
> >>> Both problems could lead to astonishing data transfer rates in ganglia.
> >>> Sorry, I had promised to fix the problems, but there was too much other
> >>> work ...
> >>>
> >>> Best regards
> >>>    Andreas
> >>>
> >>>> Date: Thu, 29 Mar 2007 08:21:38 -0700 (PDT)
> >>>> From: Martin Knoblauch <[EMAIL PROTECTED]>
> >>>> Subject: Re: [Ganglia-general] Help! I have a petabyte/s network
> >>>> To: David Wong <[EMAIL PROTECTED]>, [EMAIL PROTECTED],
> >>>>  [EMAIL PROTECTED]
> >>>> Message-ID: <[EMAIL PROTECTED]>
> >>>> Content-Type: text/plain; charset=iso-8859-1
> >>>>
> >>>> David,
> >>>>
> >>>>  good catch. I will have to look at it for a bit.
> >>>>
> >>>> Cheers
> >>>> Martin
> >>>> --- David Wong <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>>> I don't write much code nowadays, so I'm going to need a lot of
> >>>>>> help with this.
> >>>>>>
> >>>>>> I dug through the ganglia code and I found this interesting tidbit
> >>>>>> in libmetrics/aix/metrics.c, which may be indicative of the problem.
> >>>>>>
> >>>>>> There's an assignment from cur_ninfo.ibytes to cur_net_stat.ibytes,
> >>>>>> but the types of the two variables are different.
> >>>>>>
> >>>>>> net_stat::ibytes is a double: 
> >>>>>>
> >>>>>> struct net_stat{
> >>>>>>   double ipackets;
> >>>>>>   double opackets;
> >>>>>>   double ibytes;
> >>>>>>   double obytes;
> >>>>>> } cur_net_stat;
> >>>>>>
> >>>>>> and we have *ninfo declared here:
> >>>>>>
> >>>>>> perfstat_netinterface_total_t ninfo[2], *last_ninfo, *cur_ninfo;
> >>>>>>
> >>>>>> libperfstat.h has perfstat_netinterface_total_t::ibytes as
> >>>>>> u_longlong_t.
> >>>>>>
> >>>>>> Does this code try to do what I think it is doing, i.e. assign an
> >>>>>> unsigned 64-bit integer to a (signed) 64-bit double?
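In C, assigning a u_longlong_t to a double is well defined: every integer up to 2^53 (about 9 x 10^15, roughly 9 PB when counting bytes) converts exactly, and larger values are rounded rather than wrapped. So the assignment by itself probably loses nothing for realistic byte counts, and a wrapped subtraction is the likelier source of petabyte/s readings. The hypothetical helper below checks the exactness claim, assuming IEEE 754 doubles evaluated in double precision (e.g. SSE on x86-64).

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if v survives a round trip through double unchanged. */
static int converts_exactly(uint64_t v)
{
    double d = (double)v;
    /* Converting d back to uint64_t is undefined for d >= 2^64, which
     * UINT64_MAX rounds up to, so rule that case out first. */
    if (d >= 18446744073709551616.0)   /* 2^64 */
        return 0;
    return (uint64_t)d == v;
}
```

With round-to-nearest, 2^53 + 1 rounds to 2^53, so it is the first integer that does not survive the round trip.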
> >>>>>>
> >>>>>> I'm willing to test the code if someone who's more adept at coding
> >>>>>> and building will take on the challenge.
> >>>>>>
> >>>>>> It looks to me like the type mismatch will have to be fixed in a
> >>>>>> few places, such as CALC_NETSTAT, and we'll have to add an unsigned
> >>>>>> long long to g_val_t too. Those are the ones I can see so far.
> >>>>>>
> >>>>>> David Wong
> >>>>>> Senior Systems Engineer
> >>>>>> Management Dynamics, Inc.
> >>>>>> Phone: 201-804-6127
> >>>>>> [EMAIL PROTECTED]
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Martin Knoblauch [mailto:[EMAIL PROTECTED] 
> >>>>>> Sent: Wednesday, March 28, 2007 12:00 PM
> >>>>>> To: David Wong; [EMAIL PROTECTED]
> >>>>>> Subject: Re: [Ganglia-general] Help! I have a petabyte/s network
> >>>>>>
> >>>>>> David,
> >>>>>>
> >>>>>>  as far as I remember, the AIX metrics code had an overflow/wrap-around
> >>>>>> problem prior to 3.0.4. Maybe the fixes are not thorough enough.
> >>>>>>
> >>>>>>  The packets/sec are of course less affected.
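Why packets/sec are less affected follows from simple arithmetic: a counter wraps after 2^bits units, and bytes grow much faster than packets. The sketch assumes a 1 Gbit/s link (125,000,000 bytes/s) carrying full-size 1500-byte packets; both figures are illustrative, not taken from this thread, and `seconds_to_wrap()` is a hypothetical helper.

```c
#include <assert.h>

/* Seconds until a counter `bits` wide wraps when it grows by `rate`
 * units per second (2^bits / rate). */
static double seconds_to_wrap(unsigned bits, double rate)
{
    assert(bits >= 1 && bits <= 64 && rate > 0.0);
    double span = 1.0;
    for (unsigned i = 0; i < bits; i++)
        span *= 2.0;               /* exact: powers of two fit in a double */
    return span / rate;
}
```

Under these assumptions, a 32-bit byte counter wraps after about 34 seconds at line rate, possibly more than once per sampling interval, while a 32-bit packet counter takes about 14 hours. A 64-bit byte counter would take thousands of years.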
> >>>>>>
> >>>>>> Cheers
> >>>>>> Martin
> >
> >
> > ------------------------------------------------------
> > Martin Knoblauch
> > email: k n o b i AT knobisoft DOT de
> > www:   http://www.knobisoft.de
> >
> >   
> 


