>>> On 9/18/2008 at  1:37 AM, in message
<[EMAIL PROTECTED] 
, <[EMAIL PROTECTED]> wrote:

> 
>> -----Original Message-----
>> From: [EMAIL PROTECTED] 
>> [mailto:[EMAIL PROTECTED] On 
>> Behalf Of Brad Nicholes
>> Sent: 17 September 2008 17:15
>> To: Pocock, Daniel: IT (LDN); ganglia-developers@lists.sourceforge.net 
>> Subject: Re: [Ganglia-developers] metrics-per-host?
>> 
>> 
>>  >>> On 9/17/2008 at  9:13 AM, in message 
>> <[EMAIL PROTECTED] 
>> ANET.BARCAPINT.COM
>> 
>> , <[EMAIL PROTECTED]> wrote:
>> 
>> > 
>> >> -----Original Message-----
>> >> From: Brad Nicholes [mailto:[EMAIL PROTECTED] 
>> >> Sent: 17 September 2008 15:53
>> >> To: Pocock, Daniel: IT (LDN);
>> ganglia-developers@lists.sourceforge.net 
>> >> Subject: RE: [Ganglia-developers] metrics-per-host?
>> >> 
>> >> 
>> >>  >>> On 9/17/2008 at 8:23 AM, in message 
>> >> <[EMAIL PROTECTED] 
>> >> ANET.BARCAPINT.COM
>> >> 
>> >> , <[EMAIL PROTECTED]> wrote:
>> >> 
>> >> >> If you run gmond with -d 10, the debug output will show you
>> >> >> everything that gmond sent as well as everything that gmond
>> >> >> received.  If you capture that output and then do some analysis on
>> >> >> it, is gmond sending all of the metrics?  If so, then is it also
>> >> >> receiving back everything that it sent?  If
>> >> > 
>> >> > 
>> >> > 
>> >> > tcpdump shows me that the packets are being transmitted on the
>> >> > loopback interface
>> >> > 
>> >> > In the debug output, I see messages like this for all 400 metrics:
>> >> > 
>> >> >         metric 'test0000399' being collected now
>> >> >         metric 'test0000399' has value_threshold 1.000000
>> >> >         sending metadata for metric: test0000399
>> >> >         sent message 'test0000399' of length 76 with 0 errors
>> >> > 
>> >> > 
>> >> > However, I see this message for some metrics and not others:
>> >> > 
>> >> > ***Allocating metadata packet for host--localhost.localdomain-- and metric --test0000029-- ****
>> >> > saving metadata for metric: test0000029 host: localhost.localdomain
>> >> > ***Allocating value packet for host--localhost.localdomain-- and metric --test0000029-- ****
>> >> > 
>> >> > Within 1-2 seconds of starting gmond, tcpdump reports 943 packets
>> >> > sent on the loopback interface - that appears to include 400
>> >> > metadata packets and 400 data packets for my test metric
>> >> > 
>> >> 
>> >> It is actually the value packets that I am more interested in rather
>> >> than the metadata packets.  You will probably have to let gmond run
>> >> for a few minutes to allow it to sync up on all of the metadata
>> >> packets and just start processing value packets.  At that point you
>> >> should see x number of value packets sent and x number of value
>> >> packets received.
>> > 
>> > 
>> > It has been running for about 30 minutes now.  All the debug output
>> > has been sent to a file.
>> > 
>> > If I grep for `value packet' in the file, I don't find value packets
>> > for all of the metrics.
>> > 
>> > Given that it is operating on the loopback interface, packet loss 
>> > shouldn't be an issue - any other things I should check?
>> 
>> When you say that you don't see value packets for all of the metrics
>> is that sent and/or received or just received?  In other words are
>> value packets being sent for all of the metrics but not received or
>> are there some metric value packets that just aren't being sent?
> 
> They are sent for all metrics, but they don't appear to be received for
> all metrics

This indicates to me that the most likely place to start looking for the
problem is the function process_udp_recv_channel() in gmond.c.  The first
thing to look at would be any failures that occur in this function.
Unfortunately, this function doesn't report many of its failures through debug
messages, so the first step would be to add either more debug messages
or just printf() statements.  If you just add printf() statements, then you
should be able to run gmond with -d1 to see the printf() output while
eliminating all of the rest of the -d10 debug messages.  Of course, the printf()
statements would just be temporary, for debugging purposes.
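
As a rough illustration of the kind of temporary instrumentation described
above, the sketch below wraps failure reporting in a small helper.  The names
format_failure() and report_failure() are hypothetical and are not part of
gmond; the step label and status value are whatever the caller chooses to pass.
It writes to stderr rather than stdout so that messages are not held back by
buffering, which is a convenience assumption rather than anything gmond
requires.

```c
#include <stdio.h>

/* Hypothetical helper, not part of gmond: build a one-line failure message.
 * 'step' is a free-form label chosen by the caller and 'status' is whatever
 * return code the failing call produced.  Returns snprintf()'s result. */
int format_failure(char *buf, size_t len, const char *step, int status)
{
    return snprintf(buf, len,
                    "process_udp_recv_channel: %s failed (status %d)",
                    step, status);
}

/* Temporary debugging aid: print the message to stderr immediately. */
void report_failure(const char *step, int status)
{
    char buf[256];

    format_failure(buf, sizeof(buf), step, status);
    fprintf(stderr, "%s\n", buf);
}
```

Each early return or error branch in the function could then get a call such
as report_failure("receive", status), where status is the value already being
checked at that point.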

What you would be looking for are any failures in the function.  If you don't
see any failures, then we have to start looking at the apr_poll() functions,
which is going to get a little more tricky.  If there are failures, then we just
need to resolve them.
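
To put numbers on the sent-versus-received comparison from earlier in the
thread, a small throwaway program can count matching lines in the captured
-d 10 output.  This is only a sketch: count_lines_containing() is a
hypothetical helper, and the search strings would be taken from the debug
excerpts quoted above ("sent message" and "Allocating value packet"), so they
may need adjusting if the real output differs.  The "sent message" lines may
cover metadata as well as values, so the two totals are a rough guide rather
than an exact match.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count the lines in 'fp' that contain 'needle'.
 * Lines longer than the buffer are read in pieces, which is good enough
 * for a quick look at a debug log. */
long count_lines_containing(FILE *fp, const char *needle)
{
    char line[4096];
    long count = 0;

    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strstr(line, needle) != NULL)
            count++;
    }
    return count;
}
```

Calling it once with "sent message" and once more, after rewind(), with
"Allocating value packet" on the same open file gives the two totals to
compare.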

I would go ahead and look into this myself, but I am completely swamped at the
moment and won't have any time available in the near future.  So if you would
like to take this, Daniel, or if anybody else on the list wants it, feel free.  It
actually sounds like a fun challenge, and it would be great to get more people
familiar with the guts of gmond anyway.

Brad

