OMG, send_metadata_interval wasn't set > 0. That *looks* like the very likely culprit, and the obvious thing I said I must have been missing.
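For the archives: with send_metadata_interval left at the default of 0, gmond announces metric metadata only once, at startup, so in a unicast setup a collector that restarts (or misses those initial packets) keeps showing heartbeats but silently drops the metric values it has no metadata for - which matches exactly what we saw. If I understand it right, the fix goes in the globals section of gmond.conf on each sending node; something like this (the 60-second interval is just an example value, not a recommendation):

    globals {
      /* re-announce metric metadata every 60 seconds so a restarted
         collector can re-learn the metric definitions */
      send_metadata_interval = 60
    }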
I will try setting this right and see if it fixes the problem. If not, I will post gmond.conf, and if yes, I will report back to mark the problem as solved. Thank you very much. (I have also put a minimal sketch of the unicast channel config at the bottom of this mail, below the quoted thread, for anyone searching the archives later.)

On Wed, Apr 27, 2011 at 11:53 PM, Bernard Li <bern...@vanhpc.org> wrote:
> Hi Michael:
>
> Can you please post gmond.conf (post a diff from the stock config if
> it's too big, or to pastebin.com) of one of the hosts and of the
> collector?
>
> Also, did you set send_metadata_interval > 0?
>
> Cheers,
>
> Bernard
>
> On Wed, Apr 27, 2011 at 12:31 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>> Now that I have observed the system for a few hours more, I think I
>> can generalize a bit. As to the 'df -h' output - it is identical,
>> save for a minimal difference in the space actually free/used.
>>
>> However, let me describe the setup in more detail.
>>
>> There are 5 hosts in one datacenter, which comprise the cluster being
>> monitored and run gmond, and one in another datacenter, which runs
>> the web frontend and gmetad.
>>
>> Let's say those are host1-host5, and then host-web.
>>
>> The 5 hosts in question are just idling before being put under
>> production load, and so most of the metrics are near zero.
>>
>> host1 is the collector - the other 4 hosts report via unicast to it.
>> host-web then polls it.
>>
>> host-web (gmetad) <---------> host1 (gmond)
>>                                 ^---------host2 (gmond)
>>                                 ^---------host3 (gmond)
>>                                 ^---------host4 (gmond)
>>                                 ^---------host5 (gmond)
>>
>> Something like this.
>>
>> Now, during the day in this timezone, while some preproduction work
>> was being done on hosts 1-5, all of them but the problematic host3
>> had all of the default metrics reported and graphed. That was when I
>> first wrote to the list.
>>
>> However, now that it is close to midnight here and almost everyone
>> has gone home, I find that the ONLY host that has all of the default
>> metrics is host1, the collector (which also listens), while the
>> others lost everything but up/down state. Like this (physical view):
>>
>> host5
>> Last heartbeat 10s
>> cpu: 0.00G mem: 0.00G
>>
>> host4
>> Last heartbeat 10s
>> cpu: 0.00G mem: 0.00G
>>
>> host3
>> Last heartbeat 1s
>> cpu: 0.00G mem: 0.00G
>>
>> host2
>> Last heartbeat 8s
>> cpu: 1.95G (4) mem: 0.00G
>>
>> host1
>> 0.14
>> Last heartbeat 1s
>> cpu: 1.95G (4) mem: 7.80G
>>
>> So, purely speculating, I could attribute this metric loss to all
>> metrics being under value_threshold, but... what about
>> time_threshold? And why is the collector host holding onto its
>> metrics while the others lost theirs but keep the heartbeats?
>>
>> I feel confused, which is probably an indicator that I am missing
>> something obvious...
>>
>> On Wed, Apr 27, 2011 at 9:56 PM, Bernard Li <bern...@vanhpc.org> wrote:
>>> Hi Michael:
>>>
>>> You can try looking at the XML representation of the metric data
>>> from each of your gmonds to figure out what's different between
>>> them. You can accomplish this by doing:
>>>
>>> nc localhost 8649 (assuming you are using the default gmond port of 8649)
>>>
>>> This should spit out all the metric data of all the hosts gmond is
>>> aware of.
>>>
>>> What's the output of `df -h` on both systems, do they look different?
>>>
>>> Cheers,
>>>
>>> Bernard
>>>
>>> On Wed, Apr 27, 2011 at 9:16 AM, Michael Bravo <mike.br...@gmail.com> wrote:
>>>> More precisely, some metrics seem to be collected, and periodically
>>>> sent, such as:
>>>>
>>>> metric 'disk_free' being collected now
>>>> Counting device /dev/root (6.21 %)
>>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>>> metric 'disk_free' has value_threshold 1.000000
>>>> metric 'part_max_used' being collected now
>>>> Counting device /dev/root (6.21 %)
>>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>>> metric 'part_max_used' has value_threshold 1.000000
>>>>
>>>> and then (I think around time_threshold expiration):
>>>>
>>>> sent message 'disk_free' of length 52 with 0 errors
>>>> sent message 'part_max_used' of length 52 with 0 errors
>>>>
>>>> Also, on startup, all of these metrics seem to be prepared correctly:
>>>>
>>>> sending metadata for metric: disk_free
>>>> sent message 'disk_free' of length 52 with 0 errors
>>>> sending metadata for metric: part_max_used
>>>> sent message 'part_max_used' of length 52 with 0 errors
>>>>
>>>> etc., and so on.
>>>>
>>>> But none of these metrics appear in the node report at the web
>>>> frontend, as I listed in the original message.
>>>>
>>>> Where is the "Local disk: unknown" part coming from, then?
>>>>
>>>> What is most baffling is that this problem host is completely
>>>> identical to the one next to it, which has zero problems.
>>>>
>>>> On Wed, Apr 27, 2011 at 7:30 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>>>>> I did try that, in non-daemonized mode; however, there weren't any
>>>>> evident errors popping up (and there's a lot of information coming
>>>>> up that way), so perhaps I need an idea of what to look for.
>>>>>
>>>>> On Wed, Apr 27, 2011 at 7:24 PM, Ron Cavallo <ron_cava...@s5a.com> wrote:
>>>>>> Have you tried starting up gmond on the affected server with
>>>>>> debug set to 10 in gmond.conf? This may show some of the
>>>>>> collection problems it's having more specifically...
>>>>>>
>>>>>> -RC
>>>>>>
>>>>>> Ron Cavallo
>>>>>> Sr. Director, Infrastructure
>>>>>> Saks Fifth Avenue / Saks Direct
>>>>>> 12 East 49th Street
>>>>>> New York, NY 10017
>>>>>> 212-451-3807 (O)
>>>>>> 212-940-5079 (fax)
>>>>>> 646-315-0119 (C)
>>>>>> www.saks.com
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Michael Bravo [mailto:mike.br...@gmail.com]
>>>>>> Sent: Wednesday, April 27, 2011 11:14 AM
>>>>>> To: ganglia-general
>>>>>> Subject: [Ganglia-general] two identical hosts, one is having
>>>>>> trouble with gmond
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Here is a strange occurrence. I have two (in fact, more than two,
>>>>>> but let's consider just a pair) identical servers running
>>>>>> identical setups - identical OS, identical gmond with identical
>>>>>> config files, identical disks, identical everything. However, one
>>>>>> of those servers is perfectly fine, and the other one has trouble
>>>>>> reporting default metrics.
>>>>>>
>>>>>> Here's what the "normal" one shows in node view:
>>>>>>
>>>>>> xx.xx.xx.172
>>>>>>
>>>>>> Location: Unknown
>>>>>> Cluster local time: Wed Apr 27 19:05:32 2011. Last heartbeat
>>>>>> received 5 seconds ago.
>>>>>> Uptime: 9 days, 9:22:38
>>>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>>>
>>>>>> Hardware
>>>>>> CPUs: 4 x 1.95 GHz
>>>>>> Memory (RAM): 7.80 GB
>>>>>> Local Disk: Using 16.532 of 142.835 GB
>>>>>> Most Full Disk Partition: 11.6% used.
>>>>>>
>>>>>> Software
>>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>>> Booted: April 18, 2011, 9:42 am
>>>>>> Uptime: 9 days, 9:22:38
>>>>>> Swap: Using 0.0 of 12001.6 MB swap.
>>>>>>
>>>>>> And here's what the "problem" one shows:
>>>>>>
>>>>>> xx.xx.xx.171
>>>>>>
>>>>>> Location: Unknown
>>>>>> Cluster local time: Wed Apr 27 19:07:32 2011. Last heartbeat
>>>>>> received 10 seconds ago.
>>>>>> Uptime: 9 days, 9:20:01
>>>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>>>
>>>>>> Hardware
>>>>>> CPUs: 4 x 1.95 GHz
>>>>>> Memory (RAM): 7.80 GB
>>>>>> Local Disk: Unknown
>>>>>> Most Full Disk Partition: 6.2% used.
>>>>>>
>>>>>> Software
>>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>>> Booted: April 18, 2011, 9:47 am
>>>>>> Uptime: 9 days, 9:20:01
>>>>>> Swap: Using 12001.6 of 12001.6 MB swap.
>>>>>>
>>>>>> Both are running gmond 3.1.7 and talk to a third host which also
>>>>>> runs gmond 3.1.7 (and which is polled by the web frontend host
>>>>>> running gmetad 3.1.7).
>>>>>>
>>>>>> At a glance, something is confusing gmond on the problem server,
>>>>>> so it mismatches disk partitions, or something along those lines.
>>>>>>
>>>>>> As a result, the problem node does not report all of the default
>>>>>> metrics, and those it does report are somewhat off-kilter, as you
>>>>>> can see (unknown local disk?).
>>>>>>
>>>>>> Any idea what might be going wrong and/or how to pinpoint the
>>>>>> problem?
>>>>>>
>>>>>> --
>>>>>> Michael Bravo
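P.S. For completeness, here is the promised minimal sketch of the unicast channel setup described upthread. The hostnames and port are just the placeholders from my earlier mail, so adjust to taste:

    /* on host2..host5 (and on host1 itself, so its own metrics
       get collected too) - send everything to the collector */
    udp_send_channel {
      host = host1
      port = 8649
    }

    /* on host1, the collector only - receive from the other nodes
       and answer the gmetad TCP polls */
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }

A quick sanity check, as Bernard suggested, is to dump the collector's XML and see which hosts and metrics it actually knows about:

    nc host1 8649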