OMG, send_metadata_interval wasn't set > 0. That *looks* like the very likely culprit, and the obvious thing I said I must have been missing.
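For the archives: with send_metadata_interval left at the default of 0, gmond announces metric metadata only once, at startup, so in a unicast setup a collector that restarts (or misses those initial packets) keeps showing heartbeats but silently drops the metric values it has no metadata for - which matches exactly what we saw. If I understand it right, the fix goes in the globals section of gmond.conf on each sending node; something like this (the 60-second interval is just an example value, not a recommendation):

    globals {
      /* re-announce metric metadata every 60 seconds so a restarted
         collector can re-learn the metric definitions */
      send_metadata_interval = 60
    }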
I will try setting this right and see if it fixes the problem. If not, I will post gmond.conf, and if yes, I will report back to mark the problem as solved. Thank you very much. (I have also put a minimal sketch of the unicast channel config at the bottom of this mail, below the quoted thread, for anyone searching the archives later.)

On Wed, Apr 27, 2011 at 11:53 PM, Bernard Li <bern...@vanhpc.org> wrote:
> Hi Michael:
>
> Can you please post gmond.conf (post a diff from the stock config if
> it's too big, or to pastebin.com) of one of the hosts and of the
> collector?
>
> Also, did you set send_metadata_interval > 0?
>
> Cheers,
>
> Bernard
>
> On Wed, Apr 27, 2011 at 12:31 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>> Now that I have observed the system for a few hours more, I think I
>> can generalize a bit. As to the 'df -h' output - it is identical,
>> save for a minimal difference in the space actually free/used.
>>
>> However, let me describe the setup in more detail.
>>
>> There are 5 hosts in one datacenter, which comprise the cluster being
>> monitored and run gmond, and one in another datacenter, which runs
>> the web frontend and gmetad.
>>
>> Let's say those are host1-host5, and then host-web.
>>
>> The 5 hosts in question are just idling before being put under
>> production load, and so most of the metrics are near zero.
>>
>> host1 is the collector - the other 4 hosts report via unicast to it.
>> host-web then polls it.
>>
>> host-web (gmetad) <---------> host1 (gmond)
>>                                 ^---------host2 (gmond)
>>                                 ^---------host3 (gmond)
>>                                 ^---------host4 (gmond)
>>                                 ^---------host5 (gmond)
>>
>> Something like this.
>>
>> Now, during the day in this timezone, while some preproduction work
>> was being done on hosts 1-5, all of them but the problematic host3
>> had all of the default metrics reported and graphed. That was when I
>> first wrote to the list.
>>
>> However, now that it is close to midnight here and almost everyone
>> has gone home, I find that the ONLY host that has all of the default
>> metrics is host1, the collector (which also listens), while the
>> others lost everything but up/down state. Like this (physical view):
>>
>> host5
>> Last heartbeat 10s
>> cpu: 0.00G mem: 0.00G
>>
>> host4
>> Last heartbeat 10s
>> cpu: 0.00G mem: 0.00G
>>
>> host3
>> Last heartbeat 1s
>> cpu: 0.00G mem: 0.00G
>>
>> host2
>> Last heartbeat 8s
>> cpu: 1.95G (4) mem: 0.00G
>>
>> host1
>> 0.14
>> Last heartbeat 1s
>> cpu: 1.95G (4) mem: 7.80G
>>
>> So, purely speculating, I could attribute this metric loss to all
>> metrics being under value_threshold, but... what about
>> time_threshold? And why is the collector host holding onto its
>> metrics while the others lost theirs but keep the heartbeats?
>>
>> I feel confused, which is probably an indicator that I am missing
>> something obvious...
>>
>> On Wed, Apr 27, 2011 at 9:56 PM, Bernard Li <bern...@vanhpc.org> wrote:
>>> Hi Michael:
>>>
>>> You can try looking at the XML representation of the metric data
>>> from each of your gmonds to figure out what's different between
>>> them. You can accomplish this by doing:
>>>
>>> nc localhost 8649 (assuming you are using the default gmond port of 8649)
>>>
>>> This should spit out all the metric data of all the hosts gmond is
>>> aware of.
>>>
>>> What's the output of `df -h` on both systems, do they look different?
>>>
>>> Cheers,
>>>
>>> Bernard
>>>
>>> On Wed, Apr 27, 2011 at 9:16 AM, Michael Bravo <mike.br...@gmail.com> wrote:
>>>> More precisely, some metrics seem to be collected, and periodically
>>>> sent, such as:
>>>>
>>>> metric 'disk_free' being collected now
>>>> Counting device /dev/root (6.21 %)
>>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>>> metric 'disk_free' has value_threshold 1.000000
>>>> metric 'part_max_used' being collected now
>>>> Counting device /dev/root (6.21 %)
>>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>>> metric 'part_max_used' has value_threshold 1.000000
>>>>
>>>> and then (I think around time_threshold expiration):
>>>>
>>>> sent message 'disk_free' of length 52 with 0 errors
>>>> sent message 'part_max_used' of length 52 with 0 errors
>>>>
>>>> Also, on startup, all of these metrics seem to be prepared correctly:
>>>>
>>>> sending metadata for metric: disk_free
>>>> sent message 'disk_free' of length 52 with 0 errors
>>>> sending metadata for metric: part_max_used
>>>> sent message 'part_max_used' of length 52 with 0 errors
>>>>
>>>> etc., and so on.
>>>>
>>>> But none of these metrics appear in the node report at the web
>>>> frontend, as I listed in the original message.
>>>>
>>>> Where is the "Local disk: unknown" part coming from, then?
>>>>
>>>> What is most baffling is that this problem host is completely
>>>> identical to the one next to it, which has zero problems.
>>>>
>>>> On Wed, Apr 27, 2011 at 7:30 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>>>>> I did try that, in non-daemonized mode; however, there weren't any
>>>>> evident errors popping up (and there's a lot of information coming
>>>>> up that way), so perhaps I need an idea of what to look for.
>>>>>
>>>>> On Wed, Apr 27, 2011 at 7:24 PM, Ron Cavallo <ron_cava...@s5a.com> wrote:
>>>>>> Have you tried starting up gmond on the affected server with
>>>>>> debug set to 10 in gmond.conf? This may show some of the
>>>>>> collection problems it's having more specifically...
>>>>>>
>>>>>> -RC
>>>>>>
>>>>>> Ron Cavallo
>>>>>> Sr. Director, Infrastructure
>>>>>> Saks Fifth Avenue / Saks Direct
>>>>>> 12 East 49th Street
>>>>>> New York, NY 10017
>>>>>> 212-451-3807 (O)
>>>>>> 212-940-5079 (fax)
>>>>>> 646-315-0119 (C)
>>>>>> www.saks.com
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Michael Bravo [mailto:mike.br...@gmail.com]
>>>>>> Sent: Wednesday, April 27, 2011 11:14 AM
>>>>>> To: ganglia-general
>>>>>> Subject: [Ganglia-general] two identical hosts, one is having
>>>>>> trouble with gmond
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Here is a strange occurrence. I have two (in fact, more than two,
>>>>>> but let's consider just a pair) identical servers running
>>>>>> identical setups - identical OS, identical gmond with identical
>>>>>> config files, identical disks, identical everything. However, one
>>>>>> of those servers is perfectly fine, and the other one has trouble
>>>>>> reporting default metrics.
>>>>>>
>>>>>> Here's what the "normal" one shows in node view:
>>>>>>
>>>>>> xx.xx.xx.172
>>>>>>
>>>>>> Location: Unknown
>>>>>> Cluster local time: Wed Apr 27 19:05:32 2011. Last heartbeat
>>>>>> received 5 seconds ago.
>>>>>> Uptime: 9 days, 9:22:38
>>>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>>>
>>>>>> Hardware
>>>>>> CPUs: 4 x 1.95 GHz
>>>>>> Memory (RAM): 7.80 GB
>>>>>> Local Disk: Using 16.532 of 142.835 GB
>>>>>> Most Full Disk Partition: 11.6% used.
>>>>>>
>>>>>> Software
>>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>>> Booted: April 18, 2011, 9:42 am
>>>>>> Uptime: 9 days, 9:22:38
>>>>>> Swap: Using 0.0 of 12001.6 MB swap.
>>>>>>
>>>>>> And here's what the "problem" one shows:
>>>>>>
>>>>>> xx.xx.xx.171
>>>>>>
>>>>>> Location: Unknown
>>>>>> Cluster local time: Wed Apr 27 19:07:32 2011. Last heartbeat
>>>>>> received 10 seconds ago.
>>>>>> Uptime: 9 days, 9:20:01
>>>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>>>
>>>>>> Hardware
>>>>>> CPUs: 4 x 1.95 GHz
>>>>>> Memory (RAM): 7.80 GB
>>>>>> Local Disk: Unknown
>>>>>> Most Full Disk Partition: 6.2% used.
>>>>>>
>>>>>> Software
>>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>>> Booted: April 18, 2011, 9:47 am
>>>>>> Uptime: 9 days, 9:20:01
>>>>>> Swap: Using 12001.6 of 12001.6 MB swap.
>>>>>>
>>>>>> Both are running gmond 3.1.7 and talk to a third host which also
>>>>>> runs gmond 3.1.7 (and which is polled by the web frontend host
>>>>>> running gmetad 3.1.7).
>>>>>>
>>>>>> At a glance, something is confusing gmond on the problem server,
>>>>>> so it mismatches disk partitions, or something along those lines.
>>>>>>
>>>>>> As a result, the problem node does not report all of the default
>>>>>> metrics, and those it does report are somewhat off-kilter, as you
>>>>>> can see (unknown local disk?).
>>>>>>
>>>>>> Any idea what might be going wrong and/or how to pinpoint the
>>>>>> problem?
>>>>>>
>>>>>> --
>>>>>> Michael Bravo
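P.S. For completeness, here is the promised minimal sketch of the unicast channel setup described upthread. The hostnames and port are just the placeholders from my earlier mail, so adjust to taste:

    /* on host2..host5 (and on host1 itself, so its own metrics
       get collected too) - send everything to the collector */
    udp_send_channel {
      host = host1
      port = 8649
    }

    /* on host1, the collector only - receive from the other nodes
       and answer the gmetad TCP polls */
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }

A quick sanity check, as Bernard suggested, is to dump the collector's XML and see which hosts and metrics it actually knows about:

    nc host1 8649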