Brad Nicholes <[EMAIL PROTECTED]> wrote:
> I am wondering if this might be an issue with the way that the
> metadata for a metric is being sent. The unique attribute about
> this is that cpu_num is a collect_once metric. This means that if
> the data value is sent but one of the gmond's in the cluster has not
> received the metadata yet, the value may get ignored when the XML is
> written. One interesting test to try to validate this theory would
> be to set the send_metadata_interval to something greater than zero
> even in a multicast environment. Then run your same test and see if
> the same problem shows up or goes away. If the problem goes away,
> then we might have to rework how the metadata data is being
> requested and sent in a multicast environment.

I tried the experiment, and it still failed.

Start with a cluster that's been up for days, except web8 recently
rebooted:

  >./forhosts 'nc localhost 8649 | grep cpu_num | wc -l' `sl group web-gen:`
  web5    6
  web8    1
  web9    7
  web10   1
  web11   8
  web12   8
  web13   8

Now, I changed all their gmond.conf files to

  send_metadata_interval = 10

I restarted the gmond's on web9, web10, and web11.

If this were the solution, you'd expect all nodes to quickly get at
least those three nodes' cpu_num metrics. After three minutes:

  >./forhosts 'nc localhost 8649 | grep cpu_num | wc -l' `sl group web-gen:`
  web5    7
  web8    1
  web9    1
  web10   0
  web11   2
  web12   8
  web13   8

After ten more minutes:

  web5    7
  web8    1
  web9    3
  web10   1
  web11   4
  web12   8
  web13   8

web8 and web10 clearly still haven't seen all of them yet.

Now, restart gmond on web5 and web12 (so all but web8 and web13 have
the new setting) and wait five minutes...

  web5    6
  web8    1
  web9    6
  web10   1
  web11   6
  web12   5
  web13   8

Now, restart the entire cluster in order by hostname. After one minute:

  web5    7
  web8    1
  web9    5
  web10   1
  web11   3
  web12   2
  web13   1

After ten minutes:

  web5    7
  web8    1
  web9    5
  web10   1
  web11   3
  web12   2
  web13   1

Something is weird in web8 and web10 that I'll have to investigate.
As for the rest, they show exactly the expected pattern from before I
set the metadata interval: each one sees itself and the ones that
started after it, in a descending pattern, 7 (6) 5 (4) 3 2 1.

  -- Cos

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
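
For anyone who wants to check the "7 (6) 5 (4) 3 2 1" reasoning, here is a minimal sketch of the pattern, not of gmond itself. It assumes a simplified model in which each host announces cpu_num exactly once, at startup, and only the gmonds already running (the sender included) record it. The function name simulate_collect_once and the dictionaries are made up for illustration; the observed counts are the ones from the final ten-minute listing in this message.

```python
def simulate_collect_once(start_order):
    """Count how many hosts' cpu_num each host would hold.

    Simplified model (an assumption, not gmond's real code path): a host
    sends cpu_num once when it starts, and only hosts that are already
    running, including the sender, ever receive it.
    """
    heard = {}  # host -> set of hosts whose cpu_num it has received
    for starter in start_order:
        heard[starter] = set()
        # The one-time announcement reaches every gmond that is up now.
        for listener in heard:
            heard[listener].add(starter)
    return {host: len(seen) for host, seen in heard.items()}


# Restart order used in the last experiment (by hostname).
hosts = ["web5", "web8", "web9", "web10", "web11", "web12", "web13"]

# Counts reported ten minutes after the full restart.
observed = {"web5": 7, "web8": 1, "web9": 5, "web10": 1,
            "web11": 3, "web12": 2, "web13": 1}

expected = simulate_collect_once(hosts)
deviating = [h for h in hosts if expected[h] != observed[h]]

for h in hosts:
    print(f"{h:6} expected {expected[h]}  observed {observed[h]}")
print("hosts that do not fit the model:", deviating)
```

Under that model the only hosts that disagree with the listing are web8 and web10, which are the two counts shown in parentheses in the pattern.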

