Hi Chuck,
See below...

Chuck Simmons wrote:
> The number of cpus does get sorted out, but I don't believe that
> restarting 'gmond' is a solution. The problem occurs after restarting
> a number of 'gmond' processes, and the problem is caused because
> 'gmond' is not reporting the information. Does 'gmond' maintain a
> timestamp on disk as to when it last reported the number of cpus and
> insist on waiting sufficiently long to report again? Does the
> collective distributed memory of the system remember when the number
> of cpus was last reported but not remember what the last reported
> value was? Is there any chance that anyone can give me hints to how
> the code works without me having to read the code and reverse engineer
> the intent?

The reporting interval for the number of CPUs is defined in /etc/gmond.conf. For example:

collection_group {
  collect_once = yes
  time_threshold = 1800
  metric {
    name = "cpu_num"
  }
}

The above defines that the number of CPUs is collected once, at gmond startup, and reported every 1800 seconds.

Your problem occurs because gmond doesn't save any data on disk, only in memory. This means that if you're using a single gmond aggregator (in unicast mode) and that aggregator gets restarted, it will not receive another report of the number of CPUs until 1800 seconds have elapsed since the previous report.

The case of multicast is a more interesting one, since every node holds data for all nodes on the multicast channel. The question here is whether an update with a newer timestamp overrides all previous XML data for the host. I don't think that's the case; it seems more likely that only existing data is overwritten... but then, I don't use multicast, so you may qualify this answer as throwing useless, obvious crap your way.

Generally speaking, there are two cases in which a host reports a metric via its send_channel:

1. When a time_threshold expires.
2. When a value_threshold is exceeded.
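To illustrate case 2, here's a sketch of a collection_group that also sends on value change (the metric and threshold here are the stock load_one example, not anything from your config):

```
collection_group {
  collect_every = 20
  time_threshold = 90
  metric {
    name = "load_one"
    value_threshold = "1.0"
  }
}
```

With this, load_one is sampled every 20 seconds and sent either when 90 seconds have passed since the last send or when the value has changed by more than 1.0 since then. cpu_num, with collect_once = yes, only ever hits case 1.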
You're welcome to read the code for more insight, but a simple telnet to a predefined TCP channel would probably be quicker. You could just look at the XML data and compare pre-update and post-update values (yes, you'll need to take note of the timestamps - again, in the XML).

> I understand that I can group nodes via /etc/gmond.conf. The question
> is, once I have screwed up the configuration, how do I recover from
> that screw up? I have restarted various gmetad's and various
> gmond's. The grouping is still incorrect. Exactly which gmetad's and
> gmond's do I have to shut down when. And, again, my real question is
> about understanding how the code works -- how the distributed memory
> works.

As far as I know, you cannot recover from a configuration error unless you've made sure host_dmax was set to a fairly small, non-zero value.
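Something along these lines is what I'd do for the XML comparison. Against a live gmond you'd read from its TCP channel (telnet somehost 8649 - 8649 being the default tcp_accept_channel port; adjust to your setup). A canned XML fragment stands in for the live stream here, with made-up values, so the pipeline can be seen end to end:

```shell
# Canned stand-in for the XML a live gmond would emit on its TCP channel.
xml='<HOST NAME="node1" REPORTED="1182600000">
<METRIC NAME="cpu_num" VAL="4" TYPE="uint16" TN="120" TMAX="1800"/>
</HOST>'

# TN is seconds since the metric was last updated; TMAX mirrors
# time_threshold. Comparing VAL and TN before and after your restarts
# shows whether cpu_num was actually re-reported or has merely aged.
echo "$xml" | grep 'NAME="cpu_num"' \
  | sed -E 's/.*VAL="([0-9]+)".*TN="([0-9]+)".*/cpu_num=\1 age=\2s/'
```

which prints cpu_num=4 age=120s for the fragment above. (And for reference on host_dmax: it goes in the globals section of gmond.conf, e.g. host_dmax = 300 to expire a host's data 300 seconds after it goes silent - 300 is just an example value.)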