Hi Chuck,

See below...



Chuck Simmons wrote:

> The number of cpus does get sorted out, but I don't believe that
> restarting 'gmond' is a solution.  The problem occurs after restarting
> a number of 'gmond' processes, and the problem is caused because
> 'gmond' is not reporting the information.  Does 'gmond' maintain a
> timestamp on disk as to when it last reported the number of cpus and
> insist on waiting sufficiently long to report again?  Does the
> collective distributed memory of the system remember when the number
> of cpus was last reported but not remember what the last reported
> value was?  Is there any chance that anyone can give me hints to how
> the code works without me having to read the code and reverse engineer
> the intent?
>

The reporting interval for number of CPUs is defined within /etc/gmond.conf.
For example:

  collection_group {
    collect_once   = yes
    time_threshold = 1800
    metric {
      name = "cpu_num"
    }
  }

The above defines that the number of CPUs is collected once at the
startup of gmond and reported every 1800 seconds.
Your problem occurs because gmond doesn't save any data on disk, but
rather in memory. This means that if you're using a single gmond
aggregator (in unicast mode) and that aggregator gets restarted, it
will not receive another report of the number of CPUs until 1800
seconds have elapsed since the previous report.
The case of multicast is a more interesting one, since every node holds
data for all nodes on the multicast channel. The question here is
whether an update with a newer timestamp overrides all previous XML data
for the host. I don't think that's the case; it seems more likely that
only existing data is overwritten... but then, I don't use multicast, so
you may qualify this answer as throwing useless, obvious crap your way.

Generally speaking, there are 2 cases when a host reports a metric via
its send_channel:
1. When a time_threshold expires.
2. When a value_threshold is exceeded.
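Case 2 is what the stock gmond.conf uses for load averages: the metric is
collected often but only sent when the value moves enough. A sketch (the
numbers here are illustrative, not a recommendation):

  collection_group {
    collect_every  = 20
    time_threshold = 90
    metric {
      name            = "load_one"
      value_threshold = 1.0
    }
  }

Here load_one is sampled every 20 seconds, but a send is triggered only
when 90 seconds have passed since the last send, or the value has changed
by more than 1.0 since it was last sent.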

You're welcome to read the code for more insight, but a simple telnet to
a predefined TCP channel (port 8649 by default) would probably be
quicker. You could just look at the XML data and compare pre-update and
post-update values (yes, you'll need to take note of the timestamps -
again, in the XML).
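If you'd rather not eyeball the XML by hand, a few lines of Python can pull
out cpu_num and the TN ("seconds since last report") attribute per host.
The sample document below is illustrative, not captured from a real
cluster; the element and attribute names (HOST, METRIC, NAME, VAL, TN)
follow gmond's XML output:

```python
# Sketch: extract cpu_num per host from a gmond XML dump, e.g. one saved
# via "telnet <host> 8649 > dump.xml". SAMPLE stands in for a real dump.
import xml.etree.ElementTree as ET

SAMPLE = """\
<GANGLIA_XML VERSION="3.x" SOURCE="gmond">
 <CLUSTER NAME="example" LOCALTIME="1200000000">
  <HOST NAME="node1" REPORTED="1199999950" TN="50">
   <METRIC NAME="cpu_num" VAL="8" TYPE="uint16" TN="50" TMAX="1800"/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>"""

def cpu_num_by_host(xml_text):
    """Return {host_name: (cpu_num, seconds_since_report)} for each host."""
    result = {}
    for host in ET.fromstring(xml_text).iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "cpu_num":
                result[host.get("NAME")] = (metric.get("VAL"),
                                            int(metric.get("TN")))
    return result

# Compare this against the same dump taken after the restart.
print(cpu_num_by_host(SAMPLE))
```

Running it against a pre-restart and a post-restart dump makes it obvious
whether cpu_num disappeared or merely aged.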

> I understand that I can group nodes via /etc/gmond.conf.  The question
> is, once I have screwed up the configuration, how do I recover from
> that screw up?  I have restarted various gmetad's and various
> gmond's.  The grouping is still incorrect.  Exactly which gmetad's and
> gmond's do I have to shut down when.  And, again, my real question is
> about understanding how the code works -- how the distributed memory
> works.
>

As far as I know, you cannot recover from a configuration error unless
you've made sure host_dmax was set to a fairly small, non-zero value -
once a host hasn't been heard from for host_dmax seconds, gmond drops it
from its in-memory state.
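For reference, host_dmax lives in the globals section of gmond.conf;
something like this (the value is illustrative):

  globals {
    host_dmax = 600 /* seconds */
  }

With that in place, a misgrouped or stale host expires out of memory on
its own after 600 seconds of silence instead of lingering forever.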
