On Thu, 2006-03-23 at 15:47 -0800, Chuck Simmons wrote:
> Alex --
> 
> Thanks for the details.  Telneting to a gmond XML port to dump
> internal state is a nice debugging technique.
> 
> One of my problems is that I'm running a secondary daemon using the
> gmetric subroutine libraries, and it took me a while to realize that
> daemon is in some ways equivalent to 'gmond'.  In particular, I have
> to restart it in addition to 'gmond'.  The problem was immediately
> obvious once I used the telnet trick you mentioned.
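
For anyone following along, the "telnet trick" is just connecting to
gmond's xml port and reading the XML dump it returns.  The default port
is 8649, so assuming you haven't changed it in gmond.conf, something
like this against any node in the cluster should work:

  telnet somenode 8649

(somenode is just a placeholder for whichever node you want to inspect.)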

Metrics also have a dmax attribute that should force their removal from
memory once expired, but I don't remember if this is actually
implemented or not.
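
Since you are already using the gmetric libraries, note that the
command-line gmetric can set dmax (and tmax) per metric, roughly like
the sketch below.  The metric name and values are made up, and the
option names are worth double-checking against your gmetric version:

  gmetric --name my_metric --value 42 --type int32 --units count \
          --tmax 60 --dmax 180

With dmax set, the metric should age out of gmond's memory on its own
once it stops being refreshed.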

> So for the missing cpu data issue...  Let me write down what's
> happening, really slowly, to make sure I understand.  I'm running a
> multicast gmond on each cluster to aggregate data, implying that each
> node of the cluster eventually aggregates data about all other nodes
> of the same cluster.  I'm using a centralized gmetad to pull data from
> a node of each cluster.  Presumably 'gmetad' doesn't really remember a
> whole lot about the outlying nodes.

I am not really sure what you mean here, but gmetad basically keeps info
about all nodes in each cluster in memory, similar to how gmond keeps
info about all nodes in its cluster in memory.  Just like gmond, gmetad
also respects the dmax attribute.  If you don't have dmax set, or don't
want to wait that long, then you will have to restart gmetad as well.
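
For reference, gmetad learns about each cluster from a data_source line
in gmetad.conf, roughly like the sketch below (the host names are made
up; listing more than one node gives gmetad a fallback if the first one
is unreachable):

  data_source "staiu" staiu01:8649 staiu02:8649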

>     I go out to the cluster and kill gmond on each node.  Then I go
> through the nodes and start gmond back up on each node.  As each node
> starts, it broadcasts number of cpus throughout the cluster.  Thus,
> when I'm done restarting, one of the nodes (the first to restart)
> knows how many cpus each node has, but nodes that were restarted last
> don't have complete state information.

Not exactly true, see below.

> When I then restart 'gmetad' at the central location, it connects up
> to one of the nodes in the cluster, and if that node doesn't have full
> state information, gmetad incorrectly reports the number of cpus in the
> cluster.  [Since I am using a background process that gathers metrics
> separately from 'gmond' relatively frequently, this background process
> is probably causing all nodes in the cluster to know about all of the
> hosts in the cluster if not all of the metrics of all of the hosts in
> the cluster.]
>     This will eventually correct itself since all metrics are
> periodically rebroadcast.
>     Possible alternate fixes may include:
>         (1) When a node receives a broadcast from another node that it
> hasn't seen before, it may want to send its data back to the first
> node.  If I start node A and it broadcasts to an empty cluster, then I
> start node B and it broadcasts to A, then it might be nice if node A
> sends data back to B because it can reasonably infer that B doesn't
> have A's state and that B should have A's state.

I haven't checked the gmond sources lately, but this is exactly what it
was designed to do.  Anytime gmond sees data from a node that it hasn't
seen before, it assumes that node doesn't know anything about it either,
and sends a complete set of its own metrics out on the multicast
address.  This can actually cause part of the problem, especially if you
restart gmond on a lot of nodes all at the same time, basically because
multicast is UDP based and therefore does not have guaranteed packet
delivery.  I think during this burst of UDP metrics from many nodes,
some get lost and you will just have to wait until they are resent
later.

>         (2) maybe daemons that gather metrics should not directly
> broadcast them throughout a cluster.  Instead the metrics should be
> accumulated within a central daemon and then be broadcast.  (In other
> words, treat 'gmond' as having two separate components:  a metrics
> gathering component and a metric/cluster aggregation component.  Then
> both the metrics component of 'gmond' and the metrics that I am
> gathering should be handed to the aggregation component.)  [This is
> probably not useful without also implementing (1) above.]
>         (3)  Alex implies that there may be alternate ways to
> configure a cluster without using multicasting which may handle some
> or all aspects of this problem.

You can configure gmond to use unicast if you don't need or care about
the HA feature that multicast gives you.
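
A minimal unicast setup looks roughly like the sketch below (the
aggregator host name is a placeholder; every node points its
udp_send_channel at the aggregator, and only the aggregator needs the
udp_recv_channel and tcp_accept_channel):

  udp_send_channel {
    host = aggregator.example.com
    port = 8649
  }
  udp_recv_channel {
    port = 8649
  }
  tcp_accept_channel {
    port = 8649
  }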

>        [We can treat each node as maintaining a list of metrics and
> their current values and broadcasting deltas to that list on a
> periodic basis.  In the current system, it is possible to receive a
> delta without having the background data to which the delta applies.
> Multiple daemons each spitting out deltas to their own metrics is
> compatible with the current model.  However, we may want to have all
> the background data in a single list; we may also want each node to
> know which metric gathering daemons exist so that we can better report
> when one of the metric gathering daemons dies.]
> 
> Moving on to the issue of correcting configuration problems.  While we
> can say that having a timeout is the way to correct configuration
> issues, this is not necessarily the best implementation.  Part of my
> problem is that I have multiple daemons that gather and broadcast
> metrics.  If we address parts of that as discussed above, then it
> becomes easier to fix the broadcast address by just resetting a single
> daemon.

There was a plan to provide a plugin architecture for writing custom
metrics in ganglia, but I am not sure what happened to that.

>     So, at the current time, we can configure the system in a couple
> of ways.  We can configure the system so that a host is considered
> removed from a cluster when the host has been down sufficiently long,
> or we can manually remove the host from the cluster by restarting all
> gmond daemons in the cluster.
>     Possible alternate approaches might include providing a command
> that could be sent to a 'gmond' daemon in a cluster to remove a host
> from the cluster.  It may be that there already exist mechanisms to
> restart all gmond daemons in a cluster, but this mechanism is not
> integrated into ganglia.  
> 
> So, thanks, I think I now understand what's going on.
> 
> Cheers, Chuck
> 
> 
> 
> Alex Balk wrote: 
> > Hi Chuck,
> > 
> > 
> > See below...
> > 
> > 
> > 
> > Chuck Simmons wrote:
> > 
> >   
> > > The number of cpus does get sorted out, but I don't believe that
> > > restarting 'gmond' is a solution.  The problem occurs after restarting
> > > a number of 'gmond' processes, and the problem is caused because
> > > 'gmond' is not reporting the information.  Does 'gmond' maintain a
> > > timestamp on disk as to when it last reported the number of cpus and
> > > insist on waiting sufficiently long to report again?  Does the
> > > collective distributed memory of the system remember when the number
> > > of cpus was last reported but not remember what the last reported
> > > value was?  Is there any chance that anyone can give me hints to how
> > > the code works without me having to read the code and reverse engineer
> > > the intent?
> > > 
> > >     
> > 
> > The reporting interval for number of CPUs is defined within /etc/gmond.conf.
> > For example:
> > 
> >   collection_group {
> >     collect_once   = yes
> >     time_threshold = 1800
> >     metric {
> >      name = "cpu_num"
> >     }
> >   }
> > 
> > The above defines that the number of CPUs is collected once at the
> > startup of gmond and reported every 1800 seconds.
> > Your problem occurs because gmond doesn't save any data on disk, but
> > rather in memory. This means that if you're using a single gmond
> > aggregator (in unicast mode) and that aggregator gets restarted, it
> > will not receive another report of the number of CPUs until 1800
> > seconds have elapsed since the previous report.
> > The case of multicast is a more interesting one, since every node holds
> > data for all nodes on the multicast channel. The question here is
> > whether an update with a newer timestamp overrides all previous XML data
> > for the host. I don't think that's the case, it seems more likely that
> > only existing data is overwritten... but then, I don't use multicast, so
> > you may qualify this answer as throwing useless, obvious crap your way.
> > 
> > Generally speaking, there are 2 cases when a host reports a metric via
> > its send_channel:
> > 1. When a time_threshold expires.
> > 2. When a value_threshold is exceeded.
> > 
> > You're welcome to read the code for more insight, but a simple telnet to
> > a predefined TCP channel would probably be quicker. You could just look
> > at the XML data and compare pre-update and post-update values (yes,
> > you'll need to take note of the timestamps - again, in the XML).
> > 
> >   
> > > I understand that I can group nodes via /etc/gmond.conf.  The question
> > > is, once I have screwed up the configuration, how do I recover from
> > > that screw up?  I have restarted various gmetad's and various
> > > gmond's.  The grouping is still incorrect.  Exactly which gmetad's and
> > > gmond's do I have to shut down when.  And, again, my real question is
> > > about understanding how the code works -- how the distributed memory
> > > works.
> > > 
> > >     
> > 
> > As far as I know, you cannot recover from a configuration error unless
> > you've made sure host_dmax was set to a fairly small, non-zero value.
> > 
> > From the docs:
> > 
> >    The host_dmax value is an integer with units in seconds. When set to
> >    zero (0), gmond will never delete a host from its list even when a
> >    remote host has stopped responding. If host_dmax is set to a positive
> >    number then gmond will flush a host after it has not heard from it for
> >    host_dmax seconds. By the way, dmax means ``delete max''.
> > 
> > This way, once a host's configuration is modified to point at a
> > different send channel, the aggregator(s) on its previous channel will
> > forget about its existence once host_dmax expires.
> > 
> > Personally, I don't use multicast for various reasons, the main one
> > actually being its main advantage: every node keeps data on the entire
> > cluster.  While this provides for maximal high availability, it also
> > has a bigger memory footprint, especially when you have a few thousand
> > nodes.
> > 
> >   
> > > I'd much rather be ignored than have people try to pawn off facile
> > > answers on me.
> > > 
> > >     
> > 
> > I'd provide you with more information on a possible setup which balances
> > high availability with performance, but I wouldn't want to overflow you
> > with useless data any more than I've done so far.
> > Let me know if you'd like more information.
> > 
> > Cheers,
> > Alex
> > 
> >   
> > > Cheers, Chuck
> > > 
> > > 
> > > 
> > > Bernard Li wrote:
> > >     
> > > > Hi Chuck:
> > > >  
> > > > For the first issue - give it time, it should sort itself out. 
> > > > Alternatively, you can find out which node is reporting incorrect
> > > > information, and restart gmond on it.
> > > >  
> > > > For the second issue, you can group nodes in different data_source
> > > > via the multicast port in /etc/gmond.conf.  Use the same port # for
> > > > nodes you want to belong to the same group.
> > > >  
> > > > You'll need to restart gmetad and gmond for the new groupings to take
> > > > effect.
> > > >  
> > > > Cheers,
> > > >  
> > > > Bernard
> > > > 
> > > > ------------------------------------------------------------------------
> > > > *From:* [EMAIL PROTECTED] on behalf of
> > > > Chuck Simmons
> > > > *Sent:* Wed 22/03/2006 17:54
> > > > *To:* ganglia-developers@lists.sourceforge.net
> > > > *Subject:* [Ganglia-developers] reorganizing clusters
> > > > 
> > > > I need help understanding two things.
> > > > 
> > > > I currently have a grid.  One of the clusters in the grid is named
> > > > "staiu" and the "grid" level web page reports that this has 8 hosts
> > > > containing 4 cpus.  In actuality, this has 8 hosts each containing 4
> > > > cpus, but apparently the hosts are not reporting the current number of
> > > > cpus to the front end.  Why not?  I recently restarted gmond on each of
> > > > the 8 hosts.
> > > > 
> > > > Another cluster is named "staqp05-08" and the "grid" level web page
> > > > reports that this has 12 hosts.  In actual fact, it only has 4 hosts. 
> > > > The extra 8 hosts are the 8 hosts of the 'staiu' cluster.  On the
> > > > cluster level page for staqp05-08, the "choose a node" pull down menu
> > > > lists the 8 staiu hosts, and the "hosts up" number contains the staiu
> > > > hosts, and there are undrawn graphs for each of the staiu hosts in the
> > > > "load one" section.  What do I have to do so that the web pages or gmond
> > > > daemons or whatever won't think that the staqp cluster contains the
> > > > staiu hosts?  What is the specific mechanism that causes this
> > > > association to persist despite having shut down all staqp gmond daemons
> > > > and both the gmond and gmetad daemons on the web server, simultaneously,
> > > > and then starting up that collection of daemons?
> > > > 
> > > > Thanks, Chuck
> > > > 
> > > > 
> > > > 
> > > >       
-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/


