Federico Sacerdoti wrote:
So, as Steven and others have mentioned, we have a problem with ganglia
metrics. Metrics currently lie in a flat namespace, with no hierarchical
groupings. I have talked with Matt and Mason (a ganglia developer and my
boss) about this problem, and would like to lay out some of our ideas.
Hey, you guys WERE listening all those times I went on and on about this
subject. :)
Another advantage of hierarchies comes from object-oriented design.
Attributes in the Branch tag, such as DMAX (which controls when metrics
get deleted), become the defaults for all metrics below that branch.
These can be overridden by individual metrics, analogous to overriding
base-class methods in an OO class tree. This gives us an easy way to
assign attribute values to a whole group of metrics.
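To make the inheritance idea concrete, here's a minimal C sketch -- the struct and every name in it are hypothetical, not actual gmond code -- of a metric tree where a metric with no DMAX of its own falls through to the nearest ancestor branch's value:

```c
/* A minimal sketch, assuming a hypothetical metric tree -- these are
 * not actual gmond structures.  A metric with no DMAX of its own
 * inherits the nearest ancestor branch's value, like a base-class
 * method lookup. */
#include <assert.h>
#include <stddef.h>

#define DMAX_UNSET (-1)  /* sentinel: inherit from the parent branch */

struct metric_node {
    const char *name;
    int dmax;                    /* seconds before deletion, or DMAX_UNSET */
    struct metric_node *parent;  /* NULL at the root */
};

/* Walk up the tree until some ancestor supplies a DMAX. */
int effective_dmax(const struct metric_node *n)
{
    while (n != NULL && n->dmax == DMAX_UNSET)
        n = n->parent;
    return (n != NULL) ? n->dmax : 0;  /* 0 = never delete, by convention */
}
```

A branch would set its DMAX once, and every metric underneath inherits it unless it overrides locally.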
It seems to me this would also make the "DSO-ification" of the monitoring
core a smoother process, not to mention a cleaner one from the standpoint
of those developing the DSOs. :)
A third advantage is cleaner namespaces. You can call 'cpu_num' simply
'num'. Similar naming simplifications are possible for the other
metrics. The most significant advantage is that we only have to worry
about name collisions among siblings in the tree. There can be a 'num'
metric in another branch (for example, the 'num' of network interfaces).
So how do we name metrics in the XDR packet if we adopt a metric
hierarchy? This is a difficult problem, since we want to allow new
metrics to appear at any time. Imagine an XDR packet comes in. We need
to identify the metric, and update its value in our hash tables.
I was thinking of "yet another hash" that has a hashed-up number based on
the name or hierarchy position of the metric as a key. The idea being,
this number is shorter than using the fully-qualified name of the metric
all the time.
So instead of encoding "cpu.idle" we encode 0x03FA450A and that field's 50%
shorter (even better if we get to "processes.top.1.cpu_percentage"), and
only have to multicast the real string name once. The hierarchical
information is stored (as a pointer, at the very least) in this hash.
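As an illustration of the short-key idea (my own sketch -- FNV-1a is just one choice of stable string hash, and real code would still need a collision-resolution step at create-branch time):

```c
/* A sketch of the "short numeric key" idea: hash the fully-qualified
 * metric name down to 32 bits.  FNV-1a is my own choice here for
 * illustration; any stable string hash would do, and a real
 * implementation would still need to resolve collisions when a
 * branch/metric is created. */
#include <assert.h>
#include <stdint.h>

uint32_t metric_key(const char *fqname)
{
    uint32_t h = 2166136261u;        /* FNV-1a offset basis */
    while (*fqname) {
        h ^= (uint8_t)*fqname++;
        h *= 16777619u;              /* FNV-1a prime */
    }
    return h;
}
```

Four bytes on the wire no matter how long "processes.top.1.cpu_percentage" gets, and the full string only has to be multicast once, alongside its key.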
What's really going to be key here is not so much the idea of making the
statically-#define'd metric hash dynamic, but keeping it up to date...
If we go far enough in this it'll look like SNMP, only more collaborative. :)
I believe the answer is that new nodes get their branch hierarchy all at
once from the oldest gmond in the cluster (which I will call the eldest
node). Matt has been talking about this for some time, as it will solve
some other problems as well. If we get an XDR metric packet that
specifies an unknown branch, we discard it. However, we realize that we
must have missed something, so we query the eldest node for its metric
hierarchy. If we can't find the eldest node, we query the second eldest,
and so on. We also query the second eldest if we didn't learn anything
new from the eldest itself. (This handles the case where the eldest
node has incomplete information.)
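That fallback sequence could look something like this in C (everything here, including the callback signature, is a hypothetical sketch of the eldest-first query loop, not real gmond code):

```c
/* Hypothetical sketch of the eldest-first fallback: nodes_by_age is
 * sorted eldest first, and query() returns nonzero if that node
 * taught us something new about the hierarchy.  None of these names
 * come from the real gmond source. */
#include <assert.h>
#include <stddef.h>

typedef int (*query_fn)(int node_id);

/* Returns the id of the first node that answered usefully, or -1. */
int resolve_unknown_branch(const int *nodes_by_age, size_t n, query_fn query)
{
    for (size_t i = 0; i < n; i++)
        if (query(nodes_by_age[i]))
            return nodes_by_age[i];
    return -1;
}

/* Example stub callbacks for trying it out: */
static int node7_answers(int id) { return id == 7; }
static int nobody_answers(int id) { (void)id; return 0; }
```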
I would suggest a fallback method (at least an option) of consulting an
"authoritative host." Maybe even a host running gmetad could be used as a
fallback (after all, it's going to have to keep track of all this stuff
too), although I don't necessarily think I'd recommend that.
At the very least this will help us during development, and it's possible
that some users might have a particular gmond running on "more reliable"
hardware (this isn't a dig at any one platform, I was thinking along the
lines of redundant PSUs and such) to be responsible for keeping track of
cluster metric metadata.
The assumption is that the eldest node has been listening to all the
"create-branch" messages, and has a complete metric tree.
This is gonna sound like DNS. If anyone doesn't know DNS, speak up now
before I get too snug in wearing my hostmaster hat again...
The primary node (eldest) may actively send sync'ing messages to the
secondary node (second-eldest) in case of the primary's untimely death.
Since I assume all traffic mentioned here will be on the multicast channel,
a separate conduit between primary and secondary is probably redundant -
eldest and second-eldest will behave identically except that the
second-eldest won't answer queries unless the eldest misses a heartbeat or
leaves a query unanswered for more than "query_timeout" seconds.
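So the second-eldest's decision boils down to two checks. A hedged sketch (the parameter names and the exact policy are my reading of the proposal, not settled protocol):

```c
/* A sketch of the second-eldest's "should I answer?" test: answer
 * only if the eldest has missed a heartbeat, or has let a pending
 * query age past query_timeout.  All names and the exact policy are
 * my reading of the proposal, not settled protocol. */
#include <assert.h>
#include <stdbool.h>

bool secondary_should_answer(long now, long eldest_last_heartbeat,
                             long heartbeat_interval,
                             long query_age, long query_timeout)
{
    bool eldest_silent = (now - eldest_last_heartbeat) > heartbeat_interval;
    bool query_stale   = query_age > query_timeout;
    return eldest_silent || query_stale;
}
```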
Individual nodes are always "authoritative" for branches of the metric tree
which they themselves have implemented. The query packet format needs to
have an optional destination field which contains the multicast hostname/IP
of a member node. If a node receives a query addressed to it from the
elder server, it responds by re-sending its "create-branch/create-metric"
messages to the cluster. This should be the only time these messages are
*rebroadcast* by a node.
On joining the network, a new node will announce itself and wait for the
heartbeats to start flowing in before it sends any multicasts besides
heartbeat, hostname, gmond_started and gmond_version. The elder gmond
should, upon receiving a new gmond heartbeat, transmit the metric tree.
The new gmond, as it receives the tree, compares it to its internal metrics
and sends "create-branch/create-metric" messages for each metric it
supports that is not in the tree received from the elder.
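That join-time diff might look like this (a sketch with flat string arrays standing in for the real metric tree; all function and metric names here are hypothetical):

```c
/* A sketch of the join-time diff: announce only the metrics we
 * implement that the elder's tree doesn't already contain.  Flat
 * string arrays stand in for the real metric tree; all names here
 * are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int in_tree(const char *name, const char **tree, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tree[i], name) == 0)
            return 1;
    return 0;
}

/* Fills 'out' with the metrics we must announce; returns how many. */
size_t metrics_to_announce(const char **local, size_t nlocal,
                           const char **tree, size_t ntree,
                           const char **out)
{
    size_t k = 0;
    for (size_t i = 0; i < nlocal; i++)
        if (!in_tree(local[i], tree, ntree))
            out[k++] = local[i];
    return k;
}
```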
Cripes, this is turning into an RFC. Should I just write this up as such?
This email message is getting too long, but I would go on about how we
could use the idea of database indexes to quickly locate any branch in
the tree.
Heh, in that case that renders the first part of my message redundant. ;)
I hope I have been relatively clear about these ideas. I realize this
problem is pretty dense, and this solution is in its infancy. But the
point I would like to drive home is that a naming hierarchy is helpful
for specific reasons, and that its efficient implementation is possible
in the ganglia framework.
Dense, yes, but the area of metrics is just about the only one in the
Ganglia design that *doesn't* scale well (kudos, Matt & co.). I'm sure
that we can work this out if we just keep banging those rocks together. :)