One more email on this subject before we hit the mattresses for the new
release.

So I think I am convinced that fixed-size names (expressed as hashes) are a
good idea, provided we are careful and efficient with them. 

Each metric would carry a hash value as its name. Short, compact, easy to 
process, and expressive if used with an internal mapping table 
(hash->"fully-qualified metric name"). 
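
As a rough sketch of that mapping table (Python for brevity, the real thing
would be C inside gmond): it assumes MD5 as the hash, although any fixed-size
digest would do here. The names metric_key and NameTable and the dotted name
"custom.db.queries" are made up for illustration.

```python
import hashlib


def metric_key(fq_name: str) -> bytes:
    """Return the fixed-size key (16-byte MD5 digest) for a metric name."""
    return hashlib.md5(fq_name.encode("utf-8")).digest()


class NameTable:
    """Internal mapping table: hash -> fully-qualified metric name."""

    def __init__(self):
        self._names = {}

    def register(self, fq_name: str) -> bytes:
        key = metric_key(fq_name)
        # setdefault returns the name already stored under this key, if any.
        existing = self._names.setdefault(key, fq_name)
        if existing != fq_name:
            raise ValueError(f"hash clash: {existing!r} vs {fq_name!r}")
        return key

    def resolve(self, key: bytes):
        """Return the name for a key, or None if it was never registered."""
        return self._names.get(key)


table = NameTable()
key = table.register("custom.db.queries")   # hypothetical metric name
name = table.resolve(key)                    # back to the readable name
```

Only the 16-byte key would travel on the wire; the readable name stays in
the table on each node.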

This means we need separate "branch" messages that create a branch. This is a 
good thing also because it allows us to specify attributes for a branch 
(which will be inherited by its children).
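
A minimal sketch of the inheritance idea, assuming a branch simply falls back
to its parent for any attribute it does not set itself. The Branch class and
the attribute names ("tmax", "units") are hypothetical, not an agreed format.

```python
class Branch:
    """A node in the metric tree that carries attributes for its children."""

    def __init__(self, name, parent=None, **attrs):
        self.name = name
        self.parent = parent
        self.attrs = attrs

    def effective_attrs(self):
        # Start from whatever the parent chain provides, then let this
        # branch's own attributes override the inherited ones.
        inherited = self.parent.effective_attrs() if self.parent else {}
        return {**inherited, **self.attrs}


root = Branch("custom", tmax=60, units="none")
child = Branch("db", parent=root, tmax=900)   # overrides tmax, inherits units
child_attrs = child.effective_attrs()
```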

I think that some branches should be "well-known" by all nodes. These can 
house standard metrics that ship with ganglia. These branches do not need to 
be explicitly described. This mechanism gives a nice way to bootstrap the 
metric tree and reduces the number of "branch" messages, especially in the 
common case where there are no user-defined ones.

Therefore only custom branches get sent during the send_all_metric_data() call 
in listen.c. This function is used to send all local metrics when a new gmond 
is discovered.
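
To make the filtering concrete, here is a small sketch. It is not the code in
send_all_metric_data(); the set of well-known branch names and the helper
branches_to_announce are invented for the example.

```python
# Hypothetical set of branches every node is assumed to know at startup.
WELL_KNOWN_BRANCHES = frozenset({"cpu", "memory", "network"})


def branches_to_announce(local_branches):
    """Return only the branches that need an explicit "branch" message."""
    return [b for b in local_branches if b not in WELL_KNOWN_BRANCHES]


# A node with one user-defined branch announces just that branch.
to_send = branches_to_announce(["cpu", "myapp", "memory"])

# In the common case (no user-defined branches) nothing is announced.
nothing = branches_to_announce(["cpu", "network"])
```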

Finally, I suggest that we make the name sent in an XDR packet an MD5 hash of
the fully-qualified metric name. This 128-bit hash is not too long, and since
we do not know the names of user-defined branches a priori, hashing gives us
fixed-size names without any coordination between nodes. MD5 cannot yield a
unique value for every possible string, but with a 128-bit digest an
accidental collision between two metric names is vanishingly unlikely.
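
For a sense of scale: MD5 digests are 128 bits wide, so the chance of two
names clashing by accident can be estimated with the usual birthday
approximation. The helper name collision_probability and the figure of one
million metric names are assumptions for illustration only.

```python
def collision_probability(n_names: int, bits: int = 128) -> float:
    """Birthday approximation p ~= n(n-1)/2 / 2**bits, valid for small p."""
    return n_names * (n_names - 1) / 2 / 2**bits


# Even a million distinct metric names gives odds on the order of 1e-27.
p = collision_probability(1_000_000)
```

This only covers accidental clashes between honest names, not collisions
that someone constructs on purpose.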

> See above.  I also proposed (about four emails ago) a DNS-like metric
> resolver function that allows a libganglia-using client to submit a request
> for description of a metric ... with answers being provided by the oldest
> or second-oldest node.

I think the DNS-like protocol is too much. Basically you're right:

> And even if there is a failure, it's only on one node.  There's still n-1
> nodes out there that have an accurate picture of the cluster.

Someone (one of the n-1 nodes) will know about the branch and multicast it.

> Anyway, let's say that doesn't work, or your six-fig Cisco monkey shoved a
> banana in a switch somewhere and the "create-branch" message arrives after
> the metric itself.
>
> At this point we have two options:
>
> *  Discard the metric data, process the create-branch data, wait for the
> next metric transmission.  Straightforward but it means a hole in the data
> for up to t_max and that'd be a bummer if it's one of those 15-minute
> metrics.
>
> *  Guess at adding the metric data based on the payload type of the XDR.
> If you win and we have a string in there at least naming the actual metric,
> then we sock it into an "uncategorized" branch and query/wait for the
> branch data.  After the create-branch data is received, we update the
> lookup hash and the metric hash to move the guessed metric into its
> rightful place.  This is quite a bit more complicated, obviously.  And we
> can't report this metric until its rightful place is secured.

We'll have to think about this case some more. You made good points.
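
Purely as a starting point for that discussion, the second option could be
sketched like this. It assumes a metric message identifies its parent branch,
which is not settled anywhere in this thread, and every name here (MetricTree,
on_metric, on_create_branch) is made up.

```python
class MetricTree:
    """Holds metrics that arrive before their branch until it is created."""

    def __init__(self):
        self.branches = {}  # branch key -> {metric key: value}
        self.pending = {}   # same shape, but parked and never reported

    def on_metric(self, branch_key, metric_key, value):
        target = self.branches.get(branch_key)
        if target is None:
            # Branch not announced yet: park the value instead of dropping it.
            self.pending.setdefault(branch_key, {})[metric_key] = value
        else:
            target[metric_key] = value

    def on_create_branch(self, branch_key):
        branch = self.branches.setdefault(branch_key, {})
        # Move parked metrics into place; setdefault keeps any value that
        # was already stored in the branch itself.
        for metric_key, value in self.pending.pop(branch_key, {}).items():
            branch.setdefault(metric_key, value)

    def reportable(self):
        """Only metrics whose branch is known are ever reported."""
        return {b: dict(m) for b, m in self.branches.items()}


tree = MetricTree()
tree.on_metric("b1", "m1", 42)     # metric arrives first
before = tree.reportable()         # nothing reportable yet
tree.on_create_branch("b1")        # late create-branch message
after = tree.reportable()          # metric now sits in its branch
```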

-Federico
