One more email on this subject before we hit the mattresses for the new release.

So I think I am convinced that fixed-size names (expressed as hashes) are a good idea, provided we are careful and efficient with them. Each metric would carry a hash value as its name: short, compact, easy to process, and expressive when used with an internal mapping table (hash -> "fully-qualified metric name").

This means we need separate "branch" messages that create a branch. This is also a good thing because it allows us to specify attributes for a branch (which will be inherited by its children).

I think that some branches should be "well-known" to all nodes. These can house the standard metrics that ship with ganglia, and they do not need to be explicitly described. This mechanism gives a nice way to bootstrap the metric tree and reduces the number of "branch" messages, especially in the common case where there are no user-defined ones. Therefore only custom branches get sent during the send_all_metric_data() call in listen.c, the function used to send all local metrics when a new gmond is discovered.

Finally, I suggest that we make the name sent in an XDR packet an MD5 hash of the fully-qualified metric name. This 128-bit hash is not too long, and since we do not know the names of user-defined branches a priori, MD5 makes an accidental collision between two names extremely unlikely. No fixed-size hash can be unique for all possible strings, but at 128 bits the odds of two metric names in one cluster colliding by chance are negligible.

> See above. I also proposed (about four emails ago) a DNS-like metric
> resolver function that allows a libganglia-using client to submit a request
> for description of a metric ... with answers being provided by the oldest
> or second-oldest node.

I think the DNS-like protocol is too much. Basically you're right:

> And even if there is a failure, it's only on one node. There's still n-1
> nodes out there that have an accurate picture of the cluster.

Someone (one of the n-1 nodes) will know about the branch and multicast it.
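
To make the naming scheme above concrete, here is a rough sketch of the mapping table, in Python rather than C just to keep it short. Everything named here is an assumption for illustration: the branch names in WELL_KNOWN_BRANCHES, the NameTable class, and the metric_key() helper are hypothetical and do not exist in ganglia.

```python
import hashlib

# Hypothetical well-known branches; the real list would ship with ganglia.
WELL_KNOWN_BRANCHES = ("cpu", "memory", "network")


def metric_key(fq_name):
    """Return the 16-byte (128-bit) MD5 digest used as the fixed-size wire name."""
    return hashlib.md5(fq_name.encode("utf-8")).digest()


class NameTable:
    """Maps hash -> fully-qualified name, preloaded with well-known branches."""

    def __init__(self):
        self._names = {}
        # Well-known branches never need a "branch" message on the wire.
        for branch in WELL_KNOWN_BRANCHES:
            self.register(branch)

    def register(self, fq_name):
        """Record a name (e.g. on receipt of a "branch" message); return its key."""
        key = metric_key(fq_name)
        existing = self._names.get(key)
        if existing is not None and existing != fq_name:
            # Vanishingly unlikely by accident, but cheap to detect.
            raise ValueError("hash collision: %r vs %r" % (existing, fq_name))
        self._names[key] = fq_name
        return key

    def resolve(self, key):
        """Return the fully-qualified name for a key, or None if unknown."""
        return self._names.get(key)
```

A node would call register() when a "branch" message for a custom branch arrives and resolve() when a metric packet arrives; a None result is the "metric arrived before its branch" case.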

> Anyway, let's say that doesn't work, or your six-fig Cisco monkey shoved a
> banana in a switch somewhere and the "create-branch" message arrives after
> the metric itself.
>
> At this point we have two options:
>
> * Discard the metric data, process the create-branch data, wait for the
> next metric transmission. Straightforward but it means a hole in the data
> for up to t_max and that'd be a bummer if it's one of those 15-minute
> metrics.
>
> * Guess at adding the metric data based on the payload type of the XDR.
> If you win and we have a string in there at least naming the actual metric,
> then we sock it into an "uncategorized" branch and query/wait for the
> branch data. After the create-branch data is received, we update the
> lookup hash and the metric hash to move the guessed metric into its
> rightful place. This is quite a bit more complicated, obviously. And we
> can't report this metric until its rightful place is secured.

We'll have to think about this case some more. You made good points.

-Federico
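
P.S. As a starting point for thinking about the second option, here is a rough Python sketch of holding early metrics in an "uncategorized" area until their create-branch message shows up. It assumes each metric message carries the key of its parent branch, which is not settled anywhere in this thread; the MetricTree class and its method names are hypothetical.

```python
class MetricTree:
    """Holds metrics whose branch is not yet known, and promotes them later."""

    def __init__(self):
        self.branches = {}       # branch_key -> {metric_key: value}
        self.uncategorized = {}  # branch_key -> {metric_key: value}, not reported

    def on_metric(self, branch_key, metric_key, value):
        if branch_key in self.branches:
            self.branches[branch_key][metric_key] = value
        else:
            # Branch unknown: park the value instead of discarding it.
            self.uncategorized.setdefault(branch_key, {})[metric_key] = value

    def on_create_branch(self, branch_key):
        # Move any parked metrics into their rightful place.
        held = self.uncategorized.pop(branch_key, {})
        self.branches.setdefault(branch_key, {}).update(held)

    def reportable(self):
        """Only metrics whose branch has been created are reported."""
        return {b: dict(m) for b, m in self.branches.items()}
```

This avoids the hole of up to t_max from the first option, at the cost of the extra bookkeeping; parked metrics stay out of reportable() until the branch is created.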
