Jason A. Smith wrote:
> I have a few questions about ganglia development.  I am using ganglia on
> RedHat 7.x i386 here.

> 1.  I tried the new network metrics that are in the latest cvs version
> of the monitoring-core module and was wondering what happens on a dual
> NIC computer, is just the first interface counted or does it total all
> interfaces when calculating bytes/pkts_in/out?  I assume this number is
> the rate between measurement intervals, correct?

I should field this since I wrote it. :)

The network code scans /proc/net/dev and ignores the loopback adapter (at the moment I don't remember whether the code simply skips the first line, which is typically loopback, or whether it actually looks for a token beginning with "lo").

The traffic stats for the remaining interfaces are summed. Every time the metric is collected, the raw counters are saved and timestamped, so the reported values *should* in fact be bytes-per-second and packets-per-second: the delta since the previous sample divided by the elapsed time.
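Roughly, the pattern looks like this. This is a minimal sketch of the approach described above, not the actual gmond source; the field positions assume the 2.4-era /proc/net/dev layout, and the helper names are made up:

#include <stdio.h>
#include <string.h>
#include <time.h>

struct net_sample {
    double bytes_in, bytes_out, pkts_in, pkts_out;
    time_t stamp;
};

static struct net_sample last_net;          /* cached previous sample */

static int read_net_sample(struct net_sample *s)
{
    char line[512];
    FILE *fp = fopen("/proc/net/dev", "r");

    if (!fp)
        return -1;
    memset(s, 0, sizeof(*s));
    s->stamp = time(NULL);

    while (fgets(line, sizeof(line), fp)) {
        char *name = line, *colon = strchr(line, ':');
        unsigned long rb, rp, tb, tp, skip;

        if (!colon)
            continue;                       /* the two header lines */
        while (*name == ' ')
            name++;
        if (strncmp(name, "lo", 2) == 0)
            continue;                       /* ignore the loopback adapter */
        if (sscanf(colon + 1,
                   "%lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                   &rb, &rp, &skip, &skip, &skip, &skip, &skip, &skip,
                   &tb, &tp) == 10) {
            s->bytes_in  += rb;  s->pkts_in  += rp;
            s->bytes_out += tb;  s->pkts_out += tp;
        }
    }
    fclose(fp);
    return 0;
}

/* bytes_in: delta against the cached sample over the elapsed seconds;
 * bytes_out/pkts_in/pkts_out work the same way. */
double metric_bytes_in(void)
{
    struct net_sample now;
    double rate = 0.0, dt;

    if (read_net_sample(&now) < 0)
        return 0.0;
    dt = difftime(now.stamp, last_net.stamp);
    if (last_net.stamp != 0 && dt > 0)
        rate = (now.bytes_in - last_net.bytes_in) / dt;
    last_net = now;
    return rate;
}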

> 2. I thought I saw a message about adding disk i/o metrics to gmond.

Yeah, that was probably me. :)

> Is this currently in development and will it be in the upcoming 2.5.0
> release?

I've implemented something like it on Solaris, but those metrics are gathered through a different interface (namely, "not procfs"). Most of the metrics in linux.c are gathered by comparing a cached, timestamped snapshot of a /proc text file against its current contents (the same pattern as the network sketch above), so it could be adapted to new metrics very easily.

> We have a set of scripts that we use to collect some extra
> data and publish it through gmetric.  Some of the extra parameters we
> collect are: # of established tcp connections, various disk i/o stats
> from iostat like read/write rates, average wait time and service time
> for requests.  Parameters like this would be useful to include in the
> monitoring core.  I haven't seen a TODO list, do you have a set of
> parameters that will eventually be included in the core?

My fileservers are all running Solaris, so I don't have as much interest in porting that to Linux. However, iostat apparently gets a lot of its data from /proc/stat and /proc/partitions (especially /proc/partitions), so new metrics could certainly be added to monitor that stuff.
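If someone wants to take a crack at it, a hypothetical sketch in the same cache-and-diff style might look like the following. I'm assuming the 2.4 /proc/partitions layout with block statistics compiled in (major minor #blocks name rio rmerge rsect ruse wio wmerge wsect ...), which is kernel-dependent, so check the columns against your own box before trusting them:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

struct disk_sample {
    double rsect, wsect;        /* sectors read / written, summed over disks */
};

static int read_disk_sample(struct disk_sample *s)
{
    char line[512], name[64];
    unsigned long major, minor, blocks;
    unsigned long rio, rmerge, rsect, ruse, wio, wmerge, wsect;
    FILE *fp = fopen("/proc/partitions", "r");

    if (!fp)
        return -1;
    memset(s, 0, sizeof(*s));

    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "%lu %lu %lu %63s %lu %lu %lu %lu %lu %lu %lu",
                   &major, &minor, &blocks, name,
                   &rio, &rmerge, &rsect, &ruse,
                   &wio, &wmerge, &wsect) != 11)
            continue;   /* header, blank line, or no block stats compiled in */
        if (isdigit((unsigned char)name[strlen(name) - 1]))
            continue;   /* crude: skip partition rows, count whole disks once */
        s->rsect += rsect;
        s->wsect += wsect;
    }
    fclose(fp);
    return 0;
}

Timestamp each sample and diff it against the previous one, exactly as the network code does, and you have read/write sectors-per-second.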

The main problem with adding new internal metrics is that metric names are converted to numbers via a compiled-in metric hash before they're broadcast. So if you add five metrics and your monitoring core multicasts "number_of_cokes_left_in_vending_machine" as internal metric #31, then another Linux (or Solaris, or IRIX, or ...) monitoring core that hears it but hasn't had the same hacks applied will map it to something else entirely ("number_of_sniper_kills"), or discard it completely if it exceeds the number of entries in its hash.
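To make the failure mode concrete, here is a toy illustration. It is not gmond's actual tables, hash, or wire format, and it uses a much smaller key space than the real #31 example:

#include <stdio.h>

static const char *sender_metrics[] = {
    "cpu_user", "cpu_system", "bytes_in", "bytes_out",
    "number_of_cokes_left_in_vending_machine",   /* locally added, key 4 */
};

static const char *receiver_metrics[] = {
    "cpu_user", "cpu_system", "bytes_in", "bytes_out",
    "number_of_sniper_kills",                    /* different hack, key 4 */
};

#define NELEM(a) (sizeof(a) / sizeof((a)[0]))

int main(void)
{
    unsigned key = 4;   /* what the sender multicasts for its new metric */

    printf("sender meant:      %s\n", sender_metrics[key]);
    if (key < NELEM(receiver_metrics))
        printf("receiver decodes:  %s\n", receiver_metrics[key]);
    else
        printf("receiver discards: key %u out of range\n", key);
    return 0;
}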

If you check the archives you'll see that I made a big fuss about this a few weeks back. No resolution on that yet. :)

The monitoring core is (IMO) at the point in its maturity where a lot of people are starting to use it and find it useful, so I expect the Linux version will probably lead the way in terms of which metrics the majority of users want. The feeling I get is that once 2.x becomes fairly stable, we will start to see talk of development on 3.0, which I imagine will have a radically different architecture, based on some of the musings I've seen on this list.

> 3.  What is the long term development plan for gmetad?  Will it always
> remain a perl script or will it eventually be rewritten in C?  I think I
> saw an earlier message about the known problems with gmetad hanging or
> dying because of network problems or hosts not responding.  Are there
> any ideas on a way to solve this problem?

The argument for gmetad being a perl script goes something like this:

"It's perl, it's flexible, and it deals with lots of strings."

Which I can't really find fault with. And it runs like a dream on my monitoring box (also Solaris), the caveats quoted above notwithstanding. I haven't heard of anyone *on Linux* having these problems so far, so it might be related to the Solaris implementation of perl's netcode/Socket.pm/who-knows-what.

There is definitely talk of rewriting it in C. But, to paraphrase a Utah Phillips story, we have a rule around this here list that the person who complains about the code gets to write the replacement. Which is why I am always very careful to say, "Good lord, this is a big ugly Perl script! GOOD THOUGH!"

:)

So when can we expect your CVS checkin of gmetad.c?
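For whoever does pick it up: the usual cure for the hanging-on-dead-hosts problem in a C rewrite is bounded I/O, e.g. a non-blocking connect() raced against select(). This is only a sketch under my own assumptions, not gmetad's actual design; the function name is made up, and 8649 is just the default gmond port mentioned elsewhere in this thread:

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Returns a connected fd, or -1 on error or timeout. */
int connect_with_timeout(const char *ip, unsigned short port, int seconds)
{
    struct sockaddr_in sa;
    struct timeval tv = { seconds, 0 };
    fd_set wfds;
    int err = 0;
    socklen_t len = sizeof(err);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);          /* e.g. 8649 */
    sa.sin_addr.s_addr = inet_addr(ip);

    fcntl(fd, F_SETFL, O_NONBLOCK);     /* never block inside connect() */

    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 &&
        errno != EINPROGRESS)
        goto fail;

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
        goto fail;                      /* timed out or error: move on */

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        goto fail;

    return fd;                          /* caller reads the XML dump here */
fail:
    close(fd);
    return -1;
}

A matching read timeout (select() before each read(), or SO_RCVTIMEO on the socket) covers the other failure mode, where a host accepts the connection and then goes quiet.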

> 4.  Just a warning:  Have you ever run gmond on hosts that are using
> iptables for local firewalling?  I have tried it here and think there is
> a bug with the iptables handling of multicast packets.  I put in a rule
> to accept packets for 239.2.11.71 on port 8649, but several minutes
> after starting the iptables firewall, the host stops receiving the
> multicast packets from other hosts; it only sees the multicast packets
> it sends out.  Then after stopping iptables it takes a minute or two
> before it starts seeing them from other hosts again.  I verified this by
> watching the packets with tcpdump on two hosts.  I don't have this
> problem if I use an equivalent ipchains rule instead of iptables.

That sounds pretty wild, although I can't say I'm totally shocked that iptables messes with multicast. The monitoring core, as of CVS, has switched to libdnet for its network library needs... have you tried a recent CVS build in the same situation versus two 2.4.x builds?

And it could also be the kernel... are we having fun yet?

All of the work I'm doing is far, far inside a firewall so the question, to me, is unfortunately academic ... but maybe I've given you some ideas. (or given someone else reading this archived post some ideas?)

> PS. Are suggestions or patches welcome if I have some ideas on
> improvements with gmetad or its webfrontend?

I sure hope so. Nobody on this list seems to mind talking out design ideas, either.


