As you read through my gobbledygook debug code, keep in mind that mucking around in the OSF internals did in fact drive me insane for a little while. So no making fun!

It looks like the metric functions are executing properly, though. Remember that the monitoring core doesn't automatically insert its own metrics locally: the data collection and transmission thread multicasts them out over the wire, and the mcast listening thread is the one that receives the XDR packets, decodes 'em and sticks them into the internal cluster hash.

And *that* hash is what gets dumped to you as XML when you telnet localhost 8649 (or run gstat or whatever).
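Because the XML dump on port 8649 is built from the cluster hash, a quick way to see whether any metrics made it into that hash is to fetch the dump and count the metric entries. Below is a minimal C sketch of what "telnet localhost 8649" does. It assumes the dump marks each metric with a `<METRIC ` start tag, and `fetch_gmond_xml` and `count_metric_elements` are hypothetical helper names, not part of ganglia.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Count "<METRIC " start tags in an XML dump.
 * The element name is an assumption about gmond's XML format. */
int count_metric_elements(const char *xml)
{
    const char *tag = "<METRIC ";
    size_t tag_len = strlen(tag);
    const char *p = xml;
    int n = 0;

    while ((p = strstr(p, tag)) != NULL) {
        n++;
        p += tag_len;
    }
    return n;
}

/* Hypothetical helper: connect to the XML port (like "telnet localhost 8649"),
 * read until gmond closes the connection, and NUL-terminate buf.
 * Returns the number of bytes stored, or -1 on error. */
long fetch_gmond_xml(const char *ip, unsigned short port, char *buf, size_t cap)
{
    struct sockaddr_in addr;
    size_t total = 0;
    ssize_t n;
    int fd;

    if (cap == 0)
        return -1;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }

    /* Read until EOF or until the buffer is full (output is truncated then). */
    while (total < cap - 1 &&
           (n = read(fd, buf + total, cap - 1 - total)) > 0)
        total += (size_t)n;

    close(fd);
    buf[total] = '\0';
    return (long)total;
}
```

For example, call `fetch_gmond_xml("127.0.0.1", 8649, xml, sizeof xml)` and then `count_metric_elements(xml)`. Suppose the -d2 log shows "XDR data successfully sent" but the count for the local host stays at zero. That combination would point at the multicast receive path rather than the metric functions.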

So I would check out your multicast config, both in the monitoring core and on that system in general (what happens when you ping 239.2.11.71 while the monitoring core's running?).
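The quoted -d2 log reports "encoded 8 XDR bytes" for the numeric metrics, 12 for os_name and os_release, and 16 for machine_type. Those sizes fit a message made of a 4-byte metric key followed by the XDR-encoded value. That layout is an inference from the log, not taken from gmond's source. The C sketch below hand-encodes the assumed layout using the XDR rules from RFC 4506:

- an unsigned int is 4 big-endian bytes;
- a string is a 4-byte length followed by the bytes, zero-padded to a multiple of 4.

It assumes os_name is the four-character "OSF1" and machine_type is "alpha". All function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Write an XDR unsigned int: 4 bytes, most significant byte first.
 * The caller must supply a buffer with room for the encoded data. */
size_t put_xdr_u32(unsigned char *buf, unsigned long v)
{
    buf[0] = (unsigned char)((v >> 24) & 0xffu);
    buf[1] = (unsigned char)((v >> 16) & 0xffu);
    buf[2] = (unsigned char)((v >> 8) & 0xffu);
    buf[3] = (unsigned char)(v & 0xffu);
    return 4;
}

/* Write an XDR string: 4-byte length, the bytes, then zero padding
 * up to a multiple of 4 bytes. */
size_t put_xdr_string(unsigned char *buf, const char *s)
{
    size_t len = strlen(s);
    size_t padded = (len + 3) & ~(size_t)3;
    size_t off = put_xdr_u32(buf, (unsigned long)len);

    memcpy(buf + off, s, len);
    memset(buf + off + len, 0, padded - len);
    return off + padded;
}

/* Assumed message layout: metric key, then a numeric value. */
size_t encode_numeric_metric(unsigned char *buf, unsigned key, unsigned long value)
{
    size_t off = put_xdr_u32(buf, key);
    return off + put_xdr_u32(buf + off, value);
}

/* Assumed message layout: metric key, then a string value. */
size_t encode_string_metric(unsigned char *buf, unsigned key, const char *value)
{
    size_t off = put_xdr_u32(buf, key);
    return off + put_xdr_string(buf + off, value);
}
```

Under these assumptions, a numeric metric encodes to 4 + 4 = 8 bytes. "OSF1" encodes to 4 + 4 + 4 = 12 bytes. "alpha" encodes to 4 + 4 + 8 = 16 bytes, since five characters pad to eight. These match the sizes in the log, so the sending side looks healthy, and the question becomes whether those packets come back in on 239.2.11.71.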

Steve Feehan wrote:
On Tue, Nov 25, 2003 at 01:53:51PM -0800, Brooks Davis wrote:

On Tue, Nov 25, 2003 at 01:43:19PM -0800, steven wagner wrote:

I'm sorry to report that you should be getting metric data back on Tru64. Sadly, I can't offer any developmental support here now because all our Alpha are belong to dumpster (although for the record, I am the one to blame for the monitoring core running on Tru64 to begin with... sorry about that!).

The metrics aren't being reported by the monitoring core. Either something went wrong with the build (just because it compiled doesn't mean it really works...) or something is wrong at runtime. To check the runtime side, run the monitoring core in debug mode and see what kind of data you get out of it.

I'm not familiar with the Tru64 code, but a number of metric
implementations require that you be root to run them.  This means they
fail in interesting ways with the default behavior of running as nobody.
It might be worth a try to use the config file to cause gmond to run as
root all the time and see if that fixes the problem.

-- Brooks


Running gmond as root (i.e. no_setuid on) does not seem to make a
difference (although, if it did, I don't think that would be such a
great idea).
I did have a small problem compiling machine.c -> machines/osf.c.
There was a line break inside a string literal on line 163 (I believe).
Here is a trivial patch:

diff -ur ganglia-monitor-core-2.5.5-orig/gmond/machines/osf.c ganglia-monitor-core-2.5.5-mod/gmond/machines/osf.c
--- ganglia-monitor-core-2.5.5-orig/gmond/machines/osf.c    2002-09-06 15:06:15.000000000 -0400
+++ ganglia-monitor-core-2.5.5-mod/gmond/machines/osf.c 2003-11-24 15:07:14.000000000 -0500
@@ -163,8 +163,7 @@
     {
       alpha = 0.5 * (timediff / 30.0e7);
       beta = 1.0 - alpha;
-      debug_msg("* * * * Setting alpha to %f and beta to %f because timediff
-= %d",alpha,beta,timediff);
+      debug_msg("* * * * Setting alpha to %f and beta to %f because timediff = %d",alpha,beta,timediff);
     }
   else
     {
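Standard C does not allow an unescaped newline inside a string literal, which is why the unpatched line fails to build. Joining the lines, as the patch does, fixes it. Another option that keeps the source line short is adjacent string literal concatenation: the compiler joins neighboring literals into one before compiling. The sketch below uses `snprintf` as a stand-in for gmond's `debug_msg()`. `format_alpha_beta` and `split_literal_matches_patch` are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats the osf.c alpha/beta debug line into out.
 * The two adjacent literals are joined by the compiler into one format
 * string, so the source can wrap without a newline inside a literal. */
int format_alpha_beta(char *out, size_t len, double alpha, double beta, int timediff)
{
    return snprintf(out, len,
                    "* * * * Setting alpha to %f and beta to %f "
                    "because timediff = %d",
                    alpha, beta, timediff);
}

/* Returns 1 when the split literal is byte-for-byte identical to the
 * one-line literal used in the patch. */
int split_literal_matches_patch(void)
{
    static const char split[] = "* * * * Setting alpha to %f and beta to %f "
                                "because timediff = %d";
    static const char joined[] =
        "* * * * Setting alpha to %f and beta to %f because timediff = %d";

    return sizeof split == sizeof joined &&
           memcmp(split, joined, sizeof split) == 0;
}
```

Either form produces the same single-line message.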


I will try to rebuild and watch more closely for other problems.

Running with -d2, here is a snippet of the output:

pthread_attr_init
creating cluster hash for 2 nodes
hash_create size = 2
hash->size is 3
gmond initialized cluster hash
Using multicast-enabled interface alt0
mcast listening on 239.2.11.71 8649
XML listening on port 8649
listening thread(s) have been started
mcast_listen_thread() started 26375680
mcast_listen_thread() started 15824384
listening thread(s) have been started
cleanup thread has been started
multicasting on channel 239.2.11.71 8649
created monitor thread
set_metric_value() exec'd cpu_num_func (1)
set_metric_value() exec'd cpu_speed_func (2)
set_metric_value() exec'd mem_total_func (3)
set_metric_value() exec'd swap_total_func (4)
set_metric_value() exec'd boottime_func (5)
set_metric_value() exec'd sys_clock_func (6)
set_metric_value() exec'd machine_type_func (7)
set_metric_value() exec'd os_name_func (8)
set_metric_value() exec'd os_release_func (9)
set_metric_value() exec'd gexec_func (25)
set_metric_value() exec'd heartbeat_func (26)
my start_time is 1069797762
set_metric_value() exec'd mtu_func (27)
set_metric_value() exec'd location_func (28)
my location is unspecified
mcast_value() mcasting cpu_num value
encoded 8 XDR bytes
XDR data successfully sent
mcast_value() mcasting cpu_speed value
encoded 8 XDR bytes
XDR data successfully sent
mcast_value() mcasting mem_total value
encoded 8 XDR bytes
XDR data successfully sent
mcast_value() mcasting swap_total value
encoded 8 XDR bytes
XDR data successfully sent
mcast_value() mcasting boottime value
encoded 8 XDR bytes
XDR data successfully sent
mcast_value() mcasting sys_clock value
encoded 8 XDR bytes
XDR data successfully sent
mcast_value() mcasting machine_type value
encoded 16 XDR bytes
XDR data successfully sent
mcast_value() mcasting os_name value
encoded 12 XDR bytes
XDR data successfully sent
mcast_value() mcasting os_release value
encoded 12 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_user_func (10)
mcast_value() mcasting cpu_user value
encoded 8 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_nice_func (11)
mcast_value() mcasting cpu_nice value
encoded 8 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_system_func (12)
mcast_value() mcasting cpu_system value
encoded 8 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_idle_func (13)
* * * * Setting alpha to 0.000147 and beta to 0.999853 because timediff = 0
CPU: Just ran table().  Got:  usr 6630 , nice 0 , sys 32021 , idle 18884810, 1024hz.
CPU:--before-------------------------------------------------------------
CPU cycles:
CPU:   now: 6630 , 0, 32021, 18884810   old:  6626 , 0 , 32014 , 18884803 diffs: 0, 0, 0, -1073217344
CPU:i is 0 : new - old = difference, delta  6630 - 6626 = 4,4
CPU:i is 1 : new - old = difference, delta  0 - 0 = 0,4
CPU:i is 2 : new - old = difference, delta  32021 - 32014 = 7,11
CPU:i is 3 : new - old = difference, delta  18884810 - 18884803 = 7,18
CPU:percentages - half_total is 9, total_change is 18
CPU:--after--------------------------------------------------------------
CPU cycles:
CPU:   later: 6630 , 0, 32021, 18884810   old:  6630 , 0 , 32021 , 18884810 diffs: 4, 0, 7, 7
CPU: ** ** ** ** ** Are percentages electric?  Try user 222%, nice 0% , sys 389% , idle 389%

....

and much more...
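The "Are percentages electric?" line is consistent with a top(1)-style percentages() routine that reports each CPU state in tenths of a percent. The log shows diffs of 4, 0, 7, 7 ticks, total_change 18 and half_total 9. Under that reading, (diff * 1000 + half_total) / total_change gives 222, 0, 389, 389, i.e. 22.2%, 0%, 38.9%, 38.9%. If so, the CPU figures themselves are plausible and only the "%" label in the debug message is misleading. This is an inference from the numbers in the log, not a reading of osf.c. The function name below is hypothetical.

```c
#include <assert.h>

/* Sketch of a top(1)-style percentages() calculation.
 * now[] and old[] hold cumulative tick counters (e.g. user, nice, sys, idle).
 * On return, diffs[] holds the per-state change, old[] is updated to now[],
 * and out[] holds each state's share in tenths of a percent (222 == 22.2%).
 * Counter wraparound is not handled in this sketch. */
long cpu_states_tenths(int cnt, int *out, const long *now, long *old, long *diffs)
{
    long total_change = 0;
    long half_total;
    int i;

    assert(cnt > 0);
    for (i = 0; i < cnt; i++) {
        diffs[i] = now[i] - old[i];   /* ticks spent in state i this interval */
        total_change += diffs[i];
        old[i] = now[i];              /* remember this sample for next time */
    }

    if (total_change == 0)
        total_change = 1;             /* avoid dividing by zero */
    half_total = total_change / 2;    /* adding this rounds to nearest */

    for (i = 0; i < cnt; i++)
        out[i] = (int)((diffs[i] * 1000 + half_total) / total_change);

    return total_change;
}
```

Fed the "now" and "old" counters from the log, this returns 18 and fills out[] with 222, 0, 389, 389, which sum to 1000 (100.0%).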

--
Any statement of the form "X is the one, true Y" is FALSE.
PGP fingerprint 655D 519C 26A7 82E7 2529  9BF0 5D8E 8BE9 F238 1AD4