[Ganglia-general] multiple gmetads polling single gmond

2008-04-25 Thread Ben Hartshorne
Hi,

I have a rather large set of machines I have ganglia watch (~6000), and
am trying to build out a resilient infrastructure.  I ran into an
interesting problem.

I am using gmond version 3.0.2.200511011714 (as reported by --version)

Basic layout - each location (~2000 machines) has a pair of hosts to
which they send their metrics (unicast).  There are a pair of machines
that connect to gmond on each of the edge collectors and centralize the
data (they connect via TCP to port 8649).  We also have another pair of
machines that connect to each edge gmond and grab the current XML dump
for integration with  Nagios (the script is called parse_ganglia for
future reference).

This worked nicely for quite a while, until one of our edge hosts got
too many reportees.  There was a connection timeout in parse_ganglia of
5 seconds, so that when one of the edge hosts was down it would move on
to the other edge hosts quickly rather than waiting 60s for the down
host.  When one of the hosts got too many reportees, it started to take
~6s to transfer all the data.  At this point, one or the other of the
pair of hosts running parse_ganglia started failing on the edge host
that had too many reportees.  

Using tcpdump, I found that though gmond was accepting the connection
from both of them, it would only send data to one at a time, and it
would finish sending data to the first before moving on to the second.  So:
* host a connects
* host a starts getting data
* host b connects (3-way handshake complete) but no data flows
* host a finishes getting data
* host b starts getting data
* host b finishes getting data

We solved the immediate problem by increasing the timeout from 5 to
15s, but I was a little surprised that gmond behaved in this
seemingly-single-threaded manner.
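
The difference between a connect timeout and the time needed to read the whole dump can be sketched in Python roughly as follows. This is a minimal sketch and not the actual parse_ganglia code: the function names `fetch_gmond_xml` and `fetch_from_any` are made up for illustration, and only port 8649 and the 5s/15s figures come from the message above.

```python
import socket
import time


def fetch_gmond_xml(host, port=8649, connect_timeout=5.0, total_timeout=15.0):
    """Read the full XML dump from a gmond TCP port, or raise on timeout."""
    deadline = time.monotonic() + total_timeout
    chunks = []
    # A short connect timeout lets us give up quickly on a host that is down.
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("dump took longer than %.1fs" % total_timeout)
            # A separate, longer budget covers waiting in line and the transfer.
            sock.settimeout(remaining)
            data = sock.recv(65536)
            if not data:  # the server closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def fetch_from_any(hosts, **kwargs):
    """Try each edge collector in turn; return (host, xml) from the first success."""
    last_error = None
    for host in hosts:
        try:
            return host, fetch_gmond_xml(host, **kwargs)
        except OSError as exc:  # refused, unreachable, or timed out
            last_error = exc
    raise RuntimeError("no edge collector answered: %s" % last_error)
```

Keeping the two timeouts separate means a dead host is still skipped after about 5 seconds, while a live host that is busy serving another client gets the longer budget.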

While it's easy for us to adjust the timeout in our python
parse_ganglia, it is not so easy to poke at gmetad, and I am worried
about what will happen when we have variations in network quality, more
hosts requesting metrics, etc.  

Is it true that gmond is single threaded in its network operations?  Or
maybe just the listener?  What other effects might this have?  

Would it make sense to change gmond so it passes off dumping the XML
feed to a child thread so that multiple simultaneous connections can be
handled?
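
To illustrate the idea in that last question, here is a small Python sketch of a listener that hands each connection to its own thread, so a second client starts receiving data while the first is still reading. gmond itself is written in C and this is not its code; the class names, `start_server`, and the placeholder payload are assumptions made purely for illustration.

```python
import socketserver
import threading

XML_DUMP = b"<GANGLIA_XML></GANGLIA_XML>\n"  # placeholder payload, not real gmond output


class DumpHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Runs in its own thread per connection, so a slow reader
        # does not hold up the next client.
        self.request.sendall(XML_DUMP)


class ThreadedDumpServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    allow_reuse_address = True
    daemon_threads = True


def start_server(host="127.0.0.1", port=0):
    """Start the listener in a background thread and return the server object."""
    server = ThreadedDumpServer((host, port), DumpHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The cost of this design is that every dump has to read shared metric data safely while other threads may be updating it.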

Thanks for your time,

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net


Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general


Re: [Ganglia-general] Block device I/O bandwidth

2008-03-26 Thread Ben Hartshorne
On Mon, Mar 24, 2008 at 03:07:37PM -0700, Bernard Li wrote:
 Hi guys:
 
 I am curious as to what folks usually do to measure block device I/O
 bandwidth (MB/s) with their Ganglia installation.  Talking
 specifically about disk I/O, do you guys usually just use the output
 of iostat -k or something like that?

confirming the response from many people on this list, I use iostat -x
as well.  While the gmetric plugin library was down (is it back up?) I
created http://ben.hartshorne.net/ganglia/ which includes two
crontab-ready shell scripts to grab different bits of data from iostat
and stuff them into ganglia. (disk_gmetric.sh and disk_wait_gmetric.sh)

enjoy,

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




[Ganglia-general] host spoofing, gmetric, and gmond

2008-01-24 Thread Ben Hartshorne
Hi All,

I'm trying to get ganglia metrics from some hosts that are hiding behind
a NAT box.  Obviously, since ganglia identifies the sender using reverse
DNS on the sending host, this does not work.

I have read about the host spoofing patch, and have three questions:
* does it actually spoof the IP address in the IP header, or does it
  insert some extra information into the XML stream saying 'hey, I'm
  spoofing this other computer, ignore my actual IP address'? [1]
* does the spoofing only work in gmetric, or is there a way to ask gmond
  to spoof addresses using the same logic?
* Is there some reason it would be a bad idea to have *every* reporting
  host spoof their own IP address?  Is there a big performance hit or
  anything?  Because I'd almost rather just have every host report who
  they are in the stream and then I don't need to worry about the
  network layout nearly so much.

Thanks,

-ben

[1] If the former, it will get squished by the NAT and won't work.  If
the latter, it will get through the NAT and all will be well.  I'm
guessing it's the latter, because otherwise you wouldn't be able to use
TCP to send the information (since the handshake would never complete).

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] host spoofing, gmetric, and gmond

2008-01-24 Thread Ben Hartshorne
On Thu, Jan 24, 2008 at 11:01:35AM -0800, Matthias Blankenhaus wrote:
  I'm trying to get ganglia metrics from some hosts that are hiding behind
  a NAT box.  Obviously, since ganglia identifies the sender using reverse
  DNS on the sending host, this does not work.

 Wrt to your NAT problem:  I don't really see your problem.  Can't you have 
 one gmond behind the NAT that all other machines behind that NAT point to? 

Though I agree that would be the best solution, due to the network
architecture I can't add any hosts to the area behind the NAT, nor can I
add any load to the existing hosts.  :(

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] RRDs in memory

2007-07-13 Thread Ben Hartshorne
On Wed, Jul 11, 2007 at 03:34:41PM -0400, Ofer Inbar wrote:
 gmetad is very write-intensive, because it updates hundreds of RRD
 files about every minute or two.  Has anyone tried running it with
 the rrd directory on a RAM disk (tmpfs) ?

I'll toss in my $.02 here as well, though many people have already said
the same thing.

I created a ramdisk when my cluster grew beyond ~50 nodes (I report a
lot of extra statistics).  I use an actual ramdisk instead of tmpfs.  I
chose it out of ignorance when I first set it up, but wikipedia[*] says
that tmpfs might swap to disk whereas ramfs is just straight up in
memory, nothing fancy.

Instead of reconfiguring ganglia to keep the repositories in
/mnt/ram0/rrds or mounting the ramdisk in /var/lib/ganglia/rrds, I
mounted the ramdisk in /mnt/ram0 and made /var/lib/ganglia/rrds a
symlink to /mnt/ram0/rrds.  Just my preference...

I wrote a new script to drop in /etc/init.d/ called, inventively enough,
setup_gmetad_ramdisk, which starts before gmetad and stops after it.  It
creates the ramdisk, formats it, and copies over the backed up rrds.
When stopped, it backs up the rrds.  Theoretically, this should make
system bootup and shutdown work the same as though it were on disk.
Unfortunately, I am missing some part of installing the stop script
correctly (in the right runlevel or something) so it doesn't actually
work on shutdown.  :(  I imagine the fix is pretty simple, but I haven't
bothered yet.

I had to edit grub.conf to adjust the size of the ramdisk.  By default
they're 64MB, but with an argument to the kernel start line, you can set
it to whatever size you need.  I chose 4x the current RRD directory, to
accommodate new hosts and more metrics.  It is unfortunate that a reboot
is required to change the size of the ramdisk.

I also set up a cronjob to backup the rrds themselves every hour, but
unlike the folks so far, instead of rsyncing or keeping just one copy, I
keep 8 days worth of hourly snapshots, so that if something goes wrong,
I can get back to a healthy snapshot.  (Note - I have never actually
used any snapshot further back than the most recent...  ;)  (Note2 - the
first version of this used 'find' to get anything 8d old, and it
started really tearing up the disk as the number of hosts/metrics grew.
Now I use perl to create the timestamp from 8 days ago and just rm the
directory.  This will fail if the host is down for more than an hour,
but that's OK by me.)
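
A rough Python sketch of the timestamp-based rotation described in Note2. The original is a cronjob using shell and perl; the `YYYYMMDD-HH` directory naming and the function names here are assumptions, not taken from the actual script.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path


def snapshot_name(when):
    """Name a snapshot directory after the hour it was taken (assumed scheme)."""
    return when.strftime("%Y%m%d-%H")


def rotate_snapshots(rrd_dir, backup_root, now=None, keep_days=8):
    """Copy the RRDs into a new hourly snapshot and drop the one from keep_days ago."""
    now = now or datetime.now()
    backup_root = Path(backup_root)
    new = backup_root / snapshot_name(now)
    shutil.copytree(rrd_dir, new, dirs_exist_ok=True)
    # Compute the expired directory name directly instead of walking the
    # whole backup tree looking for old files.
    expired = backup_root / snapshot_name(now - timedelta(days=keep_days))
    shutil.rmtree(expired, ignore_errors=True)
    return new, expired
```

As in the original, only the directory named for exactly 8 days ago is removed, so a snapshot whose removal run was missed stays on disk.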


The backup cronjob and new ramdisk start script are all available off my
website http://ben.hartshorne.net/ganglia/

-ben

[*] http://en.wikipedia.org/wiki/TMPFS

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Using cluster name to differentiate clusters?

2007-02-26 Thread Ben Hartshorne
On Mon, Feb 26, 2007 at 01:06:23PM -0600, Seth Graham wrote:
 Ben Hartshorne wrote:
 
 It seems to me that using the name to determine cluster membership would
 simplify things for the people configuring ganglia.
 
 It would, but when you have 3000+ machines all chattering on the same 
 port that's a lot of data for a machine to deal with. Not only do the 
 aggregating machines have to hold it all in memory, but the gmetad host 
 has to dump all that info into the rrds.

Isn't the machine going to have to handle exactly the same amount of
data, regardless of whether it's on one port or two?  I would imagine
that by the time your network got to 3000+ hosts, things would be
segregated in their own right, independent of ganglia.  Such segregation
would make it easy (and more logical) to use head nodes as aggregators
and then pass data up the tree to your main web interface.  Multicast
networks can be broken up by subnet or VLAN, and the unicast nodes can
use ganglia's ability to only pass on summary info, etc.  

Of course, I have not had the privilege of working with a cluster of
that size.  I've only got just over 100 hosts, so please forgive
anything that will become obvious as soon as I actually have to deal
with the problem...  ;)

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Ganglia web site makeover

2006-12-29 Thread Ben Hartshorne
On Thu, Dec 28, 2006 at 02:40:52PM -0800, Peter Mui wrote:
 Hi All (at ganglia-general):
 
 I've been talking to Matt Massie about re-doing the Ganglia website  
 at http://ganglia.info/
 
 We're open to any or all ideas at this point so feel free to chime in  
 with whatever comes to mind.

I actually like most aspects of the ganglia website.  Even so, feeling
the need to chime in, I give these three requests:

* take the Main Menu to a top bar, and make it consistent across all
  sections of the website.  My biggest website gripe is consistency -
  the entire site should have a consistent look-and-feel.  A global menu
  is one step towards this goal.

* Fix the gmetric repository.  There are a whole bunch of neat gmetrics
  floating around, and several different versions of metrics that do
  basically the same thing.  I even went so far as to start my own
  gmetric page because I didn't like most of the metrics there.
  Modeling the gmetrics section after the Firefox extensions page would
  be awesome - allow ratings to float the best to the top, but also
  categorize so alternatives within a particular metric (say, taking
  detailed disk metrics) are available.

* Fix the Documentation section and add a FAQ, wiki style.  There's a
  lot of good info out there, but it's hard to get to.  A nice section
  of a wikiable FAQ would be complete sample configurations (actually
  taken from practice that *really* work, rather than ways you should be
  able to set it up...  ;)

I realize that some of these things are rather grandiose in scope and I
<hides face> haven't actually offered my help... </hides face>.  What can
I say.  Here's hoping for the best.  :)

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Q: IO metrics in Ganglia

2006-11-27 Thread Ben Hartshorne
On Sun, Nov 26, 2006 at 11:38:46AM +0200, Vitaly Karasik wrote:
 I noticed that there is no Ganglia disk I/O metrics for Linux and MS
 Windows platforms?
 Can you recommend me some tools/plugins for collecting IO metrics?
 (except of writing a custom scripts around iostat)

I did write a custom script to surround iostat because I wanted some
metrics beyond what is presented by default.  Specifically, I
was interested in the %util metric from iostat.  

You can find that script at http://ben.hartshorne.net/ganglia/ to use or
modify to suit your environment.
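
For readers who want the general shape of such a wrapper, here is a minimal Python sketch (the scripts on the site above are shell). It pulls %util out of `iostat -x` output and builds a gmetric command line. The column is located by header name because the column layout differs between sysstat versions; the metric name `disk_util_<device>` and both function names are assumptions.

```python
def parse_iostat_util(text):
    """Return {device: %util} from `iostat -x` output, finding the column by name."""
    util = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            columns = fields
            continue
        if columns and "%util" in columns and len(fields) == len(columns):
            util[fields[0]] = float(fields[columns.index("%util")])
    return util


def gmetric_command(device, value):
    """Build the argument list for reporting one device's utilisation."""
    return [
        "gmetric",
        "--name", "disk_util_%s" % device,  # hypothetical naming scheme
        "--value", "%.2f" % value,
        "--type", "float",
        "--units", "percent",
    ]
```

From cron, each list returned by `gmetric_command` could be handed to `subprocess.call`. Note that the first report printed by iostat is an average since boot, so a real script would normally sample over an interval and use the last report.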

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Q: is it possible to see a specific day for example, from last week

2006-11-16 Thread Ben Hartshorne
On Sun, Nov 12, 2006 at 03:23:34PM +0200, Vitaly Karasik wrote:
 Is there some Ganglia version (beta/patched) which allow me to see a
 graphs for specific day from a last week, for example?

Vitaly,

A while ago (May 8th, 2006), there was a thread in which a user offered
patches to ganglia to provide this functionality.  For help searching
through the archives, the subject line was "Graph templates (custom
graphs)".

Those patches are available at http://wtf.ath.cx/screenshots.html.  I
had some trouble getting them to work (had to upgrade RRDTool), but I
eventually did and have found the extra functionality useful on
occasion.  Of course, you must remember that the RRD decreases
resolution over time, and so depending on how far back you look, the
graphs become less and less useful.  But that's just something we accept
using RRD as the backend storage for Ganglia.
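
The loss of resolution follows from how each round-robin archive (RRA) is defined: it stores a fixed number of rows, and each row consolidates a fixed number of base steps. A small sketch of that arithmetic, where the 15-second step and the three archive shapes are illustrative assumptions rather than values read from any particular gmetad configuration:

```python
def rra_coverage(step_seconds, steps_per_row, rows):
    """Return (seconds per stored point, total seconds covered) for one RRA."""
    resolution = step_seconds * steps_per_row
    return resolution, resolution * rows


# Hypothetical archives: the further back an archive reaches,
# the coarser each stored point becomes.
for steps_per_row, rows in [(1, 244), (24, 244), (168, 244)]:
    resolution, span = rra_coverage(15, steps_per_row, rows)
    print("one point per %d s, covering %.1f hours" % (resolution, span / 3600.0))
```

So a graph of a single day from last week can only be drawn from whichever archive still reaches back that far, at that archive's resolution.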

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] problem about gmond,help me~~~

2006-11-07 Thread Ben Hartshorne
On Mon, Nov 06, 2006 at 09:50:41AM -0500, Rick Mohr wrote:
  but when I configured this node's udp_send_channel to itself.
  it works OK.But if that, I can only get one node infomation.
  and one issue  confused me is that  why  the two node can only send
  udp_send_channel to itself?
  If I changed the udp_send_channel to other node,it doesn't work.why this
  happened?

I hope you've already checked this - do you have a firewall enabled on
either host blocking the packets?  In the output of '/sbin/iptables
-nL', you should see something like this:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
... excerpted
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0           state NEW udp dpt:8649

If you don't have a rule accepting ganglia data (and your firewall isn't
completely open), the traffic will be blocked.

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] about unicast configuration

2006-11-03 Thread Ben Hartshorne
On Fri, Nov 03, 2006 at 09:32:34AM -0800, khaja mohideen wrote:
 Hi,
 
 I have installed and configured Ganglia to monitor my cluster env.
 My Env. consists of about 400 systems.  Currently i have configured
 with default multicast support.
 
 
 I am in need of unicast support to reduce the multicast traffic and
 also to get gmond stats from other subnets.  I tried with docs
 
 Could any one help in giving a simple config example for one to
 one unicast configuration.

Khaja,

In my environment (unicast, multiple subnets, one aggregator host):
in /etc/gmond.conf
udp_send_channel {
  host = 10.20.30.40
  port = 8649
}
udp_recv_channel {
  port = 8649
}
(other non-unicast portions of the config file deleted)

in /etc/gmetad.conf 
data_source "mycluster" localhost
eof

-ben



-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Problem with metrics

2006-09-22 Thread Ben Hartshorne
On Wed, Sep 20, 2006 at 08:35:19PM +0300, Alex Balk wrote:
 Hi,
 
 
 As far as I recall, the web frontend's code creates a list available
 metrics from the first node in the list.
 
 If this node doesn't have any gmetrics, then only the builtin metrics
 will appear in the menu.
 
 Once you choose one of the metrics, the nodes are sorted based on it (in
 your case, the first node will now be the one with the highest count on
 bytes_out). I suspect that this node doesn't have any gmetrics.
 
 
 Can you confirm/dispute this?

This is the default behavior.  Someone posted a patch to this list a
while ago to change it so that it displays all metrics.  
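
The difference between the two behaviours can be shown with a short Python sketch run against the XML feed. The real frontend is PHP and this is not its code; the sketch only assumes the feed's `HOST` and `METRIC` elements and their `NAME` attribute, and the function names are invented.

```python
import xml.etree.ElementTree as ET


def metrics_from_first_host(xml_text):
    """Default behaviour as described: only the first host's metrics are listed."""
    first = ET.fromstring(xml_text).find(".//HOST")
    if first is None:
        return []
    return sorted(m.get("NAME") for m in first.iter("METRIC"))


def metrics_from_all_hosts(xml_text):
    """Patched behaviour: the union of metric names across every host."""
    root = ET.fromstring(xml_text)
    return sorted({m.get("NAME") for m in root.iter("METRIC")})
```

With the union, a gmetric reported by only a few hosts still appears in the menu no matter which host happens to sort first.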

I'm continuing my trend of being really helpful by not knowing when it
was or who sent it. :)  But it's there!  

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




[Ganglia-general] windows CPU I/O wait metric

2006-08-23 Thread Ben Hartshorne
Hi,

I've installed the windows version of gmond and it constantly reports
about 70% I/O wait in the CPU report.  The server is not actually
experiencing very much disk activity.  I remember a bug from several
versions ago that was similar (though I don't remember the details).  Is
this the same thing?  Any ETA on a windows build for a more recent
version?

Thanks,

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Obtaining Immediate Interval Data From Ganglia

2006-08-09 Thread Ben Hartshorne
On Tue, Aug 08, 2006 at 04:22:41PM +0100, Ian Wootten wrote:
 I am facing a problem in that I would like short-segment up to date 
 information from ganglia in order to monitor services after invocation.

One method I have heard of that achieves something similar: write a
separate module that interprets the XML feed directly.  This would allow
you to completely control the resolution and time frame for the data you
need.  In the implementation I heard, this module was called by Nagios
and would send out an alert if it sensed a problem.  I was actually
quite impressed because it means that Nagios doesn't need to run 80
bajillion processes to monitor many many hosts.  Instead, it just
listens to the XML stream from ganglia and notices when a host drops
off the map or a metric goes out of where it's supposed to be.  
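
A minimal Python sketch of that kind of check, under stated assumptions: it relies on the `TN` (seconds since last report) and `TMAX` attributes of `HOST` elements and the `NAME`/`VAL` attributes of `METRIC` elements in the XML feed, and it treats a host as down when `TN` exceeds 4 x `TMAX`, which is an assumed threshold. The `limits` dictionary and the function name are hypothetical, and this is not the implementation referred to above.

```python
import xml.etree.ElementTree as ET


def check_feed(xml_text, limits, down_factor=4):
    """Return a list of (host, what, detail) problems found in one XML dump."""
    problems = []
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        name = host.get("NAME")
        # Host has "dropped off the map" if it has not reported recently.
        if int(host.get("TN", 0)) > down_factor * int(host.get("TMAX", 20)):
            problems.append((name, "host", "no recent report"))
            continue
        for metric in host.iter("METRIC"):
            metric_name = metric.get("NAME")
            if metric_name in limits:
                low, high = limits[metric_name]
                value = float(metric.get("VAL"))
                if not low <= value <= high:
                    problems.append((name, metric_name, value))
    return problems
```

A single pass over one dump covers every host in the cluster, which is where the saving in monitoring processes comes from.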

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] install and configure ganglia

2006-07-17 Thread Ben Hartshorne

Toney,

Have you verified that rrdtool itself works?  It may not be a problem
with ganglia you're looking at but a problem with rrdtool itself.

I'm not sure what might show up in the apache logs, but that may be a
good place to look as well.  

You should be able to find your rrd databases for ganglia on the node
running gmetad (which may be all the nodes).  My installation puts them
in /var/lib/ganglia/rrds, but I'm not sure if that's the default
location.

Assuming they are there...

bash$ cd /var/lib/ganglia/rrds/__SummaryInfo__/
bash$ rrdtool graph /tmp/foo.png --end now --start end-12s \
 --width 400 DEF:myline=cpu_nice.rrd:sum:AVERAGE \
 LINE1:myline#FF0000:'foo\n'
bash$ file /tmp/foo.png
/tmp/foo.png: GIF image data, version 87a, 480 x 155
bash$ xv /tmp/foo.png #or some other way of viewing it

If foo.png is a real graph, then you have verified that rrdtool is
working correctly.  If you cannot get rrdtool to create a graph for you,
you should investigate why it is not working correctly before continuing
to troubleshoot ganglia.

-ben


On Mon, Jul 17, 2006 at 12:13:25PM +0530, toney samuel wrote:
 Hi, i have specified the path as per your instruction but still i am not
 getting and graph in the web page.
 
 On 7/15/06, matt massie [EMAIL PROTECTED] wrote:
 
 toney samuel wrote:
  Hi i have installed as per the instructions on this link
 
  http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia
 
  I am able to get the ganglia page and also the status of my node. I am
  not getting the graphics ( graphs ). i have installed rrdtool in
  /usr/local/rrdtool-1.2.3 i have also specified the rrdtool path in
  /var/www/html/ganglia/conf.php
 
  define("RRDTOOL", "/usr/local/rrdtool-1.2.3");
 you are so close.  RRDTOOL is not the path to the directory but rather
 the path to the binary.
 
 define("RRDTOOL", "/usr/local/rrdtool-1.2.3/bin/rrdtool");
 
 should work for you.
 
 good luck
 -matt
 
 
 
 
 
 

 


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] counters for gmetric

2006-07-14 Thread Ben Hartshorne
On Thu, Jul 13, 2006 at 10:44:46AM -0700, dan c wrote:
 
  I'm trying to use gmetric to record data from several counters that
  do not reset after they've been read. I tried setting the slope to
  negative, positive and both, but all three produce identical output.
  Any ideas?

My solution was to make a directory /var/lib/ganglia/metrics/ and put
state files there.  By recording the time and the value the last time
gmetric was run, you can calculate the average change per unit time
without predefining the time period.  I usually use 2 minutes (from
cron), but of course the time period you should use is determined by your
data resolution requirements, as well as how the data changes.  Also,
it's nice to be able to change it without touching either the statefile
or the script, and it deals with changes in load that might cause your
script to run something other than *exactly* every 2 minutes.  (Mine
still crash when the load gets so high that several instantiations of the
script build up without being able to run, and then run all at once;
they complain if the time diff is zero.)

Look at just about any of the scripts in
http://ben.hartshorne.net/ganglia/ for examples of how I have done this.
Some of the metrics that I measured that don't reset include the mysql
Questions count (to calculate queries per second) and the disk
activity straight from /proc (for which you also need to deal with your
counter looping back to zero after hitting its max bound).
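
The state-file technique can be sketched in Python as follows (the scripts linked above are shell). The JSON state format, the function name, and the 32-bit wrap value are assumptions. Unlike the scripts described above, this sketch returns nothing instead of failing when the time difference is zero.

```python
import json
import os
import time


def counter_rate(state_path, value, now=None, max_value=2**32):
    """Return the counter's average change per second since the last call.

    Returns None on the first run, or when no time has passed.
    """
    now = time.time() if now is None else now
    previous = None
    if os.path.exists(state_path):
        with open(state_path) as handle:
            previous = json.load(handle)
    # Always record the current sample for the next run.
    with open(state_path, "w") as handle:
        json.dump({"time": now, "value": value}, handle)
    if previous is None:
        return None
    elapsed = now - previous["time"]
    if elapsed <= 0:
        return None  # several runs fired at once; avoid dividing by zero
    delta = value - previous["value"]
    if delta < 0:
        delta += max_value  # counter wrapped back to zero (assumed 32-bit)
    return delta / elapsed
```

Because the elapsed time is measured rather than assumed, the cron interval can change without touching the script, and the returned rate is what gets passed to gmetric.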

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] counters for gmetric

2006-07-14 Thread Ben Hartshorne
On Fri, Jul 14, 2006 at 10:12:32AM -0700, dan c wrote:
 
  Ben, thanks for your reply. I've considered keeping state
  files and calculating the differences, but this doesn't
  scale very well for a large number of hosts. RRDTool

What doesn't scale?  The calculation is done on the host sending the
statistic, not the host receiving the statistic, so each host does its
own little thing and the collectors get the same data they otherwise
would.  The load is distributed, so the addition of another host doesn't
increase the load on any collector (other than the impact it will
obviously have, being another host).

I currently use this technique for 80 hosts * 5 stats * 2 minutes.  

OTOH, I agree that it'd be great if you didn't have to worry about it
and rrdtool just did the Right Thing(TM).

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] install and configure ganglia

2006-07-11 Thread Ben Hartshorne
On Tue, Jul 11, 2006 at 03:52:03PM +0530, toney samuel wrote:
 i have downloaded ganglia-3.0.3.tar.gz but i am not clear how to get it
 working, do i need to install and apache. pls tell me if i want to download
 any other packages.

Hi Toney,

I would recommend starting here:
http://ganglia.info/docs/ganglia.html#installation

You will need to have apache installed on your head node, where you want
to view the stats through the web-based interface.

You also need to have some other packages installed, such as rrdtool.  I
believe the installation process will stall at certain parts if the
required packages are not there.  It will prompt you to install what you
need.  

Good luck, and please mail back if you have more specific questions,

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Graph templates (custom graphs)

2006-05-09 Thread Ben Hartshorne
On Sun, May 07, 2006 at 11:28:46PM +0300, Alex Balk wrote:
 Hi all,
 
 I've done some work extending my patch for custom graphs and it now
 includes the following features:
 ...

Alex,

this looks awesome.  I can't tell you the number of times I wanted
something like this.  Thank you!

Now for the rest...  it doesn't work for me.  ;)

The patch applied just fine, but I get php errors in my error_log and
the image that appears is broken (i.e. broken image symbol from the
browser, not an image that is somehow not correct).

The errors are as follows:

when i click on 'custom graph'...

[client 192.168.25.9] PHP Notice:  Use of undefined constant referer_url - 
assumed 'referer_url' in /var/www/html/ganglia/custom_graph_interface.php on 
line 109, referer: http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Use of undefined constant sec - assumed 
'sec' in /var/www/html/ganglia/custom_graph_interface.php on line 476, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined variable:  time_range in 
/var/www/html/ganglia/custom_graph_interface.php on line 477, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:   in 
/var/www/html/ganglia/custom_graph_interface.php on line 477, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 508, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 541, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 581, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 611, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 631, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 651, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 675, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined variable:  option_1_selected in 
/var/www/html/ganglia/custom_graph_interface.php on line 682, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 712, referer: 
http://localhost:8080/ganglia/?
[client 192.168.25.9] PHP Notice:  Undefined index:  interface_mode in 
/var/www/html/ganglia/custom_graph_interface.php on line 755, referer: 
http://localhost:8080/ganglia/?

when i have filled out the values and want to create the graph (in basic mode):

[client 192.168.25.9] PHP Notice:  Undefined variable:  opt_cmdline in 
/var/www/html/ganglia/custom_graph_rendering.php on line 138, referer: 
http://localhost:8080/ganglia/custom_graph_processing.php
[client 192.168.25.9] PHP Notice:  Undefined variable:  metrics_cmdline in 
/var/www/html/ganglia/custom_graph_rendering.php on line 182, referer: 
http://localhost:8080/ganglia/custom_graph_processing.php
[client 192.168.25.9] PHP Notice:  Undefined variable:  legend_header in 
/var/www/html/ganglia/custom_graph_rendering.php on line 261, referer: 
http://localhost:8080/ganglia/custom_graph_processing.php
ERROR: unknown function 'VDEF'

There are also lots of errors as I'm filling out the form, but i figure
those might be less important.  

Am I missing a required dependency?  ganglia works fine in this
installation.  

Thanks,

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Graph templates (custom graphs)

2006-05-09 Thread Ben Hartshorne
On Tue, May 09, 2006 at 09:51:52PM -0700, Ben Hartshorne wrote:
 On Sun, May 07, 2006 at 11:28:46PM +0300, Alex Balk wrote:
  Hi all,
  
  I've done some work extending my patch for custom graphs and it now
  includes the following features:
  ...
 
 Alex,
 
 this looks awesome.  I can't tell you the number of times I wanted
 something like this.  Thank you!
 
 Now for the rest...  it doesn't work for me.  ;)
 
 The patch applied just fine, but I get php errors in my error_log and
 the image that appears is broken (i.e. broken image symbol from the
 browser, not an image that is somehow not correct).
 
 The errors are as follows:

another note - in my installation, I have three data sources for my
gmetad.  The grid has three clusters.  When I enter the custom graph
pages, it says 'custom graph for: unspecified'.  If I remove two of the
three sources, it says 'custom graph for: ksjc' (the remaining cluster).

does this not work with more than one cluster?

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




[Ganglia-general] Metric pull-down menu not showing all metrics

2006-05-04 Thread Ben Hartshorne
All,

I am curious how the Metric menu in the cluster view gets populated.  I
have a number of my hosts reporting metrics that the others don't.   For
example, the hosts that are running mysql and replicating from a
different database report how many seconds they are behind their master,
but only 10 out of my 30 hosts run mysql.  

The mysql_slave ganglia metric does not usually show in the Metric
pull-down menu.  

Previously, I had only one cluster, so clicking on 'Grid' just went
straight to the cluster.  For some reason, after clicking on 'Grid', I
could see all the metrics that are reported.  As soon as I chose a
metric, only some of the metrics were present in the Metric pull-down
menu.  I think only the metrics present on the first host in the cluster
list are present in the pull-down menu.  

Now I have more than one cluster in my grid, so clicking on Grid no
longer gives me all the metrics in the Metric menu.  I am now unable to
see my mysql_slave metric without manually typing it into the URL
string.
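To illustrate the difference described above, here is a minimal Python sketch that compares "metrics on the first host" with "metrics on any host" for a gmond-style XML dump. It is not the web frontend's actual code. The sample XML is trimmed to the CLUSTER/HOST/METRIC nesting with only NAME and VAL attributes, and the host names web1 and db1 are made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample of the XML a gmond/gmetad port serves.
# Real dumps carry many more attributes on every element.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.0.2" SOURCE="gmond">
 <CLUSTER NAME="ksjc">
  <HOST NAME="web1">
   <METRIC NAME="load_one" VAL="0.10"/>
   <METRIC NAME="bytes_in" VAL="1200"/>
  </HOST>
  <HOST NAME="db1">
   <METRIC NAME="load_one" VAL="0.42"/>
   <METRIC NAME="mysql_slave" VAL="3"/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>"""


def metrics_on_first_host(xml_text):
    """Metric names reported by the first HOST element only."""
    root = ET.fromstring(xml_text)
    first = root.find(".//HOST")
    if first is None:
        return []
    return sorted(m.get("NAME") for m in first.findall("METRIC"))


def metrics_on_any_host(xml_text):
    """Union of metric names across every HOST in every CLUSTER."""
    root = ET.fromstring(xml_text)
    return sorted({m.get("NAME") for m in root.iter("METRIC")})
```

With the sample above, mysql_slave is missing from the first list and present in the second, which matches the symptom: a menu built from one host hides metrics that only some hosts report.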

Suggestions?

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Altering metrics parameters

2006-05-04 Thread Ben Hartshorne
On Thu, May 04, 2006 at 01:00:33PM +0800, [EMAIL PROTECTED] wrote:
 
 G'day all
 
 Newbie type question but I can't seem to find a readily available answer.
 
 I'd like byte counts in and out from my nodes... but on other interfaces (eg.
 eth2 is used for gmond but I'm interested in eth0 and eth1).
 
 gmond.conf doesn't seem to offer such an option. Do I need to modify the
 source or do I run gmetric with an ifconfig+grep special?

I wrote a script to report individual interface values, and then edited
the ganglia PHP to display a network report (along with load, CPU,
memory, etc.) that showed each interface in a different color.

The script is at http://cryptio.net/~ben/ganglia/network_gmetric.sh 
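For readers who cannot fetch the shell script, the general idea can be sketched in Python. This is not a port of network_gmetric.sh. It assumes the Linux /proc/net/dev layout (two header lines, then one line per interface with receive bytes first and transmit bytes ninth) and reuses the eth0_rx / eth0_tx style names that appear in the graph.php changes below. The float type and bytes/sec units are choices made for this example.

```python
def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:       # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")  # also handles "eth0:1234" (no space)
        fields = rest.split()
        if len(fields) < 16:
            continue
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_rates(previous, current, seconds):
    """Turn two counter samples taken `seconds` apart into bytes/sec rates."""
    rates = {}
    for iface, (rx, tx) in current.items():
        if iface not in previous:
            continue
        prev_rx, prev_tx = previous[iface]
        if rx < prev_rx or tx < prev_tx:
            continue                         # counter wrapped or was reset
        rates[iface + "_rx"] = (rx - prev_rx) / seconds
        rates[iface + "_tx"] = (tx - prev_tx) / seconds
    return rates


def gmetric_commands(rates):
    """Build one gmetric argument list per rate, ready for subprocess.call()."""
    return [
        ["gmetric", "--name=" + name, "--value=%.0f" % value,
         "--type=float", "--units=bytes/sec"]
        for name, value in sorted(rates.items())
    ]
```

A cron job would read /proc/net/dev, compare it with the sample saved on the previous run, and pass each command list to subprocess.call().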

I think I only modified conf.php and graph.php.

I added the following lines to conf.php

-=-=-=-=-=-=-=-=-=8=-=--=-=-=-=-=-=8-=-=-=-=-=--
#
# Colors for the split network report graph
#
$total_rx_color = "FF";
$total_tx_color = "FF";
$eth0_rx_color = "33";
$eth0_tx_color  = "00FF00";
$eth1_rx_color = "FF00FF";
$eth1_tx_color  = "00";

-=-=-=-=-=-=-=-=-=8=-=--=-=-=-=-=-=8-=-=-=-=-=--

I made the following changes to graph.php

-=-=-=-=-=-=-=-=-=8=-=--=-=-=-=-=-=8-=-=-=-=-=--

[EMAIL PROTECTED]:/var/www/html/ganglia$ diff -c graph.php.orig graph.php
*** graph.php.orig  2005-05-09 11:27:45.0 -0700
--- graph.php   2005-09-27 16:20:26.0 -0700
***************
*** 18,25 ****
  # Assumes we have a $start variable (set in get_context.php).
  if ($size == "small")
  {
!   $height = 40;
!   $width = 130;
  }
  else if ($size == "medium")
  {
--- 18,25 ----
  # Assumes we have a $start variable (set in get_context.php).
  if ($size == "small")
  {
!   $height = 60;
!   $width = 200;
  }
  else if ($size == "medium")
  {
***************
*** 176,181 ****
--- 176,215 ----
           ."LINE2:'bytes_in'#$mem_cached_color:'In' "
           ."LINE2:'bytes_out'#$mem_used_color:'Out' ";
     }
+   else if ($graph == "split_network_report")
+      {
+         $style = "Split Network";
+
+         $lower_limit = "--lower-limit 0 --rigid";
+         $extras = "--base 1024";
+         $vertical_label = "--vertical-label 'Bytes/sec'";
+
+         $series = "DEF:'total_tx'='${rrd_dir}/network_tx.rrd':'sum':AVERAGE "
+            ."DEF:'total_rx'='${rrd_dir}/network_rx.rrd':'sum':AVERAGE "
+            ."DEF:'eth0_rx'='${rrd_dir}/eth0_rx.rrd':'sum':AVERAGE "
+            ."DEF:'eth0_tx'='${rrd_dir}/eth0_tx.rrd':'sum':AVERAGE "
+            ."DEF:'eth1_rx'='${rrd_dir}/eth1_rx.rrd':'sum':AVERAGE "
+            ."DEF:'eth1_tx'='${rrd_dir}/eth1_tx.rrd':'sum':AVERAGE "
+            ."LINE3:'total_tx'#$total_tx_color:'Total TX' "
+            ."LINE3:'total_rx'#$total_rx_color:'Total RX' "
+            ."LINE2:'eth0_tx'#$eth0_tx_color:'Eth0 TX' "
+            ."LINE2:'eth0_rx'#$eth0_rx_color:'Eth0 RX' "
+            ."LINE2:'eth1_tx'#$eth1_tx_color:'Eth1 TX' "
+            ."LINE2:'eth1_rx'#$eth1_rx_color:'Eth1 RX' ";
+      }
+   else if ($graph == "disk_report")
+      {
+         $style = "Disk";
+
+         $lower_limit = "--lower-limit 0 --rigid";
+         $extras = "--base 1024";
+         $vertical_label = "--vertical-label 'Blocks/sec'";
+
+         $series = "DEF:'disk_writes'='${rrd_dir}/disk_writes.rrd':'sum':AVERAGE "
+            ."DEF:'disk_reads'='${rrd_dir}/disk_reads.rrd':'sum':AVERAGE "
+            ."LINE2:'disk_writes'#$mem_cached_color:'Write' "
+            ."LINE2:'disk_reads'#$mem_used_color:'Read' ";
+      }
    else if ($graph == "packet_report")
       {
          $style = "Packets";

-=-=-=-=-=-=-=-=-=8=-=--=-=-=-=-=-=8-=-=-=-=-=--

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Metric pull-down menu not showing all metrics

2006-05-04 Thread Ben Hartshorne
Richard, Rick,

Thank you both for your replies.  Rick, either your mailer or mine decided
to wrap your patch at 72 chars wide, so rather than try to use `patch`
to apply it, I just applied it by hand.  I'm smarter than patch,
anyways.  ;)  Some of the line numbers were just a touch off, but close
enough that I think the patch would apply cleanly to the 3.0.3 code, if
one were to choose to do so.

It works like a charm!

Thanks,

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] A Java Virtual Machine probe

2006-03-28 Thread Ben Hartshorne
Miguel,

I look forward to trying this out.  You say 'Installation requires
ganglia to be installed using the source code.' - I am curious what
files you require.  I installed ganglia using an RPM, and would like to
bring in only the files I need.  Is this a bad idea?  What do you
recommend?  (and why would a compiled C program require sources to run
with other compiled programs?)

Thanks,

-ben

On Tue, Mar 28, 2006 at 05:17:29PM +0100, José Miguel Pereira Tavares wrote:
 
   Hi all!
 
   Sometime ago I enquired about the existence of a probe that would give 
 information on a running Java Virtual Machine (JVM). Unfortunately it's 
 a Linux only probe (for now at least).
 
   Now I am happy to present to the community a probe developed in C that 
 can monitor a JVM and report to a gmond. Its publishing of metrics is 
 similar to that of gmetric.
 
   It works by parsing the info at /proc/pid/status of the Linux tasks 
 relevant to the JVM (depending on the kernel the status file reports 
 just one process with threads or a set of tasks that represent 
 process/threads).
 
   Although this probe is now used for monitoring Java services it can be 
 used to monitor any other kind of process that has a long time span. 
 It's a kind of process oriented metrics.
 
   It was done in C for the usual question on intrusiveness. Memory 
 footprint is around 1.5 Mb Vss and it uses 0.1% of the CPU with 10 
 seconds of interval between samples.
 
   This probe can be downloaded at:
 http://student.dei.uc.pt/~mtavares/index.php?content=software
 
   Installation requires ganglia to be installed using the source code. 
 
   I would be glad to hear comments, bug reports and, even better, to 
 receive patches.
 
   Miguel Tavares
 
   
 -- 
 
 Until they become conscious they will never rebel,
  and until after they have rebelled they cannot become conscious.
 
 - George Orwell's 1984 - 



-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] A Java Virtual Machine probe

2006-03-28 Thread Ben Hartshorne
On Tue, Mar 28, 2006 at 08:02:32PM +0100, José Miguel Pereira Tavares wrote:
 
   Hi Ben!
 
 On Tuesday, 28 March 2006 19:30, Ben Hartshorne wrote:
  Miguel,
  I look forward to trying this out.  You say 'Installation requires
  ganglia to be installed using the source code.' - I am curious what
  files you require.  I installed ganglia using an RPM, and would like
  to bring in only the files I need.  Is this a bad idea?  What do you
  recommend?  (and why would a compiled C program require sources to
  run with other compiled programs?)
 
   It's needed mostly for linking with several libraries (libganglia, 
 libconfuse, libapr-0, libmetrics, libgetopthelper) that come with 
 ganglia and are not required to be installed on the host. At 
 least as far as I could tell.

Does it statically or dynamically link against those binaries?  In other
words, do I only need the ganglia src on the machine on which I compile
JVMProbe, or will I need it to run?

I ask because I would like to deploy to ~30 hosts (to see how the stats
run on my cluster), and if I can avoid deploying the sources to all
those hosts, I would like it better.  :)

One suggestion I have so far - you mention that though the tool is
called JVMProbe, it can be used on other applications.  however, the
stats that it sends to ganglia are all named JVM_foo, which means that
it cannot be used to monitor two different applications on the same
host.  Perhaps you could include an option to include the name of the
process it's watching as part of the metric name?  For example, when
watching java, use JVM_java_foo as the metric name.  If you then also
wanted to watch abcApp on the same host, it would report those metrics
as JVM_abcApp_foo and not collide namespace.  
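The naming scheme suggested above is small enough to sketch in Python. The function name metric_name and the rule for replacing non-alphanumeric characters are assumptions made for this example, not part of JVMProbe.

```python
import re


def metric_name(process, stat, prefix="JVM"):
    """Build a per-process metric name such as JVM_java_vmRSS.

    Characters outside [A-Za-z0-9] in the process name are collapsed to
    underscores (a choice made for this sketch) so the result stays a
    single plain token.
    """
    safe = re.sub(r"[^A-Za-z0-9]+", "_", process).strip("_")
    return "_".join([prefix, safe, stat])
```

Two probes on one host would then report, for example, JVM_java_vmRSS and JVM_abcApp_vmRSS, and their series would no longer collide.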

One bug report - After compilation, I ran 'sudo ./JVMProbe -d' to see if
it would actually work.  I got the following output:
  - [0] -
  - [1] -
  - [2] -
  - [3] -

and then I cancelled the process.  I was confused - was this correct
output?  Despite the fact that the --help option said that the default
name of the process to watch was 'java', I added '-n java' and then
got:
 - [0] -
 JVM_taks = 39 tasks
 JVM_avgUse = 98 %
 JVM_highUse = 12 %
 JVM_vmPeak = 0 kB
 JVM_vmSize = 1274096 kB
 JVM_vmRSS = 765428 kB  
 JVM_vmData = 1201280 kB
 JVM_vmStk = 2036 kB
 JVM_vmExe = 56 kB
 JVM_vmLib = 70228 kB   
 - [1] -
 JVM_taks = 39 tasks
 JVM_avgUse = 98 %
 JVM_highUse = 12 %
 JVM_vmPeak = 0 kB
 JVM_vmSize = 1274096 kB
 JVM_vmRSS = 765428 kB  
 JVM_vmData = 1201280 kB
 JVM_vmStk = 2036 kB
 JVM_vmExe = 56 kB
 JVM_vmLib = 70228 kB   

ahhh.  Much better.  I don't know why it didn't work without the '-n'
flag.  
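The vm* values in that output have the same names and units as the Vm* lines in /proc/<pid>/status, which Miguel's announcement says the probe parses. As a rough illustration of that parsing step (a Python sketch, not JVMProbe's C code), assuming the usual "VmRSS:   765428 kB" line format:

```python
def parse_vm_fields(status_text):
    """Return {"VmSize": kB, "VmRSS": kB, ...} from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not key.startswith("Vm"):
            continue
        parts = rest.split()
        # Expect exactly "<number> kB"; skip anything else.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            values[key] = int(parts[0])
    return values


def read_vm_fields(pid):
    """Read and parse the status file of one task (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_vm_fields(handle.read())
```

On kernels that expose each thread as its own task, a probe would have to decide which tasks belong to the JVM before calling something like read_vm_fields on them. That part is not sketched here.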

Many Many thanks for writing this module!  though I give you errors
only, the fact is that it works for me, nearly out of the box, and is
successfully reporting JVM stats within my framework!  :)

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] A Java Virtual Machine probe

2006-03-28 Thread Ben Hartshorne
Miguel,

Can this probe get things like stats on the garbage collector (avg time
spent in GC, etc.)?

Thanks,

-ben

On Tue, Mar 28, 2006 at 05:17:29PM +0100, José Miguel Pereira Tavares wrote:
 
   Hi all!
 
   Sometime ago I enquired about the existence of a probe that would give 
 information on a running Java Virtual Machine (JVM). Unfortunately it's 
 a Linux only probe (for now at least).
 
   Now I am happy to present to the community a probe developed in C that 
 can monitor a JVM and report to a gmond. Its publishing of metrics is 
 similar to that of gmetric.
 
   It works by parsing the info at /proc/pid/status of the Linux tasks 
 relevant to the JVM (depending on the kernel the status file reports 
 just one process with threads or a set of tasks that represent 
 process/threads).
 
   Although this probe is now used for monitoring Java services it can be 
 used to monitor any other kind of process that has a long time span. 
 It's a kind of process oriented metrics.
 
   It was done in C for the usual question on intrusiveness. Memory 
 footprint is around 1.5 Mb Vss and it uses 0.1% of the CPU with 10 
 seconds of interval between samples.
 
   This probe can be downloaded at:
 http://student.dei.uc.pt/~mtavares/index.php?content=software
 
   Installation requires ganglia to be installed using the source code. 
 
   I would be glad to hear comments, bug reports and, even better, to 
 receive patches.
 
   Miguel Tavares
 
   
 -- 
 
 Until they become conscious they will never rebel,
  and until after they have rebelled they cannot become conscious.
 
 - George Orwell's 1984 - 



-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] uptime metric/graph

2006-03-15 Thread Ben Hartshorne
On Wed, Mar 15, 2006 at 11:17:07AM -0500, Richard Lefebvre wrote:
 Has anyone created an uptime metric/graph? It would be a great stat to 
 collect to see how often a system is rebooted.

Richard,

the Time and String metrics already report uptime (and boot time).
Were you imagining something other than that?

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Documentation

2006-02-17 Thread Ben Hartshorne
On Thu, Feb 16, 2006 at 07:08:55PM -0800, Bernard Li wrote:
 Well somebody needs to update that doc though - it seems pretty outdated.

hmm...  I did just volunteer myself, didn't I?  I didn't realize it was
out of date.  I really wish I could update it, but I have neither the
time nor the understanding of how things work - I was learning from it!
;)

::sigh::  ok, well...  

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




[Ganglia-general] Documentation

2006-02-16 Thread Ben Hartshorne
In a recent thread (Ganglia 3.0.2 runtime error on AIX 5.1), Raymond
Pete pointed the thread to the Ganglia documentation on the ucsf server: 
 
http://www.msg.ucsf.edu/local/ganglia/ganglia_docs/

A part of this document says that 'the latest version of this document
can be found on the ganglia documentation page' and links to
http://ganglia.sourceforge.net/docs/  I cannot find the document
referenced on the UCSF page on sourceforge, which is a real travesty
because the UCSF document is 10 times better than what's on the
sourceforge docs page.  

Could the webmaster of the sourceforge project include a copy of the
UCSF docs in the documentation section of the web page?

-ben


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] config file confusion

2006-02-09 Thread Ben Hartshorne
On Thu, Feb 09, 2006 at 11:13:15AM -0700, Ian Cunningham wrote:
 To expand on what Jason wrote, the name doesn't actually decide which 
 cluster the node's data gets included in. If you are using multicast as 

so what is the name used for?  If you're defining the name/port
combination on the server running gmetad, would it have any effect to
have a different name on each host?  Let's say you have three hosts on
each of three different multicast ports, and they're named a1, a2, a3,
b1, b2, b3, c1, c2, c3.  I'm not really clear on how the grid/cluster
naming thing would happen.  The UI shows grid/cluster/host; under what
conditions would you see each of the above names?  

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] librrd.so.0

2006-02-08 Thread Ben Hartshorne
On Wed, Feb 08, 2006 at 11:36:08AM -0900, [EMAIL PROTECTED] wrote:
 List,
 
 This is driving me crazy. We have just spun up a small cluster running
 Oscar 3.1. We decided to run ganglia to get a feel for load balance,
 system and network behavior, and possible bottlenecks. We are running
 Redhat Fedora Core 3.

I too run FC3.  My copy of librrd.so.0 comes from /usr/lib/librrd.so.0
and is in the package rrdtool-1.0.49-3 (which came as an RPM) from
ftp://rpmfind.net/linux/fedora/extras/3/i386/rrdtool-1.0.49-4.fc3.i386.rpm

HTH,

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




Re: [Ganglia-general] Pointers on architecting a largescale ganglia setup??

2006-01-31 Thread Ben Hartshorne
On Tue, Jan 31, 2006 at 12:15:19AM -0800, Martin Knoblauch wrote:
 
  just in case you did not know:
 
  http://ganglia.sourceforge.net/gmetric/
 
  Everyone is invited to contribute to the repository.

Martin,

I believe someone else has pointed out that submissions have been closed
(and apparently for a very long time...).

I found most of the scripts there not quite right for what I wanted to
do, so I wrote my own.  

I have put them up at http://cryptio.net/~ben/ganglia/ for your
consumption.  They include
* disk - measures disk IO (per disk as well as cumulative)
* network - reports per-interface stats (which I combined in a ganglia
  report to show all on one graph - fantastic for frontend/backend stuff)
* mysql - reports queries per second as well as broken slow queries
* sensors - CPU temp. et al for Tyan motherboards (may work for others)

There is also a crontab file there for /etc/cron.d/ that calls them
every two minutes and includes the (with this list's help) fixed
num-users metric:
*/2 * * * * root /usr/bin/gmetric --name=users --value=`who | wc -l` --type=int16

One thing I like about these scripts is that they do a fair bit of error
checking, so if something happens that might cause them to fail every
two minutes, you don't get 100 messages in your inbox the next morning.
For example, if mysqld dies on an unimportant box, you don't want to be
inundated with messages.  
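The "fail quietly under cron" idea can be sketched in Python. This is not how the shell scripts at the URL above do it. The helper name report_quietly and the uint32 type are assumptions for the example. The point is only that cron mails whatever a job prints, so a collector that cannot get its value should return without output rather than raise.

```python
import subprocess


def report_quietly(collect, name, send=subprocess.call):
    """Run collect(), publish its value with gmetric, never print anything.

    collect is any zero-argument callable returning a number.
    send is swappable so the function can be exercised without gmetric.
    Returns True if the metric was sent, False otherwise.
    """
    try:
        value = collect()
    except Exception:
        return False          # e.g. mysqld is down: skip this run silently
    command = ["gmetric", "--name=" + name,
               "--value=" + str(value), "--type=uint32"]
    try:
        return send(command) == 0
    except OSError:
        return False          # gmetric itself is missing or not executable
```

A job can still log failures to a file or to syslog. The only thing to avoid is writing to stdout or stderr on every run.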

HTH,

-ben

p.s.  these scripts have been written for a redhat-based linux
installation (Fedora, CentOS, etc.).  I don't know how portable they
are.  I expect not very much.  :)


-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net




[Ganglia-general] intermittent blanks in graphs

2006-01-23 Thread Ben Hartshorne
Hi,

I have been running ganglia for most of the last year, quite happily.
My hosts are configured to send unicast data to a single gmetad server.

Recently, large portions of the cluster's graphs are empty.  A sample is
shown at http://cryptio.net/~ben/ganglia/blank_graphs.png  Notice that
not all hosts are missing data (Burgertime, for example, has all the
data there).

I thought it was due to high load, because I first noticed it when the
gmetad server was being hammered by a separate process.  But it has long
since recovered, and the graphs have not; in fact, they have gotten
worse.

I was running 3.0.1, and tried upgrading to 3.0.2 on the off chance it
would fix something, but it did not.  I have since downgraded the webui
because I have made some changes[*] and I don't want to spend the time to
migrate them just now.  :)  

When I go into the page for a single host and click on the 'gmetrics'
link, I find that all of my metrics have a record of being received
within the last two minutes (my time period).  And yet, their graphs
show up empty.

Any thoughts?  What logs should I be looking at?  

I am running on a Fedora Core 3 system, with version 3.0.1 (now 3.0.2).
I don't think I've made any gross changes to the environment within the
last week, which is the time period in which all this annoyance has
started.  The only thing I can say is that the beginning of this
strangeness coincides with a brief (12-hr) period of intense load on the
gmetad server.

Thanks,

-ben

[*] for those interested - I added an 8-hour and 3-day view; I find the
8-hour view the most useful by far.  I also changed the size of the
graphs to fit my 20" screen.  Finally, I added a Disk summary graph, in
addition to the Load, CPU, Memory, and Network.  Is there any interest
in patching these into the source?

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net

