Hi Steve,

Most likely your problems are caused by disk I/O: gmetad is trying to update tens of thousands of RRD files every 15 seconds. I have switched to using tmpfs and have no problems monitoring a little over 1,000 nodes with a single gmetad collector node. The computer running gmetad is a dual 1GHz PIII with 1GB of RAM, and typically has a load under 1.0. I am using about 350MB of RAM to monitor the thousand nodes, so you will probably have to allocate a pretty big chunk of memory for your three thousand nodes. Just put an entry similar to this into /etc/fstab and mount it:
none /var/lib/ganglia/rrds tmpfs size=500M,mode=755,uid=nobody,gid=nobody 0 0

~Jason

On Wed, 2003-09-17 at 19:45, Steve Gilbert wrote:
> Hi folks,
>
> I don't know if I'm just trying to push Ganglia to more than it can handle
> or if I'm doing something wrong, but no matter how I design my Ganglia
> structure, gmetad seems to always crush the machine where it runs. Here's
> an overview of my environment:
>
> Ganglia 2.5.4
> All hosts involved are running RedHat 7.2
> RRDtool version 1.0.45
>
> I have 16 subnets, each with 200 machines, give or take a few. I estimate
> around 3000 nodes total. Some of these are dual P3, some are single P4, and
> a few random Xeon and Itanium nodes. Every node is running gmond, and
> that's running fine.
>
> Each subnet has a "master" node that is a dual P3 1.3GHz. This box provides
> DNS, NIS, and static DHCP for the subnet. Normal load on these machines is
> very, very minimal.
>
> My first attempt was to set up a single dedicated Ganglia machine running
> gmetad, Apache, and the web frontend. In this machine's gmetad.conf file, I
> listed each of the "master" nodes in the subnets as data sources. I thought
> having one box collect all the data and store the RRD files would be great.
> Well, this was a bad idea...the box (a P4 with 2GB RAM) was absolutely
> crushed...load shot up to 8.5, and all the graphs continually had gaps in
> them.
>
> So my next attempt was to install gmetad on each of the "master" nodes. I
> would have this gmetad collect data for the subnet, and then run another
> gmetad on my Ganglia web machine to just talk to these 16 other gmetads. I
> don't really like having to now back up 16 machines, but I've had problems
> before with trying to store RRD files on an NFS mount, so I decided not to
> try that.
> This isn't working all that great, either...the gmetad on these
> "master" nodes (collecting data from ~200 hosts each) is also causing a
> pretty high load...the boxes now stay around 2-3 load points all the time
> and sometimes slow down other operations on the box.
>
> Am I doing something wrong, or is gmetad really this much of a resource hog?
> Anyone else trying to use Ganglia to monitor 3000 machines? Am I asking too
> much? Thanks for any insight.
>
> Steve Gilbert
> Unix Systems Administrator
> [EMAIL PROTECTED]
>
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general