Chris, all,

I'm in the same legal position as Alex (in addition, I'm not allowed to use my 
work email address and have to rely on my ISP's email service, which is only up 
intermittently - and webmail is blocked while I'm at work).

However, I'd like to share my experience.

Our hierarchy (by geography) is as follows:

Global gmetad
        |
        |
Regional gmetad (collecting every 60 seconds)
        |
        |
send_receive gmonds (10 ~ 400 nodes each)
        |
        |
nodes (~10k) sending UDP unicast
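
To make the wiring concrete, here is a minimal sketch of how such a hierarchy 
could be expressed in Ganglia 3.x-style configuration files (the hostnames, 
cluster name and ports are invented for illustration; our actual files are 
generated as described below):

    # On each compute node (gmond.conf): unicast UDP to the cluster's
    # send_receive gmond instead of multicasting.
    cluster {
      name = "example-cluster"        # invented name
    }
    udp_send_channel {
      host = agg1.example.com         # the send_receive gmond for this cluster
      port = 8649
    }

    # On the send_receive gmond (gmond.conf): accept UDP from the nodes and
    # answer TCP polls from the regional gmetad.
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }

    # On the regional gmetad (gmetad.conf): poll each send_receive gmond
    # every 60 seconds.
    data_source "example-cluster" 60 agg1.example.com:8649

    # On the global gmetad (gmetad.conf): federate by polling the regional
    # gmetad's XML port (8651 by default).
    data_source "example-region" regional1.example.com:8651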


I agree with Alex's points, though we do not use the gmetric functionality, for 
various reasons - mainly the load impact of running the scripts and the 
manageability of that solution.

What we've implemented is a global gmond configuration engine served over the 
web.  It consists of an Oracle database with a web front end, which controls 
our Ganglia deployment.

A walk-through of the functionality:

On the nodes (gmond package)
With each gmond package I include a Perl script which sends variables taken 
from the local host (fqdn, interface) to the web server via HTTP.

On the server (gmetad package)
The PHP CGI script on the web server returns a gmond.conf to the node, which 
specifies the gmonds it will report to.
Depending on the fqdn it receives, the PHP CGI script will either (1) enter the 
host into a default gmond (updating the database) or (2) send back a 
configuration file listing the predefined gmonds (taken from the database) the 
host will report to.

Finally, on the node end, the package starts gmond (the last phase of the 
package install) with the newly acquired gmond.conf.
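
To illustrate the node-side half of this exchange, the registration script can 
be as simple as the following (a hedged sketch only, not our actual script; the 
URL, query parameters and paths are invented):

    #!/usr/bin/perl
    # Hypothetical node-side registration sketch: send the local fqdn and
    # interface to the configuration web server and write whatever
    # gmond.conf it returns.
    use strict;
    use warnings;
    use Sys::Hostname;
    use LWP::Simple qw(get);

    my $fqdn  = hostname();            # assumes hostname() yields the fqdn
    my $iface = 'eth0';                # placeholder; detect this properly
    my $url   = "http://config.example.com/gmond.php?fqdn=$fqdn&iface=$iface";

    my $conf = get($url)
        or die "no gmond.conf returned for $fqdn\n";

    open my $fh, '>', '/etc/gmond.conf' or die "cannot write gmond.conf: $!\n";
    print {$fh} $conf;
    close $fh;

    # The package post-install then (re)starts gmond with this config,
    # e.g. system('/etc/init.d/gmond', 'restart');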

Some other points about the architecture:
- It uses the TemplatePower engine.
- A cron job checks that the gmond.conf is up to date every day at 12 pm local 
time (there is no DoS risk since we've included a timeout on the node side).
- The database tables are very simple.
- Anyone can bulk-update the database through a simple Perl DBI/DBD script (see 
the sketch after this list).
- A web front end for the database table allows us to easily view the 
send_receive gmonds and the send gmonds, which enables us to understand and 
manage our environment with very low overhead.
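
For completeness, a bulk update along those lines might look something like 
this (a minimal sketch assuming DBI with DBD::Oracle; the DSN, table and column 
names are invented):

    #!/usr/bin/perl
    # Hypothetical bulk-update sketch: point a list of hosts at one aggregator.
    use strict;
    use warnings;
    use DBI;

    my $aggregator = shift @ARGV
        or die "usage: $0 <aggregator_fqdn> < nodes.txt\n";

    my $dbh = DBI->connect('dbi:Oracle:gangliadb', 'ganglia', 'secret',
                           { RaiseError => 1, AutoCommit => 0 });

    # invented table/columns: node_map(fqdn, aggregator)
    my $sth = $dbh->prepare(
        'UPDATE node_map SET aggregator = ? WHERE fqdn = ?');

    while (my $fqdn = <STDIN>) {        # nodes.txt: one fqdn per line
        chomp $fqdn;
        next unless length $fqdn;
        $sth->execute($aggregator, $fqdn);
    }

    $dbh->commit;
    $dbh->disconnect;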
That's all I can think of for now....  Any questions/queries are welcome.

Unfortunately I'm in the same position as Alex: I would like to share this, but 
I'm not sure whether I can at this time.

Thanks,

Ole


Chris Croswhite wrote:

Alex,

Thanks for the great information.  I'll check out the Jan email thread
and then follow up with more questions.

BTW, the script statement did come across rather badly, sorry about that
(and after all that PC training I was required to take!)

Thanks
Chris

On Thu, 2006-02-23 at 10:36, Alex Balk wrote:
Chris Croswhite wrote:

Alex,

Yeah, I already have a ton of questions and need some pointers on large-scale
deploys (best practices, do's, don'ts, etc.).

Till I get the legal issues out of the way, I can't share the scripts...
What I can do, however, is share the ideas I've implemented, as those
were developed outside the customer environment and were just spin-offs
of common concepts like orchestration, federation, etc.
Here are a few things:

   * When unicasting, a tree hierarchy of nodes could provide useful
     drill-down capabilities.
   * Most organizations already have some form of logical grouping for
     cluster nodes. For example: faculty, course, devel-group, etc.
     Within those groups one might find additional logical
     partitioning. For example: platform, project, developer, etc.
     Using these as the basis for constructing your logical hierarchy
     provides a real-world basis for information analysis, saves you the
     trouble of deciding how to construct your grid tree and prevents
     gmond aggregators from handling too many nodes (though I've found
     that a single gmond can store information for 1k nodes without
     noticeable impact on performance).
   * Nodes will sometimes move between logical clusters. Hence,
     whatever mechanism you have in place has to detect this and
     regenerate its gmond.conf.
   * Using a central map which stores "cluster_name gmond_aggregator
     gmetad_aggregator" will save you the headache of figuring out who
     reports where, who pulls info from where, etc. (see the sketch after
     this list). If you take this approach, be sure to cache this file
     locally or put it on your parallel FS (if you use one). You wouldn't
     want 10k hosts trying to retrieve it from a single filer.
   * The same map file approach can be used for gmetrics. This allows
     anyone in your IT group to add custom metrics without having to be
     familiar with gmetric and without having to handle crontabs. A
     mechanism which reads (the cached version of) this file could
     handle inserting/removing crontabs as needed.
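
As an illustration only, such a map might look like the following, with a small
helper that turns a locally cached copy of it into a unicast send channel (the
hostnames and file locations are invented; the columns follow the
"cluster_name gmond_aggregator gmetad_aggregator" layout described above):

    # cluster_name    gmond_aggregator         gmetad_aggregator
    chem-devel        agg1.example.com:8649    gmetad-eu.example.com:8651
    physics-hpc       agg2.example.com:8649    gmetad-eu.example.com:8651

    #!/usr/bin/perl
    # Hypothetical helper: look up this node's cluster in the cached map and
    # print a gmond 3.x unicast send channel stanza for it.
    use strict;
    use warnings;

    my ($cluster) = @ARGV or die "usage: $0 <cluster_name>\n";
    open my $map, '<', '/var/cache/ganglia/cluster.map'
        or die "no cached map: $!\n";

    while (<$map>) {
        next if /^\s*(#|$)/;            # skip comments and blank lines
        my ($name, $gmond_agg, $gmetad_agg) = split;
        next unless $name eq $cluster;
        my ($host, $port) = split /:/, $gmond_agg;
        print "udp_send_channel {\n  host = $host\n  port = $port\n}\n";
        last;
    }

The same cached map (or a gmetric-specific one) could drive the custom-metric
crontabs in a similar way.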

Also, check out the ganglia-general thread from January 2006 called
"Pointers on architecting a large scale ganglia setup".


I would love to get my hands on your shell scripts to figure out what
you are doing (the unicast idea is pretty good).

Okay, that sounds almost obscene ;-)


Cheers,
Alex

Chris


On Thu, 2006-02-23 at 09:35, Alex Balk wrote:
Chris,


Cool! Thanks!

If you need any pointers on large-scale deployments, beyond the
excellent thread that was discussed here last month, drop us a line. I'm
managing Ganglia on a cluster of about the same size as yours, spanning
multiple sites.


I've developed a framework for automating the deployment of Ganglia in a
federated mode (we use unicast). I'm currently negotiating the
possibility of releasing this framework to the Ganglia community. It's
not the prettiest piece of code, as it's written in bash and spans a few
thousand lines (I didn't expect it to grow into something like that),
but it provides some nice functionality like map-based logical clusters,
automatic node migration between clusters, map-based gmetrics, and some
other goodies.

If negotiations fail I'll consider rewriting it from scratch in Perl in
my own free time.


btw, I think Martin was looking for a build on HP-UX 11...


Cheers,

Alex


Chris Croswhite wrote:

This raises another issue, which I believe is significant to the
development process of Ganglia. At the moment we don't seem to have
(correct me if I'm wrong) official testers for various platforms.
Maybe we could have some people volunteer to be official beta testers?
That way we wouldn't have a release go out the door without it being
properly tested under most OS/arch combinations.
The company I work for is looking to deploy ganglia across all compute
farms, some ~10k systems.  I could help with beta testing on these
platforms:
HP-UX 11+11i
AIX51+53
slowlaris7-10
solaris10 x64
linux32/64 (SuSE and RH)

Just let me know when you have a new candidate and I can push the client
onto some test systems.

Chris







