Hello everyone,

For a few weeks now we have had performance issues due to the growth of our monitoring setup. One of my colleagues, Devon O'Dell, volunteered to help, and below is an e-mail with his findings.

We'll submit a pull request once we are comfortable with the changes:

https://github.com/dhobsd/monitor-core/compare/master

Vladimir

==== Forwarded message ====

Vlad emailed some time ago about issues we're having with Ganglia performance. Over the past couple of weeks, I spent some time figuring out how Ganglia works and attempting to identify and solve the performance issues. The issue is fundamentally one of scale: the number of metrics we monitor times the number of servers we have (times the number of metrics we sum!) ends up being a large number.

The Ganglia core consists of two daemons, `gmond` and `gmetad`. `Gmond` is primarily responsible for sending and receiving metrics; `gmetad` carries the hefty task of summarizing / aggregating the information, writing the metrics to graphing utilities (such as RRD), and reporting summary metrics to the web front-end. Due to growth, some metrics were never updating and front-end response time was abysmal. These issues are tied directly to `gmetad`.

In our Ganglia setup, we run a `gmond` on every machine to collect its data, plus several `gmetad` processes:

 * An interactive `gmetad` process is responsible solely for reporting summary statistics to the web interface.
 * Another `gmetad` process is responsible for writing graphs.

Initially, I spent a large amount of time using the `perf` utility to try to find the bottleneck in the interactive `gmetad` service. I found that the hash table implementation in `gmetad` leaves a lot to be desired: apart from very poor behavior in the face of concurrency, it is also the wrong data structure for this purpose. Unfortunately, fixing this would require rewriting large swaths of Ganglia, so it was punted. Instead, Vlad suggested we simply reduce the number of summarized metrics by explicitly stating which metrics are summarized.

This improved the performance of the interactive process (and thus of the web interface), but it didn't address the other issues: graphs still weren't updating properly (or at all, in some cases). Running `perf` on the graphing `gmetad` process revealed that the issue was largely one of serialization: although we thought we had configured `gmetad` to use `rrdcached` to improve caching performance, the way Ganglia calls librrd doesn't actually end up using rrdcached -- `gmetad` was writing directly to disk every time, forcing us to spin in the kernel. Additionally, librrd isn't thread-safe (and its thread-safe API is broken). All calls to the RRD API are serialized, so each call to create or update not only hit disk but also prevented any other thread from calling create or update. We have 47 threads running at any time, all generally trying to write data to an RRD file.
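To make the serialization concrete, here is a minimal sketch of a fully serialized write path: one process-wide mutex around librrd's reentrant update call. This illustrates the pattern described above, not `gmetad`'s actual code; the function name and the "N:42" value are made up.

    /*
     * Minimal sketch (not gmetad's actual code): because librrd is not
     * thread-safe, every create/update is guarded by one process-wide
     * mutex, so dozens of writer threads take turns hitting the disk.
     */
    #include <pthread.h>
    #include <rrd.h>

    static pthread_mutex_t rrd_mutex = PTHREAD_MUTEX_INITIALIZER;

    static int
    write_metric(const char *rrd_file, const char *update /* e.g. "N:42" */)
    {
        const char *argv[] = { update };
        int rc;

        pthread_mutex_lock(&rrd_mutex);   /* every thread funnels through here */
        rrd_clear_error();
        rc = rrd_update_r(rrd_file, NULL, 1, argv);   /* synchronous disk I/O */
        pthread_mutex_unlock(&rrd_mutex);

        return rc;
    }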

Modifying `gmetad` to call the librrd functions that route updates through rrdcached helped a little, but most threads were still spending all their time spinning on locks: although we were now writing to rrdcached, we were doing so over a single file descriptor to a Unix domain socket. This forced the kernel to serialize the reads and writes from all the different threads on that single file descriptor. The only reason we gained any performance was that we were no longer hitting disk.
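For reference, librrd's daemon-aware client path looks roughly like the sketch below (again an illustration, not the actual patch; the function and the socket path are assumptions). As far as I can tell, librrd keeps a single connection per process, which is why every thread still ends up sharing one descriptor.

    /*
     * Sketch of routing updates through rrdcached via librrd's client API
     * (rrd_client.h).  librrd keeps ONE connection per process, so
     * concurrent callers still serialize on that single descriptor.
     */
    #include <rrd_client.h>

    static int
    write_metric_via_daemon(const char *daemon_addr, /* e.g. "unix:/var/run/rrdcached.sock" */
                            const char *rrd_file,
                            const char *update)      /* e.g. "N:42" */
    {
        const char *values[] = { update };

        if (rrdc_connect(daemon_addr) != 0)   /* no-op if already connected */
            return -1;

        /* Queued in rrdcached instead of written synchronously to disk. */
        return rrdc_update(rrd_file, 1, values);
    }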

To solve this problem, I made rrdcached listen on a TCP socket and gave every thread in `gmetad` its own file descriptor to connect to rrdcached. This allowed every thread to write to rrdcached without locking for updates (creating new RRDs still requires holding a lock and calling into librrd). This worked, and I suspect we'll be able to move forward for some time with these changes. They are running on (censored) right now, and we'll leave them running for a while to make sure they're solid before pushing the patches upstream.
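The actual changes are in the branch linked above; as a rough illustration of the idea, the sketch below shows a thread holding its own TCP connection to rrdcached and speaking the daemon's text protocol directly. The helper names, host/port handling, and reply parsing here are hypothetical, not the patch itself.

    /*
     * Rough sketch: each writer thread keeps its own TCP connection to
     * rrdcached and speaks the daemon's text protocol directly, so threads
     * no longer contend on a single shared descriptor.
     */
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Open a per-thread connection to rrdcached listening on a TCP port. */
    static int
    rrdcached_connect(const char *host, const char *port)
    {
        struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
        int fd = -1;

        if (getaddrinfo(host, port, &hints, &res) != 0)
            return -1;
        fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }

    /* Send "UPDATE <file> <timestamp:value>" and read the one-line reply;
     * a negative status code in the reply indicates an error. */
    static int
    rrdcached_update(int fd, const char *rrd_file, const char *update)
    {
        char buf[1024];
        ssize_t n;
        int status;

        snprintf(buf, sizeof(buf), "UPDATE %s %s\n", rrd_file, update);
        if (write(fd, buf, strlen(buf)) < 0)
            return -1;
        if ((n = read(fd, buf, sizeof(buf) - 1)) <= 0)
            return -1;
        buf[n] = '\0';
        return (sscanf(buf, "%d", &status) == 1 && status >= 0) ? 0 : -1;
    }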

In the process of doing this, I noticed that Ganglia used a particularly poor method for reading its XML metrics from `gmond`: it initialized a 1024-byte buffer, read into it, and if the buffer would overflow, it reallocated the buffer with an additional 1024 bytes and tried reading again. When dealing with XML files many megabytes in size, this caused many unnecessary reallocations. I modified this code to start with a 128KB buffer and to double the buffer size whenever it runs out of space. (I made the same change to the code for decompressing gzipped data, which used the same buffer-sizing scheme.)
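A sketch of the new growth strategy (illustrative, not the exact patch; the struct and function names are made up): start large and double on overflow, so the number of reallocations grows logarithmically with the size of the XML dump instead of linearly.

    /*
     * Sketch of doubling buffer growth.  The old code started at 1 KB and
     * grew by 1 KB at a time, so a 10 MB XML dump cost roughly 10,000
     * reallocations; starting at 128 KB and doubling needs only a handful.
     */
    #include <stdlib.h>

    #define INITIAL_BUF_SIZE (128 * 1024)

    struct growbuf {
        char   *data;
        size_t  used;
        size_t  cap;
    };

    /* Ensure room for `need` more bytes, doubling capacity as required. */
    static int
    growbuf_reserve(struct growbuf *b, size_t need)
    {
        size_t cap = b->cap ? b->cap : INITIAL_BUF_SIZE;
        char *p;

        while (cap - b->used < need)
            cap *= 2;
        if (cap != b->cap) {
            if ((p = realloc(b->data, cap)) == NULL)
                return -1;
            b->data = p;
            b->cap  = cap;
        }
        return 0;
    }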

After all these changes, both the interactive and the RRD-writing processes spend most of their time in the hash table. I can continue improving Ganglia performance, but most of the low-hanging fruit is now gone; at some point it will require:

 * writing a new version of librrd (which probably also means changing the RRD file format),
 * replacing the hash table in Ganglia with one that performs better,
 * changing the data serialization format from XML to one that is easier / faster to parse,
 * using a different data structure than a hash table for metrics hierarchies (probably a tree with metrics stored at each level in contiguous memory and an index describing each metric at each level; see the sketch after this list),
 * refactoring `gmetad` and `gmond` into a single process that shares memory.
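To make the tree idea in the fourth bullet a bit more concrete, here is one possible shape for such a hierarchy. The names and layout are hypothetical -- a sketch of the idea, not a design we've committed to.

    /*
     * One possible shape for a metrics hierarchy (grid -> cluster -> host).
     * Each node keeps its metrics in one contiguous array, so summarizing a
     * level is a linear scan rather than a hash-table walk.
     */
    #include <stddef.h>

    struct metric {
        char     name[64];   /* metric name, fixed-size to keep storage contiguous */
        double   value;      /* latest value */
        double   sum;        /* running sum for summary levels */
        unsigned num;        /* number of samples folded into `sum` */
    };

    struct metric_node {
        char                name[128];  /* grid / cluster / host name */
        struct metric      *metrics;    /* contiguous array of this node's metrics */
        size_t              nmetrics;
        struct metric_node *children;   /* contiguous array of child nodes */
        size_t              nchildren;
    };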

These are all longer-term projects, but I think they'll eventually prove useful.
