Hello everyone,

For a few weeks now we have had performance issues due to the growth of our monitoring setup. One of my colleagues, Devon O'Dell, volunteered to help, and below is an e-mail with his findings. We'll submit a pull request once we are comfortable with the changes: https://github.com/dhobsd/monitor-core/compare/master

Vladimir

==== Forwarded message ====

Vlad emailed some time ago
about issues we're having with Ganglia performance. Over the past
couple weeks, I spent some time figuring out how Ganglia works and
attempting to identify / solve the performance issues. The issue
is fundamentally one of scale: the number of metrics we monitor
times the number of servers we have (times the number of metrics
we sum!) ends up being a large number.
The Ganglia core consists of two daemons, `gmond` and `gmetad`. `gmond` is
primarily responsible for sending and receiving metrics; `gmetad`
carries the hefty task of summarizing / aggregating the
information, writing the metrics information to graphing utilities
(such as RRD), and reporting summary metrics to the web front-end.
Due to growth, some metrics were never updating, and front-end
response time was abysmal. These issues are tied directly to
`gmetad`.
In our Ganglia setup, we run a `gmond` on every machine to collect data, along with several `gmetad` processes:
* An interactive `gmetad`
process is responsible solely for reporting summary statistics to
the web interface.
* Another `gmetad` process
is responsible for writing graphs.
Initially, I spent a large
amount of time using the `perf` utility to attempt to find the
bottleneck in the interactive `gmetad` service. I found that the
hash table implementation in `gmetad` leaves a lot to be desired:
apart from very poor behavior in the face of concurrency, it is also
the wrong data structure for this purpose. Unfortunately,
fixing this would require rewriting large swaths of Ganglia, so
this was punted. Instead, Vlad suggested we simply reduce the
number of summarized metrics by explicitly stating which metrics
are summarized.
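As a sketch, the whitelist approach looks something like the following gmetad.conf fragment. The directive name and the metric list here are hypothetical placeholders; the actual patch may spell this differently:

```
# Hypothetical gmetad.conf fragment: only these metrics get summarized,
# instead of summing every metric across every host.
summarized_metrics "load_one" "cpu_user" "cpu_system" "bytes_in" "bytes_out"
```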
This improved the
performance of the interactive process (and thus of the web
interface), but didn't address other issues: graphs still weren't
updating properly (or at all, in some cases). Running `perf` on
the graphing `gmetad` process revealed that the issue was largely
one of serialization: although we had thought we had configured
`gmetad` to use `rrdcached` to improve caching performance, the
way that Ganglia calls librrd doesn't actually end up using
rrdcached -- `gmetad` was writing directly to disk every time,
forcing us to spin in the kernel. Additionally, librrd isn't
thread-safe (and its thread-safe API is broken). All calls to the
RRD API are serialized, and each call to create or update not only
hit disk, but prevented any other thread from calling create or
update. We have 47 threads running at any time, all generally
trying to write data to an RRD file.
Modifying `gmetad` to call the librrd functions that actually route updates through rrdcached helped a little,
but most threads were still spending all their time spinning on
locks: although we were writing to rrdcached now, we were doing so
over a single file descriptor to a unix domain socket. This forced
the kernel to serialize the reads and writes from all the
different threads to the single file descriptor. The only reason
we gained any performance was due to not hitting disk.
To solve this problem, I
made rrdcached listen on a TCP socket and gave every thread in
gmetad its own file descriptor to connect to rrdcached. This
allowed every thread to write to rrdcached without locking for
updates (creating new RRDs still requires holding a lock and
calling into librrd). This worked, and I suspect that we'll be
able to move forward for some time with these changes. They are
running on (censored) right now, and we'll leave them running for
a while to make sure they're good before pushing the patches
upstream.
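For reference, the rrdcached side of this only requires an extra TCP listen address. This is a config sketch; the port, paths, and journal location here are examples, not our actual values:

```shell
# Listen on the existing unix socket and additionally on loopback TCP,
# so each gmetad thread can open its own connection.
rrdcached -l unix:/var/run/rrdcached.sock \
          -l 127.0.0.1:42217 \
          -j /var/lib/rrdcached/journal \
          -b /var/lib/ganglia/rrds
```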
In the process of doing
this, I noticed that Ganglia used a particularly poor method for
reading its XML metrics from gmond: it initialized a 1024-byte
buffer, read into it, and whenever the buffer was about to overflow,
grew it by an additional 1024 bytes and read again.
When dealing with XML files many megabytes in size, this caused
many unnecessary reallocations. I modified this code to start with
a 128KB buffer and double the buffer size when it runs out of
space. (I made the same change to the gzip-decompression code, which
used the same buffer-sizing scheme.)
After all these changes,
both the interactive and RRD-writing processes spend most of their
time in the hash table. I can continue improving Ganglia
performance, but most of the low-hanging fruit is now gone; at
some point it will require:
* writing our own version of
librrd (which probably also means changing the RRD file format),
* replacing the hash table
in Ganglia with one that performs better,
* changing the data
serialization format from XML to one that is easier / faster to
parse,
* using a different data
structure than a hash table for metrics hierarchies (probably a
tree with metrics stored at each level in contiguous memory and an
index describing each metric at each level),
* refactoring gmetad and
gmond into a single process that shares memory
These are all longer-term
projects, but I think that they'll probably eventually be useful.
_______________________________________________ Ganglia-developers mailing list Ganglia-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-developers