On 12/07/2013 03:22 PM, Devon H. O'Dell wrote:
> We were polling every 10 seconds, and it was taking over 2 minutes to
> finish parsing the XML (which includes writing the RRDs). With my
> changes, the 10 second poll is feasible.
>

Parse, or fetch & parse?  We have a simple python script we use to 
generate meta metrics about gmetad (total hosts, metrics, clusters, 
etc.).  It can parse and do that counting on a 170 MiB file in about 15 
seconds.  2 minutes would seem to indicate something very wrong with how 
gmetad's C code is handling xml, or you have GBs of xml ;-)
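
For what it's worth, the counting that script does is essentially a
streaming parse.  A minimal C version with expat (the same parser the
perf output below shows gmetad using) would look roughly like this,
assuming the usual CLUSTER/HOST/METRIC element names:

/* Stream gmetad XML from stdin and count elements; nothing is buffered
 * whole.  Build with -lexpat. */
#include <stdio.h>
#include <string.h>
#include <expat.h>

struct counts { long clusters, hosts, metrics; };

static void XMLCALL
start_element(void *data, const XML_Char *name, const XML_Char **atts)
{
    struct counts *c = data;
    (void)atts;
    if (strcmp(name, "CLUSTER") == 0) c->clusters++;
    else if (strcmp(name, "HOST") == 0) c->hosts++;
    else if (strcmp(name, "METRIC") == 0) c->metrics++;
}

int main(void)
{
    struct counts c = { 0, 0, 0 };
    char buf[64 * 1024];
    size_t len;
    XML_Parser p = XML_ParserCreate(NULL);

    XML_SetUserData(p, &c);
    XML_SetStartElementHandler(p, start_element);

    while ((len = fread(buf, 1, sizeof buf, stdin)) > 0) {
        if (XML_Parse(p, buf, (int)len, 0) == XML_STATUS_ERROR) {
            fprintf(stderr, "parse error: %s\n",
                    XML_ErrorString(XML_GetErrorCode(p)));
            return 1;
        }
    }
    XML_Parse(p, buf, 0, 1);  /* end of document */

    printf("clusters=%ld hosts=%ld metrics=%ld\n",
           c.clusters, c.hosts, c.metrics);
    XML_ParserFree(p);
    return 0;
}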


>>> Initially, I spent a large amount of time using the `perf` utility to 
>>> attempt to
>>> find the bottleneck in the interactive `gmetad` service. I found that the 
>>> hash
>>> table implementation in `gmetad` leaves a lot to be desired: apart from very
>>> poor behavior in the face of concurrency, it also is the wrong 
>>> data structure to
>>> use for this purpose. Unfortunately, fixing this would require rewriting 
>>> large
>>> swaths of Ganglia, so this was punted. Instead, Vlad suggested we simply 
>>> reduce
>>> the number of summarized metrics by explicitly stating which metrics are 
>>> summarized.
>>
>> I strongly suspect a lot of blame should fall on the hash table.  I
>> think it's likely the cause of the hangs we have observed in
>> https://github.com/ganglia/monitor-core/issues/47 and Daniel Pocock has
>> seen problems with it as far back as 2009.
>
> Most of the user CPU time is indeed spent in the hash table, but most
> of the CPU time in general ends up being spent writing RRDs. This is
> due to filesystem locking in the kernel generated by the stat(2) calls
> every time we want to write an RRD, as well as the poor performance of
> straight librrd. (But I'll get to that in a minute in the rrdcached
> section.)
>

This is a simple `perf top -p $PID` on one of our gmetad nodes:

Samples: 1M of event 'cycles', Event count (approx.): 64115959770
   6.59%  libexpat.so.1.5.2          [.] 0x0000000000011b8d
   4.77%  libganglia-3.6.0.so.0.0.0  [.] hashval
   2.62%  [kernel]                   [k] __d_lookup
   2.21%  [kernel]                   [k] _spin_lock
   2.14%  libc-2.12.so               [.] vfprintf
   1.61%  librrd.so.4.2.0            [.] process_arg
   1.54%  libganglia-3.6.0.so.0.0.0  [.] hash_lookup
   1.46%  [kernel]                   [k] __link_path_walk
   1.16%  libc-2.12.so               [.] __GI_____strtod_l_internal
   1.11%  libc-2.12.so               [.] memcpy
   1.08%  libc-2.12.so               [.] _int_malloc

So I suppose my intuition about the expense of XML parsing is off.  I have 
not used perf as much as I should have: if we were seeing similar RRD 
writing contention, would I literally see "stat" near the top?


>> Most of my experience is with python and java which have fancy abstract
>> base classes, collection hierarchies, and whatnot.  Even though a hash
>> table does not have the best properties, wouldn't it be relatively easy
>> to drop in a better one?
>
> Yes, this is something I intend to do soon. I'm not sure what kind of
> portability targets are required for gmetad; I was intending to build
> one on top of Concurrency Kit's excellent hash set.
>

The wiki says 'Ganglia runs on Linux (i386, ia64, sparc, alpha, powerpc, 
m68k, mips, arm, hppa, s390), FreeBSD, NetBSD, OpenBSD, DragonflyBSD, 
MacOS X, Solaris, AIX, IRIX, Tru64, HPUX and Windows 
NT/XP/2000/2003/2008 making it as portable as it is scalable.', although 
I'm not sure if that accurately reflects current targets or not.
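
Assuming Concurrency Kit covers the targets that matter, a rough sketch
of what I imagine a ck_hs-backed metric table looking like is below.
The names (metric_entry, metric_hash, ...) are made up, and the callback
and allocator shapes should be checked against ck_hs.h before anyone
takes this literally:

/* Rough sketch only: a string-keyed metric table on ck_hs.  Check
 * ck_hs.h for the exact callback and struct ck_malloc signatures. */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <ck_hs.h>

struct metric_entry {
    char *name;       /* hash key */
    double value;     /* ...plus whatever gmetad keeps per metric */
};

static unsigned long
metric_hash(const void *object, unsigned long seed)
{
    const struct metric_entry *m = object;
    unsigned long h = seed;

    for (const char *p = m->name; *p != '\0'; p++)
        h = h * 131 + (unsigned char)*p;
    return h;
}

static bool
metric_compare(const void *a, const void *b)
{
    const struct metric_entry *x = a, *y = b;
    return strcmp(x->name, y->name) == 0;
}

static void *hs_malloc(size_t size) { return malloc(size); }
static void hs_free(void *p, size_t size, bool defer)
{ (void)size; (void)defer; free(p); }
static struct ck_malloc allocator = { .malloc = hs_malloc, .free = hs_free };

int main(void)
{
    ck_hs_t hs;
    unsigned long seed = 42;

    /* SPMC: many reader threads, one writer; readers never block. */
    if (!ck_hs_init(&hs, CK_HS_MODE_OBJECT | CK_HS_MODE_SPMC,
                    metric_hash, metric_compare, &allocator, 1024, seed))
        return 1;

    struct metric_entry *m = malloc(sizeof *m);
    m->name = strdup("load_one");
    m->value = 0.17;
    ck_hs_put(&hs, metric_hash(m, seed), m);

    struct metric_entry key = { .name = "load_one" };
    return ck_hs_get(&hs, metric_hash(&key, seed), &key) != NULL ? 0 : 1;
}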

>> Anyway we have had much better luck with tuning the page cache and
>> disabling fsync for gmetad.
>> http://oss.oetiker.ch/rrdtool-trac/wiki/TuningRRD  Admittedly at least 
>> some of the problems we had with rrdcached could have been due to the
>> issues you have identified.
>
> Even with librrd using mmap(2) and friends, there's a lot of system
> call overhead as the RRD file is open(2)ed, locked, and then written
> to. gmetad also serializes every thread around the calls to librrd, so
> if you are updating hundreds of thousands of metrics, you have
> contention from every data_thread on the rrd_mutex. The new code using
> rrdcached is wait-free from gmetad's perspective: every thread can
> make progress as long as it can talk to rrdcached. I know that
> librrd_th exists; I started with that and it simply didn't work
> (1.4.7).
>

I did not realize that rrdtool required a global write lock.  That is 
disappointing.
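
If I'm reading the rrd_client API correctly, the per-thread path you are
describing looks roughly like the sketch below (this is just my reading
of the rrdtool 1.4 client calls, not your actual patch, and the socket
path is made up):

/* Sketch: push one update through rrdcached instead of librrd directly. */
#include <stdio.h>
#include <rrd.h>
#include <rrd_client.h>

int push_update(const char *rrd_path, const char *update /* e.g. "N:1.23" */)
{
    const char *values[] = { update };

    /* Address format per rrdcached(1); the real path would come from
     * configuration. */
    if (rrdc_connect("unix:/var/run/rrdcached.sock") != 0) {
        fprintf(stderr, "rrdcached connect failed: %s\n", rrd_get_error());
        return -1;
    }

    /* rrdcached queues the value and batches the stat/open/write work;
     * the calling thread returns as soon as the command is acknowledged,
     * so data threads do not pile up on a single rrd_mutex. */
    if (rrdc_update(rrd_path, 1, values) != 0) {
        fprintf(stderr, "rrdc_update failed: %s\n", rrd_get_error());
        return -1;
    }
    return 0;
}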

> I'm also not convinced that increasing the dirty page flush delay will
> help better than rrdcached does. Ganglia's RRD updates are very small
> and tons of them fit on a single page. This would fix things for a
> while, until you actually needed to do a page flush (which you've
> delayed for a long time). At that point, those pages are going to be
> locked and writes to the memory-mapped region will block for a longer
> time while more pages are being flushed. I would imagine the symptoms
> of this to be random spots of missing / less granular metrics because
> the time to write them exceeded the validity of the metric. And
> indeed, when I measured, most of the overall execution time was gmetad
> waiting on locks in the VM and VFS layers for stat(2), read(2), and
> write(2) calls. Based on your introduction paragraph, it sounds like
> that's the kind of behavior you're seeing.
>
> Combined with the fact that there's so much lock contention from every
> data thread trying to write metrics, even if I'm wrong about the whole
> preceding paragraph, the current approach simply unnecessarily starves
> writers. Talking natively to rrdcached removes that starvation and
> reduces the need for VFS/VM tuning (that will affect other things
> running on the same system).
>

If I'm understanding this correctly, delaying the dirty page flush defers 
and amortizes disk I/O, but rrdcached also allows the syscalls themselves 
to be deferred and amortized (i.e. stat(2) is only called when rrdcached 
is ready to flush)?  And it also doubles as a queue that decouples 
blocking filesystem calls from gmetad?

How does rrdcached avoid bogging down under the same global librrd lock 
contention?

>>> In the process of doing this, I noticed that ganglia used a particularly 
>>> poor
>>> method for reading its XML metrics from gmond: It initialized a 1024-byte
>>> buffer, read into it, and if it would overflow, it would realloc the buffer 
>>> with
>>> an additional 1024 bytes and try reading again. When dealing with XML files 
>>> many
>>> megabytes in size, this caused many unnecessary reallocations. I modified 
>>> this
>>> code to start with a 128KB buffer and double the buffer size when it runs 
>>> out of
>>> space. (I made a similar change to the code for decompressing gzip'ed data 
>>> that
>>> used a similar buffer sizing paradigm).
>>
>> This sounds like a solid find.  I'm a little worried about the doubling
>> though since, as you said, the responses can get quite large.  Is there a
>> max buffer size?
>
> No, because we already have to read the whole XML buffer before we can
> process it. The XML parser has to validate the whole document. There
> might be a streaming interface to libxml, but I'd rather fix the usage
> of XML than fix the XML parsing to be better.

Sorry, I meant to say a max buffer increment size.  The scenario I was 
worrying about (perhaps unnecessarily) was something along the lines of a 
256 MB allocation to read the last MB of a 129 MB input.
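
Capping the growth step would address that while keeping the doubling
behaviour for small responses.  Something along these lines (the 16 MB
cap is an arbitrary number):

/* Grow geometrically up to MAX_STEP, then linearly.  Illustrative only. */
#include <stdlib.h>

#define MAX_STEP (16UL * 1024 * 1024)   /* never grow by more than 16 MB */

static char *grow_buffer(char *buf, size_t *cap)
{
    size_t step = *cap < MAX_STEP ? *cap : MAX_STEP;
    char *nbuf = realloc(buf, *cap + step);

    if (nbuf != NULL)
        *cap += step;
    return nbuf;   /* NULL means the old buffer is still intact */
}

With a cap like that, reading the last MB of a 129 MB response costs one
extra 16 MB step rather than jumping to 256 MB.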

>>>     * changing the data serialization format from XML to one that is easier 
>>> /
>>> faster to parse,
>>
>> A common request is to support multiple formats (json).  I admit I'm a
>> little surprised that the actual parsing is a significant cost relative
>> to all of the other work that has to be done.
>
> It might turn out to be not much of one once the hash table is fixed.
> But parsing XML isn't cheap compared to other formats.
>

Actually, based on `perf top` my intuition might have been wrong about 
the relative expense of parsing.  I think all of our internal tools 
that consume the gmetad XML use a streaming parser.


>>>     * using a different data structure than a hash table for metrics 
>>> hierarchies
>>> (probably a tree with metrics stored at each level in contiguous memory and 
>>> an
>>> index describing each metric at each level)
>>
>> As you said this is a large change but likely a very beneficial one.  I
>> think it would be particularly interesting if we could pre-generate
>> indexes that would be useful to gweb.
>
> Could you expound on this? I'm not sure what this means.
>

The web UI needs to be able to map hosts to clusters to metrics, for 
example to satisfy an aggregate graph that uses host regular 
expressions.  Currently this involves querying gmetad and caching a 
rather large result set.

https://github.com/ganglia/ganglia-web/blob/6f1d82d5bb5e2b71b6322fd63efa1d57887b3681/conf_default.php.in#L230

Also, the PHP memory limit needs to be increased to very large values 
(512 MB) for stacked graphs and such to function.  My front-end friends 
tell me this is likely a red flag that we are doing a lot in PHP code 
that might be better served by pushing some work down the stack.
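
To be a little more concrete about the pre-generated index idea (all of
the names below are made up), what I am picturing is gmetad handing gweb
a flat, already-sorted structure it can scan or regex over, instead of
gweb rebuilding the mapping from the full XML tree:

/* Made-up sketch of a pre-generated index for gweb. */
#include <stddef.h>

struct metric_index_entry {
    const char *cluster;    /* e.g. "web-frontend" */
    const char *host;       /* e.g. "web01.example.com" */
    const char *metric;     /* e.g. "load_one" */
};

struct metric_index {
    size_t count;
    /* Sorted by (cluster, host, metric), so a host-regex aggregate graph
     * only scans one contiguous array. */
    struct metric_index_entry entries[];
};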

>> Down the road a new data structure might also make it easier to support
>> keeping the last n data points so that we didn't have to worry about the
>> polling time interval dance so much.  That's more of a new feature than
>> directly relevant to these performance issues though.
>
> I think that's only a problem because of the very weak thread-safety
> semantics of the current codebase. By that I mean it's thread safe,
> but there's so much copying and swapping of data going on that
> additional threads don't help out so much. (Indeed, I think most of
> the problems we had would go away if we just ran a single gmetad for
> every data source, but that's no fun).
>

We might be talking about slightly different cases here.  With a series 
of independent processes sending or polling on fixed intervals since 
startup:
  * app/gmetric --> gmond --> gmetad --> gmetad-grid

I find it very difficult to reason about what each poll frequency 
should be so that no updates are missed.  This is particularly tricky 
because each metric can have its own update frequency, and as far as I 
know the sending is not synchronized.  If the last n values for each 
metric were kept, then gmetad's polling could be based on how often the 
user wanted to deal with rrd/ui updates instead of chasing the most 
frequently updated metric.  (If you are writing a new serialization 
format anyway, the ability to tell gmond to only give metrics that have 
updated since time t could reduce network transfer.)
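
To make the "last n values" idea concrete (names below are made up), I am
thinking of something like a small per-metric ring buffer, which is also
what would back a "give me everything updated since time t" request:

/* Made-up sketch: keep the last N reported values per metric so a slower
 * poller can still pick up every update since its previous visit. */
#include <time.h>

#define METRIC_HISTORY 16

struct metric_sample {
    time_t reported;        /* when gmond received the value */
    double value;
};

struct metric_history {
    unsigned int head;      /* index of the newest sample */
    struct metric_sample ring[METRIC_HISTORY];
};

static void record_sample(struct metric_history *h, time_t now, double v)
{
    h->head = (h->head + 1) % METRIC_HISTORY;
    h->ring[h->head].reported = now;
    h->ring[h->head].value = v;
}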


