2013/12/8 Chris Burroughs <chris.burrou...@gmail.com>:
> On 12/07/2013 03:22 PM, Devon H. O'Dell wrote:
>>
>> We were polling every 10 seconds, and it was taking over 2 minutes to
>> finish parsing the XML (which includes writing the RRDs). With my
>> changes, the 10 second poll is feasible.
>>
>
> Parse, or fetch & parse?  We have a simple python script we use to generate
> meta metrics about gmetad (total hosts, metrics, clusters, etc).  It can
> parse and do that counting on a 170 MiB file in about 15 seconds.  2 minutes
> would seem to indicate something very wrong with how gmetad's C code is
> handling xml, or you have GBs of xml ;-)

The actual XML parsing does not take that long. But more on that in a sec.

>>>> Initially, I spent a large amount of time using the `perf` utility to
>>>> attempt to
>>>> find the bottleneck in the interactive `gmetad` service. I found that
>>>> the hash
>>>> table implementation in `gmetad` leaves a lot to be desired: apart from
>>>> very
>>>> poor behavior in the face of concurrency, it also is the wrong
>>>> datastructure to
>>>> use for this purpose. Unfortunately, fixing this would require rewriting
>>>> large
>>>> swaths of Ganglia, so this was punted. Instead, Vlad suggested we simply
>>>> reduce
>>>> the number of summarized metrics by explicitly stating which metrics are
>>>> summarized.
>>>
>>>
>>> I strongly suspect a lot of blame should fall on the hash table.  I
>>> think it's likely the cause of the hangs we have observed in
>>> https://github.com/ganglia/monitor-core/issues/47 and Daniel Pocock has
>>> seen problems with it as far back as 2009.
>>
>>
>> Most of the user CPU time is indeed spent in the hash table, but most
>> of the CPU time in general ends up being spent writing RRDs. This is
>> due to filesystem locking in the kernel generated by the stat(2) calls
>> every time we want to write an RRD, as well as the poor performance of
>> straight librrd. (But I'll get to that in a minute in the rrdcached
>> section.)
>>
>
> This is a simple `perf top -p $PID` on one of of our gmetad nodes
>
> Samples: 1M of event 'cycles', Event count (approx.): 64115959770
>   6.59%  libexpat.so.1.5.2          [.] 0x0000000000011b8d
>   4.77%  libganglia-3.6.0.so.0.0.0  [.] hashval
>   2.62%  [kernel]                   [k] __d_lookup
>   2.21%  [kernel]                   [k] _spin_lock
>   2.14%  libc-2.12.so               [.] vfprintf
>   1.61%  librrd.so.4.2.0            [.] process_arg
>   1.54%  libganglia-3.6.0.so.0.0.0  [.] hash_lookup
>   1.46%  [kernel]                   [k] __link_path_walk
>   1.16%  libc-2.12.so               [.] __GI_____strtod_l_internal
>   1.11%  libc-2.12.so               [.] memcpy
>   1.08%  libc-2.12.so               [.] _int_malloc
>
> So I suppose my intuition about xml parsing expense is off.  I have not used
> perf as much as I should; if we were seeing similar rrd-writing contention,
> should I literally see "stat" near the top?

Ah, so to see what's really going on:

perf record -e cpu-clock -g -p $PID

Let that run for a minute or two. Then:

perf report --sort=comm,dso,symbol -G

If you don't have cpu-clock, cycles is OK, but you definitely want to
see the call graph. Most of the time attributed to the XML code is
actually spent writing RRDs, and you only see that by digging down into
the call chain.

>>> Most of my experience is with python and java which have fancy abstract
>>> base classes, collection hierarchies, and whatnot.  Even though a hash
>>> table does not have the best properties, wouldn't it be relatively easy
>>> to drop in a better one?
>>
>>
>> Yes, this is something I intend to do soon. I'm not sure what kind of
>> portability targets are required for gmetad; I was intending to build
>> one on top of Concurrency Kit's excellent hash set.
>>
>
> The wiki says 'Ganglia runs on Linux (i386, ia64, sparc, alpha, powerpc,
> m68k, mips, arm, hppa, s390), FreeBSD, NetBSD, OpenBSD, DragonflyBSD, MacOS
> X, Solaris, AIX, IRIX, Tru64, HPUX and Windows NT/XP/2000/2003/2008 making
> it as portable as it is scalable.', although I'm not sure if that accurately
> reflects current targets or not.

I can make my CK stuff optional. I've made some pretty good
improvements in the hash table without fully replacing it, but I don't
know how it will run on embedded architectures like m68k, mips, or
arm. I also don't really have any interest in any architectures other
than amd64 and ARM. (Recent SPARC or PPC might be interesting, but I
don't have access to them.)
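
To give a rough idea of what "optional" would look like: something along
the lines of the sketch below, where a ck_hs-backed metric set only gets
compiled in when CK is available. (GMETAD_HAVE_CK and the metric_set_*
names are made up for illustration, not anything in the tree.)

#ifdef GMETAD_HAVE_CK
#include <ck_hs.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static void *hs_malloc(size_t n) { return malloc(n); }
static void hs_free(void *p, size_t n, bool defer) { (void)n; (void)defer; free(p); }
static struct ck_malloc hs_allocator = { .malloc = hs_malloc, .free = hs_free };

/* Keys are NUL-terminated metric names. */
static unsigned long metric_hash(const void *key, unsigned long seed)
{
        const char *s = key;
        unsigned long h = seed;

        while (*s != '\0')
                h = h * 31 + (unsigned char)*s++;
        return h;
}

static bool metric_cmp(const void *a, const void *b)
{
        return strcmp(a, b) == 0;
}

static struct ck_hs metric_set;

bool metric_set_init(void)
{
        /* One writer per set, lock-free concurrent readers. */
        return ck_hs_init(&metric_set, CK_HS_MODE_OBJECT | CK_HS_MODE_SPMC,
            metric_hash, metric_cmp, &hs_allocator, 1024, 6602);
}

bool metric_set_insert(const char *name)
{
        return ck_hs_put(&metric_set,
            CK_HS_HASH(&metric_set, metric_hash, name), name);
}

void *metric_set_lookup(const char *name)
{
        return ck_hs_get(&metric_set,
            CK_HS_HASH(&metric_set, metric_hash, name), name);
}
#else
/* Fall back to the existing libganglia hash table. */
#endif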

>>> Anyway we have had much better luck with tuning the page cache and
>>> disabling fsync for gmetad.
> http://oss.oetiker.ch/rrdtool-trac/wiki/TuningRRD  Admittedly at least
>>> some of the problems we had with rrdcached could have been due to the
>>> issues you have identified.
>>
>>
>> Even with librrd using mmap(2) and friends, there's a lot of system
>> call overhead as the RRD file is open(2)ed, locked, and then written
>> to. gmetad also serializes every thread around the calls to librrd, so
>> if you are updating hundreds of thousands of metrics, you have
>> contention from every data_thread on the rrd_mutex. The new code using
>> rrdcached is wait-free from gmetad's perspective: every thread can
>> make progress as long as it can talk to rrdcached. I know that
>> librrd_th exists; I started with that and it simply didn't work
>> (1.4.7).
>>
>
> I did not realize that rrdtool required a global write lock.  That is
> disappointing.

Yeah :(

>> I'm also not convinced that increasing the dirty page flush delay will
>> help more than rrdcached does. Ganglia's RRD updates are very small
>> and tons of them fit on a single page. This would fix things for a
>> while, until you actually needed to do a page flush (which you've
>> delayed for a long time). At that point, those pages are going to be
>> locked and writes to the memory-mapped region will block for a longer
>> time while more pages are being flushed. I would imagine the symptoms
>> of this to be random spots of missing / less granular metrics because
>> the time to write them exceeded the validity of the metric. And
>> indeed, when I measured, most of the overall execution time was gmetad
>> waiting on locks in the VM and VFS layers for stat(2), read(2), and
>> write(2) calls. Based on your introduction paragraph, it sounds like
>> that's the kind of behavior you're seeing.
>>
>> Combined with the fact that there's so much lock contention from every
>> data thread trying to write metrics, even if I'm wrong about the whole
>> preceding paragraph, the current approach simply unnecessarily starves
>> writers. Talking natively to rrdcached removes that starvation and
>> reduces the need for VFS/VM tuning (that will affect other things
>> running on the same system).
>>
>
> If I'm understanding this correctly, delaying the dirty page flush defers and
> amortizes disk I/O, but rrdcached also allows syscalls to be deferred and
> amortized (i.e. call stat(2) only when rrdcached is ready to flush)?  And it
> also doubles as a queue to decouple blocking file system calls from gmetad?
>
> How does rrdcached avoid bogging down under the same global librrd lock
> contention?

Rrdcached uses the librrd_th API, so maybe I was just using it
incorrectly. But even if it didn't, it would still be able to run
multiple rrd_updates under a single lock acquisition. It does have a
global cache lock, though.
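
To make the "natively" part concrete, all it really takes is speaking
the daemon's line protocol over its socket. A minimal sketch (not the
actual gmetad code; error handling is mostly elided, and the socket path
is whatever you passed to rrdcached -l):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int rrdcached_connect(const char *path)
{
        struct sockaddr_un sun;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        memset(&sun, 0, sizeof sun);
        sun.sun_family = AF_UNIX;
        strncpy(sun.sun_path, path, sizeof sun.sun_path - 1);
        if (connect(fd, (struct sockaddr *)&sun, sizeof sun) == -1) {
                close(fd);
                return -1;
        }
        return fd;
}

/* Queue one update; the daemon answers with a "<status> <text>" line,
 * status >= 0 meaning success (e.g. "0 errors"). */
int rrdcached_update(int fd, const char *rrd_path, const char *value)
{
        char buf[512];
        int n = snprintf(buf, sizeof buf, "UPDATE %s %s\n", rrd_path, value);

        if (n < 0 || (size_t)n >= sizeof buf)
                return -1;
        if (write(fd, buf, n) != n)
                return -1;
        n = read(fd, buf, sizeof buf - 1);
        if (n <= 0)
                return -1;
        buf[n] = '\0';
        return atoi(buf) >= 0 ? 0 : -1;
}

Each data thread can keep its own connection, so nothing has to
serialize on rrd_mutex; when and how the files actually get written
becomes the daemon's problem.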

>>>> In the process of doing this, I noticed that ganglia used a particularly
>>>> poor
>>>> method for reading its XML metrics from gmond: It initialized a
>>>> 1024-byte
>>>> buffer, read into it, and if it would overflow, it would realloc the
>>>> buffer with
>>>> an additional 1024 bytes and try reading again. When dealing with XML
>>>> files many
>>>> megabytes in size, this caused many unnecessary reallocations. I
>>>> modified this
>>>> code to start with a 128KB buffer and double the buffer size when it
>>>> runs out of
>>>> space. (I made a similar change to the code for decompressing gzip'ed
>>>> data that
>>>> used a similar buffer sizing paradigm).
>>>
>>>
>>> This sounds like a solid find.  I'm a little worried about the doubling
>>> though since as you said the responses can get quiet large.  Is there a
>>> max buffer size?
>>
>>
>> No, because we already have to read the whole XML buffer before we can
>> process it. The XML parser has to validate the whole document. There
>> might be a streaming interface to libxml, but I'd rather fix the usage
>> of XML than fix the XML parsing to be better.
>
>
> Sorry I meant to say a max buffer increment size.  The scenario I was
> worrying about (perhaps unnecessarily) was something along the lines of a 256
> MB allocation to read the last MB of a 129 MB input.

Ah. I guess I could add something like that. I don't think the read(2)
will buffer that much data anyway.
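
Something along these lines, say (constants pulled out of the air, not
what the patch actually uses): keep doubling while the buffer is small,
and switch to a fixed increment once it gets big.

#include <stdlib.h>

#define BUF_INITIAL   (128u * 1024u)            /* 128 KB starting size */
#define BUF_MAX_STEP  (16u * 1024u * 1024u)     /* largest single increment */

static size_t next_buf_size(size_t current)
{
        if (current < BUF_MAX_STEP)
                return current * 2;             /* geometric while small */
        return current + BUF_MAX_STEP;          /* linear once large */
}

static char *grow_buf(char *buf, size_t *size)
{
        size_t want = (*size == 0) ? BUF_INITIAL : next_buf_size(*size);
        char *p = realloc(buf, want);

        if (p == NULL)
                return NULL;                    /* caller keeps the old buffer */
        *size = want;
        return p;
}

That keeps the worst case at one extra BUF_MAX_STEP of slack instead of
doubling a huge buffer for the last megabyte of input.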

>>>>     * changing the data serialization format from XML to one that is
>>>> easier /
>>>> faster to parse,
>>>
>>>
>>> A common request is to support multiple formats (json).  I admit I'm a
>>> little surprised that the actual parsing is a significant cost relative
>>> to all of the other work that has to be done.
>>
>>
>> It might turn out to be not much of one once the hash table is fixed.
>> But parsing XML isn't cheap compared to other formats.
>>
>
> Actually based on `perf top` my intuition might have been wrong about the
> relative expense of parsing.  I think all of our internal tools that
> consume the gmetad xml use a streaming parser.

I'd really, really, really be surprised if you find that the expense
wasn't in write_data_to_rrd or whatever it is.

>>>>     * using a different data structure than a hash table for metrics
>>>> hierarchies
>>>> (probably a tree with metrics stored at each level in contiguous memory
>>>> and an
>>>> index describing each metric at each level)
>>>
>>>
>>> As you said this is a large change but likely a very beneficial one.  I
>>> think it would be particular interesting if we could pre-generate
>>> indexes that would be useful to gweb.
>>
>>
>> Could you expound on this? I'm not sure what this means.
>>
>
> The web ui needs to be able to map hosts to clusters to metrics.  For
> example to satisfy an aggregate graph that uses host regular expressions.
> Currently this involves querying gmetad and caching a rather large result
> set.
>
> https://github.com/ganglia/ganglia-web/blob/6f1d82d5bb5e2b71b6322fd63efa1d57887b3681/conf_default.php.in#L230
>
> Also the php memory limit needs to be increased to very large values (512
> MB) for stacked graphs and such to function.  My front-end friends tell me
> this is likely a red flag that we are doing a lot in php code that might be
> better served by pushing some work down the stack.
>
>
>>> Down the road a new data structure might also make it easier to support
>>> keeping the last n data points so that we didn't have to worry about the
>>> polling time interval dance so much.  That's more of a new feature than
>>> directly relevant to these performance issues though.
>>
>>
>> I think that's only a problem because of the very weak thread-safety
>> semantics of the current codebase. By that I mean it's thread safe,
>> but there's so much copying and swapping of data going on that
>> additional threads don't help out so much. (Indeed, I think most of
>> the problems we had would go away if we just ran a single gmetad for
>> every data source, but that's no fun).
>>
>
> We might be talking about slightly different cases here.  With a series of
> independent processes sending or polling on fixed intervals since startup:
>  * app/gmetric --> gmond --> gmetad --> gmetad-grid
>
> I find it very difficult to reason about what each poll frequency
> should be so as not to miss any updates.  This is particularly tricky because
> each metric can have its own update frequency, and as far as I know the
> sending is not synchronized. If the last n values for each metric were kept,
> then gmetad polling could be based on how often the user wanted to
> deal with rrd/ui updates instead of chasing the most frequently updated
> metric.  (If you are writing a new serialization format anyway the ability
> to tell gmond to only give metrics that have updated since time t could
> reduce network transfer.)

I see what you mean. I'll think about this, too. A delta-friendly data
structure changes things a little :)
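
For example, even something as dumb as a fixed ring of the last n
samples per metric would let the collector poll on its own schedule and
ask for "everything since t". (Sketch only; the names and sizes here are
made up.)

#include <time.h>

#define METRIC_HISTORY 16

struct metric_sample {
        time_t when;
        double value;
};

struct metric_history {
        unsigned int head;              /* next slot to overwrite */
        struct metric_sample ring[METRIC_HISTORY];
};

static void history_push(struct metric_history *h, time_t when, double value)
{
        h->ring[h->head].when = when;
        h->ring[h->head].value = value;
        h->head = (h->head + 1) % METRIC_HISTORY;
}

/* Copy out every sample newer than 'since'; returns how many were copied. */
static unsigned int history_since(const struct metric_history *h, time_t since,
    struct metric_sample *out, unsigned int max)
{
        unsigned int i, n = 0;

        for (i = 0; i < METRIC_HISTORY && n < max; i++) {
                const struct metric_sample *s =
                    &h->ring[(h->head + i) % METRIC_HISTORY];
                if (s->when > since)
                        out[n++] = *s;
        }
        return n;
}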

--dho
