Long list of observations and thoughts below...

Florian Forster wrote:
> Hi Thorsten,
>
> On Fri, Oct 09, 2009 at 04:41:55PM -0700, Thorsten von Eicken wrote:
> > > This sounds like collectd not sending updates to rrdcached.  If they
> > > are not in the journal, then rrdcached has not received them.
> >
> > Yes, the question is whether it's collectd's fault or rrdcached's
> > fault..

> If RRDCacheD takes too long to answer, the dispatch thread will wait
> there and not dequeue any more values from that queue of received and
> unparsed packets. If this is the case, you should see some (linear?)
> memory growth of the collectd process. You can also try to forcibly
> quit collectd (kill -9) and immediately restart it. If the data for
> the RRD files that were lagging behind is simply lost, that is an
> indication of the data sitting within collectd, waiting to be sent to
> RRDCacheD.
>
> (It's not yet possible to “watch” the length of this queue directly.
> I'll add some measurements to the Network plugin so we can see what's
> going on eventually …)
  
The linear memory growth is very clear. However, there are a number of things that still bug me:

 - collectd+rrdcached were running steadily, processing ~25'000 tree nodes with ~2'500 updates per second (rrdcached's UpdatesReceived stats counter). I then threw another ~30'000 tree nodes with ~3'000 updates per second at it (this is all real traffic, not a simulation). Due to the way we deal with the creation of the required new RRDs, this caused very heavy disk activity for a while, slowing down collectd and rrdcached, so collectd started buffering for ~15 minutes, during which time it grew from ~40MB to just under 300MB; all good and expected so far. It then stayed steady at that size, and judging by rrdcached's UpdatesReceived it must have been able to clear its backlog. Then I threw yet another 30'000 tree nodes and the corresponding updates at it. At that point, collectd immediately started to grow linearly again, to over 600MB. Given that it has more traffic coming at it I expect it to grow larger buffers than before, but what bothers me is that it started to grow immediately. It's as if the previous 250MB of buffers hadn't been freed (in the malloc sense; I understand that the process size isn't going to shrink). Could it be that there is a bug?

 - if rrdcached is restarted, collectd doesn't reconnect. I know this is the case for TCP sockets, but I'm pretty sure I observed it with the unix socket too. This is a problem because restarting collectd loses the data it buffered while rrdcached was down. (A rough sketch of the reconnect logic I'd like to see is below, after this list.)

 - the -z parameter is nice, but not quite there yet. I'm running with -w 3600 -z 3600, and the situation after the first hour is not pretty: a ton of flushes, followed by a lull, and a repeat after another hour. It takes about 4 hours before everything stabilizes and becomes smooth. I'm wondering whether it would be difficult to change to an adaptive-rate system where, given -w 3600 and the current number of dirty tree nodes, rrdcached computes the rate at which it needs to flush to disk and then does that. If you think about it, within one collection interval (20s in my case) it would know the total set of RRDs (tree nodes), and they would all be dirty. In my case it would periodically compute the ratio (e.g. 25'000 tree nodes to flush over 3600 seconds = 6.9 flushes per second) and would start flushing the oldest dirty nodes immediately, even though they've been dirty for much less than 3600 seconds. Of course rrdcached would need to re-evaluate the flush rate periodically, but if it keeps a running counter of dirty tree nodes that should be pretty easy. All this should put the daemon into a steady state from the very beginning. (See the sketch after this list for what I mean.)

 - running with 80-90k tree nodes for a while ended up bringing rrdcached to its knees. What I observe is that over time rrdcached uses more and more CPU and starts seeing page faults. Eventually, rrdcached comes to a crawl and neither keeps up with the input (so collectd starts growing) nor manages to maintain its write rate. The page faults are interesting because no swap space is used (it stays at 64k usage, which is the initial state). The only explanation I've come up with is that at the point where the "working set" of all the RRDs exceeds the amount of memory available (I have 8GB), everything starts degrading. At that point, rrdcached fights against the buffer cache and starts seeing page faults. Its write threads also slow down because now the disk is not just being written but also read (I can see that happening). I assume that once it page-faults the whole process slows down, meaning that not just the queue threads but also the connection threads start slowing down, which then causes collectd to start buffering data and grow -- it grew to >2GB for me! That puts even more pressure on memory and we're in a downward spiral. It's not yet clear to me whether the disk used for RRDs is maxed out when this process starts (eventually it does max out), so I don't know whether I'm hitting a hard disk I/O limit or whether I just spiral into it by successively reducing the amount of buffer cache available. I suspect it would be possible to push the system further if the various rrdcached threads could be decoupled better. Also, being able to put an upper bound on collectd's memory would be smart, because it's clear that at some point the growth becomes self-defeating. It could randomly drop samples when it hits the limit, and that would probably lead to an overall happier outcome. (A sketch of such a bounded buffer is below, after this list.)

 - I'm wondering how we could overcome the RRD working-set issue. Even with rrdcached and long cache periods (e.g. I use 1 hour) it seems that the system comes to a crawl once the RRD working set exceeds memory. One idea that came to mind is to use the caching in rrdcached to convert the random small writes that are typical for RRDs into more of a sequential access pattern. If we could tweak the RRD creation and the cache write-back algorithm such that RRDs are always accessed in the same order, and we manage to get the RRDs allocated on disk in that order, then we could use the cache to essentially do one sweep through the disk per cache flush period (e.g. per hour in my case). Of course on-demand flushes and other things would interrupt this sweep, but the bulk of accesses could end up being more or less sequential. I believe that doing the cache write-back in a specific order is not too difficult; what I'm not sure of is how to get the RRD files allocated on disk in that order too. Any thoughts? (A sketch of the ordered write-back half follows below.)
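
To make the reconnect point (second item above) more concrete, here is a minimal sketch of the retry logic I'd like to see on the client side. It uses a plain UNIX socket; the socket path, function names and back-off values are made up for illustration, this is not collectd's actual rrdtool-plugin code:

  /* Sketch only: reconnect with capped back-off to rrdcached's UNIX socket.
   * Path and timing constants are placeholders. */
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  static int rrdcached_connect(const char *path)
  {
      struct sockaddr_un sa;
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      if (fd < 0)
          return -1;

      memset(&sa, 0, sizeof(sa));
      sa.sun_family = AF_UNIX;
      strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);

      if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
          close(fd);
          return -1;
      }
      return fd;
  }

  /* Called whenever a write fails (EPIPE/ECONNRESET): keep the buffered
   * values in memory and retry the connection with a capped back-off,
   * instead of staying disconnected until collectd itself is restarted. */
  static int rrdcached_reconnect(const char *path)
  {
      unsigned int delay = 1;
      for (;;) {
          int fd = rrdcached_connect(path);
          if (fd >= 0)
              return fd;
          sleep(delay);
          if (delay < 30)
              delay *= 2;
      }
  }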
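
Regarding the adaptive flush rate (third item): a rough sketch of the kind of computation I have in mind. The names are invented for illustration; this isn't rrdcached code, just the arithmetic:

  /* Sketch: spread the currently dirty nodes evenly over the -w window
   * and re-evaluate periodically, e.g. once per second. */
  #include <stddef.h>

  static double compute_flush_rate(size_t dirty_nodes, double write_timeout)
  {
      /* e.g. 25'000 dirty nodes / 3600 s ~= 6.9 flushes per second */
      if (write_timeout <= 0.0)
          return 0.0;
      return (double)dirty_nodes / write_timeout;
  }

  static size_t nodes_to_flush_now(size_t dirty_nodes, double write_timeout,
                                   double interval)
  {
      /* Number of oldest dirty nodes to enqueue for flushing during this
       * evaluation interval; flushing starts right away instead of
       * waiting for nodes to reach the full -w age. */
      double rate = compute_flush_rate(dirty_nodes, write_timeout);
      size_t n = (size_t)(rate * interval + 0.5);
      return (n > dirty_nodes) ? dirty_nodes : n;
  }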
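
For the upper bound on collectd's memory (fourth item), something along these lines is what I was thinking of. Structure, names and the limit are made up; dropping a randomly chosen queued sample instead of the newest one would be a refinement:

  /* Sketch: bounded sample queue that refuses to grow past a configured
   * limit and drops samples once it is full. */
  #include <stdlib.h>

  #define QUEUE_LIMIT 100000   /* hypothetical cap on buffered samples */

  typedef struct sample_s {
      char *update_line;       /* e.g. "UPDATE <file> <time>:<value>" */
      struct sample_s *next;
  } sample_t;

  static sample_t *queue_head, *queue_tail;
  static size_t    queue_len;

  static int enqueue_sample(sample_t *s)
  {
      if (queue_len >= QUEUE_LIMIT) {
          /* The queue is at its cap: drop the sample instead of growing
           * without bound.  Losing an occasional 20s sample is cheaper
           * than the memory spiral described above. */
          free(s->update_line);
          free(s);
          return -1;
      }
      s->next = NULL;
      if (queue_tail != NULL)
          queue_tail->next = s;
      else
          queue_head = s;
      queue_tail = s;
      queue_len++;
      return 0;
  }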
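
And for the sequential-sweep idea (last item), the write-back half could be as simple as ordering each sweep by file path, as a stand-in for true on-disk allocation order. Again the types and names are invented, this is not rrdcached's real queue code:

  /* Sketch: flush the dirty nodes of one sweep in a fixed, path-sorted
   * order.  If the RRD files were also created/preallocated in that same
   * order, the writes approximate one sequential pass over the disk per
   * cache period. */
  #include <stdlib.h>
  #include <string.h>

  typedef struct {
      const char *filename;    /* path of the RRD file */
      /* ... cached values ... */
  } dirty_node_t;

  static int cmp_by_path(const void *a, const void *b)
  {
      const dirty_node_t *na = a;
      const dirty_node_t *nb = b;
      return strcmp(na->filename, nb->filename);
  }

  static void order_flush_sweep(dirty_node_t *nodes, size_t n)
  {
      qsort(nodes, n, sizeof(nodes[0]), cmp_by_path);
  }

The harder half, getting the files laid out on disk in that same order in the first place, is the part I don't have a good answer for.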

Cheers,
Thorsten
_______________________________________________
rrd-developers mailing list
rrd-developers@lists.oetiker.ch
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers
