I've been trying to get rrdcached and collectd to work together under load and am running into a number of issues. I'm using:

- collectd 4.7.4
- rrdtool trunk rev 1889 (had trouble compiling collectd with newer versions)
- rrdcached stats snapshot:
    9 Statistics follow
    QueueLength: 0
    UpdatesReceived: 91123595
    FlushesReceived: 83983
    UpdatesWritten: 210810
    DataSetsWritten: 82474667
    TreeNodesNumber: 25925
    TreeDepth: 17
    JournalBytes: 6846161193
    JournalRotate: 4
- approx 3k updates written to rrdcached per second
- approx 200-300KB written to journal per second
- approx 2k-3k data sets written per second
- rrdcached params: -w 3600 -z 3600 -f 7200 -t 10
- disk I/O is not an issue
- rrdcached memory usage is not an issue (grows to 0.8GB then stays totally flat), no swapping
- running collectd, rrdcached, and custom graphing app on same dual-core server; verified that flushing for graphing is working properly

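For anyone who wants to compare their own numbers against the snapshot above, here is a minimal Python sketch that turns the counters into a couple of ratios. It only assumes the `Name: value` line format shown in the snapshot; the function names `parse_stats` and `ratio` are made up for illustration and are not part of rrdcached or collectd.

```python
# Sketch: parse an rrdcached STATS snapshot and derive simple ratios.
# parse_stats() and ratio() are hypothetical helper names for this example.

SNAPSHOT = """\
9 Statistics follow
QueueLength: 0
UpdatesReceived: 91123595
FlushesReceived: 83983
UpdatesWritten: 210810
DataSetsWritten: 82474667
TreeNodesNumber: 25925
TreeDepth: 17
JournalBytes: 6846161193
JournalRotate: 4
"""


def parse_stats(text):
    """Return a dict of the 'Name: value' lines; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        # The status line ("9 Statistics follow") has no colon and is ignored.
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


def ratio(stats, numerator, denominator):
    """Divide two counters, returning NaN if the denominator is zero."""
    d = stats.get(denominator, 0)
    return stats[numerator] / d if d else float("nan")


if __name__ == "__main__":
    stats = parse_stats(SNAPSHOT)
    # How many received updates are merged into each write to an rrd file.
    print("updates received per update written:",
          round(ratio(stats, "UpdatesReceived", "UpdatesWritten"), 1))
    # How many data sets each write carries on average.
    print("data sets per update written:",
          round(ratio(stats, "DataSetsWritten", "UpdatesWritten"), 1))
```

Both ratios come out at several hundred for the snapshot above, which is at least consistent with the caching doing its job on the write side.
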
First issue is that over time the data in the rrd files lags behind the data arriving at collectd from the network. After 12 hours I see approx a 5 minute lag. I've seen it go to >1.5 hrs after a bunch of days. The symptom is that data in the rrd files continues to advance at the normal rate (20 second interval in our case) but just in the past. The fact that the delay is steady leads me to believe that it's a program bug (I've seen delays due to I/O overload in the past; different rrds then show different lags and jump forward when they finally get some disk attention).

I've done some tests comparing the last_updated in the rrd with what the rrdcached daemon returns to a PENDING command for the same file, and I'm sure the flushing works. The daemon just doesn't receive newer updates. The journal is also in sync with all this. If I restart collectd, the lag vanishes pretty quickly. So either collectd has some queue with a bug, or data is queued in the socket between collectd and rrdcached. I get the same delay whether I use a unix socket or a tcp socket, and the amount of data "queued" is too large to be sitting in system buffers: the rrdcached journal is written at 200kB/sec and I believe that's the same rate at which rrdcached receives data, so even a 5 minute lag is roughly 60MB.

The second issue, which may possibly cause the first one, is that the cpu consumed by rrdcached is way too high. After running for about an hour it consumes a full cpu (~90% user + ~10% system). It could be that that's causing the above lag, dunno. I/O is not a problem as I mentioned, it's pure CPU.

I've compiled rrdcached with -pg to get gprof output, but haven't been successful. I commented out install_signal_handlers (left the USR2 to be able to terminate gracefully) and ran with -g, but the gprof output shows only ~2 minutes of CPU time profiled when the daemon accumulated >250mins. Here's the top of the output:

Each sample counts as 0.01 seconds.
     % cumulative     self               self    total
  time    seconds  seconds     calls   s/call   s/call  name
 55.12      62.39    62.39 280843249     0.00     0.00  buffer_get_field
 11.33      75.22    12.83  93607575     0.00     0.00  send_response
 10.32      86.91    11.68  93464852     0.00     0.00  handle_request_update
  5.36      92.97     6.06                              connection_thread_main
  4.03      97.53     4.57  93683555     0.00     0.00  handle_request
  3.46     101.46     3.92  93484712     0.00     0.00  check_file_access
  3.29     105.18     3.72 176583057     0.00     0.00  next_cmd
  1.33     106.69     1.51  93686967     0.00     0.00  find_command
  1.23     108.08     1.40  88419974     0.00     0.00  journal_write
  1.00     109.22     1.14  93672403     0.00     0.00  has_privilege

It looks to me like that's mostly the journal replay stuff and very little more. If someone has tips on how to get real profiling output, I'm all ears.

The journal replay is too slow. When I terminate the daemon it leaves several GB of journal files behind. Reading those in takes the better part of an hour, during which the daemon is unresponsive. Most of the time is in buffer_get_field. (Note: in the most common cases buffer_get_field copies each field in place, character by character. Seems to me that a simple if statement could avoid the writes.)

By the way, I find the unix socket stuff undebuggable. I switched to TCP sockets because I can telnet to the socket and find out what the daemon is doing. (For example, when nothing seems to work for almost an hour after I start the daemon because it's replaying logs, there is no information about what's going on anywhere.) I'm saying this because everyone recommends the unix sockets for security reasons. It's unusable IMHO.

I think this is very close to being an extremely high performance RRD monitoring system, but it's not quite there yet. I'd appreciate any pointers on what to pursue. I hope that the above descriptions will ring some bells in those of you that wrote some of the code. I'm available to test things out and collect more info. Unfortunately I have only little time to dig into the code myself, sigh.

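On the TCP debugging point above: the telnet check can be scripted. The following is a minimal Python sketch, not anything shipped with rrdtool. It assumes the reply format of the rrdcached text protocol (a status line starting with a number, followed by that many lines when the number is positive, with a negative number meaning an error) and a daemon listening on TCP at localhost port 42217; host, port, and the helper names `read_reply` and `query` are assumptions to adjust.

```python
# Sketch: send one command to rrdcached over TCP and read the reply.
# read_reply() and query() are hypothetical helper names for this example.
import io
import socket


def read_reply(stream):
    """Read one reply from a text stream: a status line, then N more lines.

    Returns (code, message, lines). A negative code signals an error and
    is followed by no extra lines.
    """
    status_line = stream.readline()
    if not status_line:
        raise ConnectionError("connection closed before a reply arrived")
    code_text, _, message = status_line.strip().partition(" ")
    code = int(code_text)
    lines = [stream.readline().rstrip("\r\n") for _ in range(max(code, 0))]
    return code, message, lines


def query(command, host="localhost", port=42217, timeout=5.0):
    """Open a TCP connection, send one command line, return the reply.

    host and port are assumptions; use whatever address the daemon
    was told to listen on.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rw", encoding="ascii", newline="\n") as stream:
            stream.write(command + "\n")
            stream.flush()
            return read_reply(stream)


if __name__ == "__main__":
    # Canned reply so the parsing can be tried without a running daemon.
    canned = io.StringIO("2 Statistics follow\nQueueLength: 0\nTreeDepth: 17\n")
    print(read_reply(canned))
```

Against a live daemon, something like `query("STATS")` or `query("PENDING /path/to/file.rrd")` (the path is a placeholder) should return the same information I have been reading off the telnet session by hand. The timeout matters here: if the daemon does not answer, the call raises instead of hanging.
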
Thanks,
Thorsten

_______________________________________________
collectd mailing list
collectd@verplant.org
http://mailman.verplant.org/listinfo/collectd