Hi Yves

Thanks for your reply! 

We are not using rrdcached. I was aware of it but not in any detail, thanks for 
the recommendation. I can see that even just decoupling the RRD file writes to 
a separate process has big benefits, eg just being able to restart collectd 
without triggering a flush of all RRD files. I'll look at using it on the next 
build. 
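In case it's useful to anyone following along, this is roughly the setup I 
have in mind; the socket and data paths are just placeholders for our layout, 
and I haven't verified the options against our collectd version yet:

```
# Start rrdcached first (hold updates in memory for up to 30 min,
# with up to 15 min of jitter, flushing old data hourly):
#   rrdcached -l unix:/var/run/rrdcached.sock \
#             -b /var/lib/collectd/rrd -B \
#             -w 1800 -z 900 -f 3600

# Then switch collectd from the rrdtool plugin to rrdcached:
LoadPlugin rrdcached
<Plugin rrdcached>
  DaemonAddress "unix:/var/run/rrdcached.sock"
  DataDir "/var/lib/collectd/rrd"
  CreateFiles true
</Plugin>
```

If I understand it correctly, restarting collectd then no longer forces a 
flush of all 24,000 files, since rrdcached keeps running.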

So with issue #75, am I correct in thinking that it could explain the 75-minute 
freeze on RRD updates if a large number of RRDs were being created at that 
time? And does the fact that RRDs for the collectd server machine itself 
continued to be updated rule out #75 as a possible cause? 

Perhaps a new set of RRDs needing to be written could have backed up the 
receive queue in the network plugin's thread and triggered a separate bug 
there? 

OK, I've just thought to look at memory usage on the machine. During the last 
occurrence of this problem, used memory increased linearly until collectd was 
restarted, which freed the memory, e.g. between 02:20 and 03:30 on this 
graph: 
http://f.cl.ly/items/0z2P0e3C0e1S3G0W3z3t/Screen%20Shot%202013-01-16%20at%2010.58.41%20PM.png

I suppose this could represent all the queued stats accumulating over time 
because they couldn't be written to disk.

And in regards to your other questions:

> About your CPU shooting the roof, could you check if it works full time or if 
> it is waiting for your disk ? (iostat should help).

Most of the time the filesystem is doing around 1,600 write operations per 
second. During the 75-minute period today when statistics were not being 
written, the disks were doing very little and there was no IO wait. The 
collectd process was using around 107% CPU as reported by top, so perhaps one 
thread was consuming 100% of one core with the remaining 7% spread across 
other threads on other cores. Normally collectd uses only between 3% and 8% 
CPU. 
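Next time it wedges I'll try to confirm which thread is actually spinning 
before restarting. Rough plan, assuming a single collectd process:

```
$ top -H -p $(pidof collectd)                        # per-thread CPU, live
$ ps -L -o lwp,pcpu,stat,comm -p $(pidof collectd)   # one-shot per-thread view
$ iostat -x dm-0 5                                   # extended disk stats, 5s samples
```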

> 
> Are you sure your disk is not 99% full (perfs are lower when a disk is nearly 
> full).

Yes. It's at 92% currently. 

> Are you sure your disk is not broken ?

No :-) But I would be surprised. There are no IO errors logged at the OS 
level. It is a filesystem mounted from a SAN. The collectd server is running 
in an Ubuntu VM on ESXi, and the filesystem is a mapper device over four 
underlying disk image files on VMFS. (Yes, this is somewhat convoluted! Future 
stats servers will be physical machines with a lot of local spindles.)

> 
> With iostat, if you have a FS dedicated to the rrd files, have you checked 
> that it is that FS and not another that is working slowly ?

Yes :-) The device is dm-0. Most of the time it sits around 1,600 write ops 
per second. When the problem occurred it dropped to around 15 write ops per 
second. Disk write time decreased from around 1.4 to around 0.2 while the 
problem was occurring, reflecting a lower load on the disks, I presume. After 
restarting collectd both figures returned to normal within a few minutes. 

We don't use a separate filesystem for the RRDs; it's just the root 
filesystem. collectd is all this box does. 
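Also, on my original question about peering into collectd: rather than going 
straight to kill -9 next time, I'm thinking of grabbing a stack trace of every 
thread first, along these lines (needs gdb, plus debug symbols if Ubuntu ships 
a collectd-dbg package, which I haven't checked):

```
$ gdb -p $(pidof collectd) -batch -ex 'thread apply all bt' > collectd-threads.txt
```

That should show whether it's the network thread, the rrdtool write thread, or 
something else that's stuck. (gdb pauses the process while attached, but since 
it's wedged anyway that seems acceptable.)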

Cheers
Jesse

On 16/01/2013, at 5:45 PM, Yves Mettier <ymett...@free.fr> wrote:

> Hello,
> 
> Issue #75 is the first thing I'm thinking about.
> 
> Are you using rrdcached ?
> If not, you should (but with so many rrds, I'm sure you are).
> 
> If yes, try to configure your collectd to *not* create rrds for some hours 
> (maybe one or two days).
> If this is better for you, you are probably experiencing issue #75.
> 
> Have a look at https://github.com/collectd/collectd/issues/75.
> As far as I know, there is no "good" solution. Only some tips and tricks.
> 
> 
> If not issue #75, here are some ideas...
> 
> About your CPU shooting the roof, could you check if it works full time or if 
> it is waiting for your disk ? (iostat should help).
> 
> Are you sure your disk is not 99% full (perfs are lower when a disk is nearly 
> full).
> Are you sure your disk is not broken ?
> 
> With iostat, if you have a FS dedicated to the rrd files, have you checked 
> that it is that FS and not another that is working slowly ?
> 
> Note : I'm using 5.2, so I will not be able to help you better.
> 
> Regards,
> Yves
> 
> Le 2013-01-16 07:32, Jesse Reynolds a écrit :
>> Hello
>> 
>> We have a collectd server that is writing to about 24,000 RRD files,
>> most of which are 15 MB each (with some at 30 MB and some at 45 MB),
>> about 480 GB of RRD files in all.
>> 
>> On occasion we are seeing disk writes drop right down to a trickle,
>> and at the same time collectd's CPU shooting through the roof. Once
>> collectd goes into this state it can be like this for hours, and the
>> RRD files are mostly not being updated in this time. The only way to
>> get things going again is to 'kill -9 <collectd's pids>' and start
>> collectd again.
>> 
>> The RRD files for data originating within this instance of collectd
>> (and not coming via the network plugin) are not interrupted, so it is
>> something to do with the network plugin, it seems.
>> 
>> Has anyone got any advice on how we might chase this problem down
>> further? We are on Ubuntu 12.04.
>> 
>> Is it possible to peer into collectd to see if it's a problem with
>> the network plugin, or the rrd plugin, or something else?
>> 
>> Thank you
>> Jesse
>> 
>> 
>> _______________________________________________
>> collectd mailing list
>> collectd@verplant.org
>> http://mailman.verplant.org/listinfo/collectd
> 
> -- 
> - Homepage       - http://ymettier.free.fr                             -
> - GPG key        - http://ymettier.free.fr/gpg.txt                     -
> - C en action    - http://ymettier.free.fr/livres/C_en_action_ed2.html -
> - Guide Survie C - http://www.pearson.fr/livre/?GCOI=27440100673730    -
> 


