On Wed, Aug 16, 2006 at 07:19:01AM -0400, Richard A Steenbergen wrote:
> On Wed, Aug 16, 2006 at 08:10:09AM +0200, Henrik Stoerner wrote:
> > However, my main system for this currently has about 20,000 RRD files,
> > all of which are updated every 5 minutes. So that's about 70 updates
> > per second, and I can see that the amount of disk I/O happening on
> > this server will become a performance problem soon, as more systems are
> > added and hence more RRD files need updating.
> The situation I was trying to solve involved a constant stream of high 
> resolution data across a large set of records, and relatively infrequent 
> viewing of that data. It sounds like you're trying to do something 
> similar. Honestly if all you care about is databasing it would probably be 
> easier to ditch RRD and use something else or write your own db which is 
> more efficient, but at the end of the day (for me anyways :P) rrdtool does 
> the best job of producing pretty pictures that don't look like they came 
> off of gnuplot or my EKG, and I'm in no mood to become a graphics person 
> and re-invent the wheel.

I would be very sad to drop RRDtool, for those very reasons. It is the
de facto standard for storing time-series data on Unix, and there
are so many neat utilities around for working with RRD files.

> So, probably your biggest issue is indeed thrashing the hell out of the 
> disk if you just tried to naively fire off a pile of forks and hope it all 
> works out for the best. [snip]
> Obviously a syscall to exec a shell to run the rrdtool binary every time 
> scales to about nothing, and the API (if you can even call it that, I 
> don't think (argc, argv) counts :P) to rrdtool functions in C really and 
> truly bites. If your application is in C, and you can link directly to the 
> librrd, that's a quick and dirty fix for at least some of the evils.

That is basically what I do.

The fork()/exec() calls have been eliminated, since Hobbit uses a module
which calls directly into the rrdtool library API, so I am calling
the rrd_update() function directly. (Whew - I wouldn't even dare to think
how much more overhead there would be in doing the updates via the rrdtool
command-line tool.)

> The big daddy of performance suck is then going to be, opening, closing, 
> and seeking the right spot in the files every time.


I can see you've been through many of the same deliberations as I have,
and come to just about the same conclusions. More spindles would help,
but only up to a point. Using RAM disks and keeping a cache of open file
handles is not going to work with the amount of data I have, unfortunately.

Consolidating datapoints into fewer files is a possibility, but at the
cost of making the code doing updates more complex - it is not
guaranteed that all of the data updates for a file will be available
at the same time.
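
To make the extra complexity concrete, here is a rough sketch of what
the update code would have to do if several values shared one RRD file.
It is only an illustration, not how Hobbit works: the UpdateCoalescer
class and the data-source names are hypothetical. The only rrdtool facts
it relies on are the "timestamp:value:value" update format and "U" as
the marker for an unknown value.

```python
from collections import defaultdict


class UpdateCoalescer:
    """Collect values per timestamp for one multi-data-source RRD file."""

    def __init__(self, ds_names):
        # ds_names: order of the data sources in the (hypothetical) RRD file
        self.ds_names = list(ds_names)
        self.pending = defaultdict(dict)  # timestamp -> {ds_name: value}

    def add(self, timestamp, ds_name, value):
        """Remember one value until the timestamp is flushed."""
        if ds_name not in self.ds_names:
            raise KeyError(ds_name)
        self.pending[timestamp][ds_name] = value

    def flush(self, upto):
        """Return update strings for all timestamps <= upto, oldest first.

        Values that never arrived are written as "U" (unknown).
        """
        out = []
        for ts in sorted(t for t in self.pending if t <= upto):
            values = self.pending.pop(ts)
            fields = [str(values.get(name, "U")) for name in self.ds_names]
            out.append(f"{ts}:" + ":".join(fields))
        return out


if __name__ == "__main__":
    c = UpdateCoalescer(["cpu", "mem", "disk"])
    c.add(1155727200, "cpu", 12.5)
    c.add(1155727200, "disk", 40)
    # "mem" has not arrived yet, so it is flushed as unknown
    print(c.flush(1155727200))  # ['1155727200:12.5:U:40']
```

The catch is the flush decision: rrdtool only accepts updates with
increasing timestamps, so a value that arrives after its timestamp has
been flushed can no longer be stored and stays unknown.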

> Or hell you could always just throw more spindles at it or throw a few 
> more $500 linux PCs at it, what do I care. :)

Throwing cheap PCs at the problem is kind of what I was thinking of :-)
I'd like to spread the RRD files across a number of cheap servers,
but in a way that makes it easy to add more servers if it becomes
necessary.
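
One possible way to get that property is rendezvous (highest random
weight) hashing. This is a generic technique, not something Hobbit or
rrdtool provides, and the server names and file paths below are made up
for the example. Each RRD file goes to the server with the highest hash
score for that file, so adding a server only moves the files that the
new server wins.

```python
import hashlib


def pick_server(rrd_name, servers):
    """Rendezvous hashing: return the server with the highest score for this file."""

    def score(server):
        # Hash the (server, file) pair; the first 8 bytes serve as the score.
        digest = hashlib.sha1(f"{server}|{rrd_name}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=score)


if __name__ == "__main__":
    # Hypothetical file names and server names, for illustration only
    files = [f"host{i}/cpu.rrd" for i in range(20000)]
    old = ["rrd1", "rrd2", "rrd3"]
    new = old + ["rrd4"]
    moved = [f for f in files if pick_server(f, old) != pick_server(f, new)]
    print(f"{len(moved)} of {len(files)} files moved")
    # Every file that moved went to the newly added server
    print(all(pick_server(f, new) == "rrd4" for f in moved))
```

With three servers growing to four, roughly a quarter of the files
should move, and none of them move between the existing servers. A
plain "hash modulo number of servers" scheme would instead reshuffle
most of the files whenever a server is added.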

Anyway, thanks for your comments. They reassure me that there isn't some
obvious solution that I've missed.


