Robert Haas wrote:
Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway.
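Just to make sure I'm reading that proposal right, the ordering would look something like this; everything below (the file names, write_all_pages, sync_file) is a made-up stand-in for illustration rather than anything from the backend:

#include <stdio.h>

/* Stand-in for writing out all dirty pages belonging to one file */
static void
write_all_pages(const char *file)
{
    printf("write dirty pages for %s\n", file);
}

/* Stand-in for fsync() of that file */
static void
sync_file(const char *file)
{
    printf("fsync %s\n", file);
}

int
main(void)
{
    /* Files with pending fsyncs but no dirty pages in this checkpoint */
    const char *other_pending[] = {"F_other1", "F_other2"};
    /* Files F1..Fn with dirty pages needing checkpoint writes */
    const char *checkpoint_files[] = {"F1", "F2", "F3"};
    int         n_other = sizeof(other_pending) / sizeof(other_pending[0]);
    int         n_ckpt = sizeof(checkpoint_files) / sizeof(checkpoint_files[0]);
    int         i;

    /* First, handle any pending fsyncs for files not among F1..Fn */
    for (i = 0; i < n_other; i++)
        sync_file(other_pending[i]);

    /* Then write all pages for each Fi and fsync it before moving on */
    for (i = 0; i < n_ckpt; i++)
    {
        write_all_pages(checkpoint_files[i]);
        sync_file(checkpoint_files[i]);
    }
    return 0;
}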
I'm not horribly interested in optimizing for the ext3 case per se, as I consider that filesystem fundamentally broken from the perspective of its ability to deliver low latency here. I wouldn't want a patch that improves behavior on filesystems with granular fsync to make the ext3 situation worse; that's about as far as I'd want the design to lean toward considering its quirks. Jeff Janes made a case downthread for "why not make it the admin/OS's job to worry about this?" In cases where there is a reasonable solution available, in the form of "switch to XFS or ext4", I'm happy to take that approach.
Let me throw some numbers out to give a better idea of the shape and magnitude of the problem case I've been working on here. In the situation that leads to the near hour-long sync phases I've seen, checkpoints start with about a 3GB backlog of data in the kernel write cache to deal with. That's about 4% of RAM, just under the 5% threshold set by dirty_background_ratio. Whether or not the 256MB write cache on the controller is also filled is a relatively minor detail I can't monitor easily. The checkpoint itself? <250MB each time. This proportion is why I didn't think to follow the alternate path of spacing the write and fsync calls out differently. I shrank shared_buffers to make the actual checkpoints smaller, which helped to some degree; that's what got them down below the RAID cache size. But the amount of data cached by the operating system is the real driver of total sync time here. Whether or not you finish all of the writes from the checkpoint itself before you start calling fsync didn't actually matter very much; in the case I've been chasing, those writes get cached anyway. The write storm from the fsync calls forcing everything out is what drives the I/O spikes, which is why I started with spacing those out.
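For anyone who wants to watch the same thing on their own system, something as simple as the following is enough to track that backlog; it just pulls the Dirty and Writeback lines out of /proc/meminfo, so it's Linux-specific and only a rough sketch, not part of any patch:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    FILE   *f = fopen("/proc/meminfo", "r");
    char    line[128];

    if (f == NULL)
    {
        perror("/proc/meminfo");
        return 1;
    }

    /* Print just the dirty-page and writeback counters */
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0)
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}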
Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog takes a minimum of 10 minutes of real time. There are about 300 1GB relation files involved in the case I've been chasing. This is where the 3 second delay number came from: 300 files, 3 seconds each, 900 seconds = 15 minutes of sync spread. You can turn that math around to figure out how much delay per relation you can afford while still finishing the checkpoint by a planned end time; the patch I submitted doesn't do that yet.
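Here's the same arithmetic written out, including the turned-around version that isn't in the submitted patch yet; all the figures are the ones quoted above for this one case, nothing measured by the code itself:

#include <stdio.h>

int
main(void)
{
    double  backlog_mb = 3 * 1024;      /* dirty data in the kernel cache */
    double  write_rate_mb_s = 5;        /* sustained write rate observed */
    int     relation_files = 300;       /* 1GB segments with pending fsyncs */

    /* Minimum time just to push the backlog out at that rate */
    double  min_flush_s = backlog_mb / write_rate_mb_s;

    /* The 3 second per-file delay discussed above */
    double  fixed_spread_s = relation_files * 3.0;

    /* Turned around: delay per file that fits a planned 900 second window */
    double  target_window_s = 900.0;
    double  delay_per_file_s = target_window_s / relation_files;

    printf("minimum flush time:        %.0f s (%.0f min)\n",
           min_flush_s, min_flush_s / 60);
    printf("fixed 3 s/file spread:     %.0f s (%.0f min)\n",
           fixed_spread_s, fixed_spread_s / 60);
    printf("delay per file for %.0f s: %.1f s\n",
           target_window_s, delay_per_file_s);
    return 0;
}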
Ultimately what I want to do here is some sort of smarter write-behind sync operation, perhaps with an LRU on relations with pending fsync requests. The idea would be to sync relations that haven't been touched in a while even before the checkpoint starts. I think that's similar to the general idea Robert is suggesting here: get some sync calls flowing before all of the checkpoint writes have happened. The final sync calls will need to be spread out regardless, and since doing that requires a fairly small amount of code, that's where we started.
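Very roughly, the shape I have in mind looks like the sketch below. The file names, the 60 second idle threshold, and the helper names are all invented for illustration; a real version would have to live next to the existing fsync request absorption code rather than stand alone like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define IDLE_SYNC_THRESHOLD 60  /* seconds a file must sit untouched */

typedef struct PendingFsync
{
    const char *path;           /* relation segment file */
    time_t      last_write;     /* when we last dirtied it */
} PendingFsync;

/* Stand-in for open() + fsync() of the real segment file */
static void
sync_relation(const PendingFsync *p)
{
    printf("early sync of %s\n", p->path);
}

/* Order by last write time, least recently written first */
static int
cmp_last_write(const void *a, const void *b)
{
    const PendingFsync *pa = a;
    const PendingFsync *pb = b;

    if (pa->last_write < pb->last_write)
        return -1;
    if (pa->last_write > pb->last_write)
        return 1;
    return 0;
}

/*
 * Sync pending files in least-recently-written order, stopping at the
 * first one that is still being actively written to, so the eventual
 * checkpoint has less left to push out.
 */
static void
absorb_idle_fsyncs(PendingFsync *pending, int n, time_t now)
{
    int         i;

    qsort(pending, n, sizeof(PendingFsync), cmp_last_write);
    for (i = 0; i < n; i++)
    {
        if (now - pending[i].last_write <= IDLE_SYNC_THRESHOLD)
            break;              /* everything after this is hotter still */
        sync_relation(&pending[i]);
    }
}

int
main(void)
{
    time_t      now = time(NULL);
    PendingFsync pending[] = {
        {"base/16384/16402",   now - 10},    /* still hot, left alone */
        {"base/16384/16397",   now - 300},   /* idle, gets synced early */
        {"base/16384/16397.1", now - 120},   /* idle, gets synced early */
        {"base/16384/16410",   now - 5}      /* still hot, left alone */
    };

    absorb_idle_fsyncs(pending, 4, now);
    return 0;
}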
--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books