Robert Haas wrote:
Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway.
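Just to make sure I'm reading that proposal right, the ordering would look something like this; everything below (the file names, write_all_pages, sync_file) is a made-up stand-in for illustration rather than anything from the backend:

#include <stdio.h>

/* Stand-in for writing out all dirty pages belonging to one file */
static void
write_all_pages(const char *file)
{
    printf("write dirty pages for %s\n", file);
}

/* Stand-in for fsync() of that file */
static void
sync_file(const char *file)
{
    printf("fsync %s\n", file);
}

int
main(void)
{
    /* Files with pending fsyncs but no dirty pages in this checkpoint */
    const char *other_pending[] = {"F_other1", "F_other2"};
    /* Files F1..Fn with dirty pages needing checkpoint writes */
    const char *checkpoint_files[] = {"F1", "F2", "F3"};
    int         n_other = sizeof(other_pending) / sizeof(other_pending[0]);
    int         n_ckpt = sizeof(checkpoint_files) / sizeof(checkpoint_files[0]);
    int         i;

    /* First, handle any pending fsyncs for files not among F1..Fn */
    for (i = 0; i < n_other; i++)
        sync_file(other_pending[i]);

    /* Then write all pages for each Fi and fsync it before moving on */
    for (i = 0; i < n_ckpt; i++)
    {
        write_all_pages(checkpoint_files[i]);
        sync_file(checkpoint_files[i]);
    }
    return 0;
}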
I'm not horribly interested in optimizing for the ext3 case per se, as I consider that filesystem fundamentally broken from the perspective of its ability to deliver low latency here. I wouldn't want a patch that improves behavior on filesystems with granular fsync to make the ext3 situation worse; that's about as far as I'd want the design to lean toward considering its quirks. Jeff Janes made a case downthread for "why not make it the admin/OS's job to worry about this?" In cases where there is a reasonable solution available, in the form of "switch to XFS or ext4", I'm happy to take that approach.
Let me throw some numbers out to give a better idea of the shape and magnitude of the problem case I've been working on here. In the situation that leads to the near hour-long sync phases I've seen, checkpoints start with about a 3GB backlog of data in the kernel write cache to deal with. That's about 4% of RAM, just under the 5% threshold set by dirty_background_ratio. Whether or not the 256MB write cache on the controller is also filled is a relatively minor detail I can't monitor easily. The checkpoint itself? <250MB each time. This proportion is why I didn't think to follow the alternate path of spacing the write and fsync calls out differently. I shrank shared_buffers to make the actual checkpoints smaller, which helped to some degree; that's what got them down below the RAID cache size. But the amount of data cached by the operating system is the real driver of total sync time here. Whether or not you finish all of the writes from the checkpoint itself before you start calling fsync didn't actually matter very much; in the case I've been chasing, those writes get cached anyway. The write storm from the fsync calls forcing everything out is what drives the I/O spikes, which is why I started with spacing those out.
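For anyone who wants to watch the same thing on their own system, something as simple as the following is enough to track that backlog; it just pulls the Dirty and Writeback lines out of /proc/meminfo, so it's Linux-specific and only a rough sketch, not part of any patch:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    FILE   *f = fopen("/proc/meminfo", "r");
    char    line[128];

    if (f == NULL)
    {
        perror("/proc/meminfo");
        return 1;
    }

    /* Print just the dirty-page and writeback counters */
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0)
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}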
Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog takes a minimum of 10 minutes of real time. There are about 300 1GB relation files involved in the case I've been chasing. This is where the 3 second delay number came from: 300 files, 3 seconds each, 900 seconds = 15 minutes of sync spread. You can turn that math around to figure out how much delay per relation you can afford while still finishing the checkpoint by a planned end time; the patch I submitted doesn't do that yet.
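Here's the same arithmetic written out, including the turned-around version that isn't in the submitted patch yet; all the figures are the ones quoted above for this one case, nothing measured by the code itself:

#include <stdio.h>

int
main(void)
{
    double  backlog_mb = 3 * 1024;      /* dirty data in the kernel cache */
    double  write_rate_mb_s = 5;        /* sustained write rate observed */
    int     relation_files = 300;       /* 1GB segments with pending fsyncs */

    /* Minimum time just to push the backlog out at that rate */
    double  min_flush_s = backlog_mb / write_rate_mb_s;

    /* The 3 second per-file delay discussed above */
    double  fixed_spread_s = relation_files * 3.0;

    /* Turned around: delay per file that fits a planned 900 second window */
    double  target_window_s = 900.0;
    double  delay_per_file_s = target_window_s / relation_files;

    printf("minimum flush time:        %.0f s (%.0f min)\n",
           min_flush_s, min_flush_s / 60);
    printf("fixed 3 s/file spread:     %.0f s (%.0f min)\n",
           fixed_spread_s, fixed_spread_s / 60);
    printf("delay per file for %.0f s: %.1f s\n",
           target_window_s, delay_per_file_s);
    return 0;
}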
Ultimately what I want to do here is some sort of smarter write-behind sync operation, perhaps with an LRU on relations with pending fsync requests. The idea would be to sync relations that haven't been touched in a while even before the checkpoint starts. I think that's similar to the general idea Robert is suggesting here: get some sync calls flowing before all of the checkpoint writes have happened. The final sync calls will need to be spread out regardless, and since doing that requires a fairly small amount of code, that's where we started.
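Very roughly, the shape I have in mind looks like the sketch below. The file names, the 60 second idle threshold, and the helper names are all invented for illustration; a real version would have to live next to the existing fsync request absorption code rather than stand alone like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define IDLE_SYNC_THRESHOLD 60  /* seconds a file must sit untouched */

typedef struct PendingFsync
{
    const char *path;           /* relation segment file */
    time_t      last_write;     /* when we last dirtied it */
} PendingFsync;

/* Stand-in for open() + fsync() of the real segment file */
static void
sync_relation(const PendingFsync *p)
{
    printf("early sync of %s\n", p->path);
}

/* Order by last write time, least recently written first */
static int
cmp_last_write(const void *a, const void *b)
{
    const PendingFsync *pa = a;
    const PendingFsync *pb = b;

    if (pa->last_write < pb->last_write)
        return -1;
    if (pa->last_write > pb->last_write)
        return 1;
    return 0;
}

/*
 * Sync pending files in least-recently-written order, stopping at the
 * first one that is still being actively written to, so the eventual
 * checkpoint has less left to push out.
 */
static void
absorb_idle_fsyncs(PendingFsync *pending, int n, time_t now)
{
    int         i;

    qsort(pending, n, sizeof(PendingFsync), cmp_last_write);
    for (i = 0; i < n; i++)
    {
        if (now - pending[i].last_write <= IDLE_SYNC_THRESHOLD)
            break;              /* everything after this is hotter still */
        sync_relation(&pending[i]);
    }
}

int
main(void)
{
    time_t      now = time(NULL);
    PendingFsync pending[] = {
        {"base/16384/16402",   now - 10},    /* still hot, left alone */
        {"base/16384/16397",   now - 300},   /* idle, gets synced early */
        {"base/16384/16397.1", now - 120},   /* idle, gets synced early */
        {"base/16384/16410",   now - 5}      /* still hot, left alone */
    };

    absorb_idle_fsyncs(pending, 4, now);
    return 0;
}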
--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books