On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <g...@2ndquadrant.com> wrote:
>>>
>>> We know that a 1GB relation segment can take a really long time to write
>>> out. That could include up to 128 changed 8K pages, and we allow all of
>>> them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better! I can pinpoint exactly what time last night I got tired
> enough to start making trivial mistakes. Everywhere I said 128 it's
> actually 131,072, which just changes the range of the GUC I proposed.
>
> Getting the number right really highlights just how bad the current
> situation is. Would you expect the database to dump up to 128K writes
> into a file and then have low latency when it's flushed to disk with
> fsync? Of course not. But that's the job the checkpointer process is
> trying to do right now. And it's doing it blind--it has no idea how
> many dirty pages might have accumulated before it started.
>
> I'm not exactly sure how best to use the information collected. fsync
> every N writes is one approach. Another is to use accumulated writes to
> predict how long fsync on that relation should take. Whenever I tried
> to spread fsync calls out before, the scale of the piled-up writes from
> backends was the input I really wanted available. The segment write
> count gives an alternate way to sort the blocks too; you might start
> with the heaviest-hit ones.
>
> In all these cases, the fundamental thing I keep coming back to is
> wanting to cue off past write statistics. If you want to predict
> relative I/O delay times with any hope of accuracy, you have to start
> the checkpoint knowing something about the backend and background
> writer activity since the last one.
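To make the "fsync every N writes" option in the quoted text concrete, here
is a minimal standalone C sketch. It is illustrative only: SYNC_AFTER_WRITES
and counted_write() are invented names, not PostgreSQL APIs, and a real
patch would hook the md.c/smgr write path rather than wrap pwrite() directly.

    /*
     * Sketch of "fsync every N writes": count OS writes per segment and
     * force a flush whenever the counter crosses a threshold, so no single
     * fsync ever has to drain an unbounded backlog.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ            8192  /* PostgreSQL's block size */
    #define SYNC_AFTER_WRITES 4096  /* hypothetical GUC: flush after N writes */

    typedef struct SegmentWriteState
    {
        int     fd;                 /* descriptor for one relation segment */
        int     writes_since_sync;  /* OS writes since the last fsync */
    } SegmentWriteState;

    /* Write one block; force an fsync once enough writes have piled up. */
    static ssize_t
    counted_write(SegmentWriteState *seg, const char *buf, off_t offset)
    {
        ssize_t written = pwrite(seg->fd, buf, BLCKSZ, offset);

        if (written == BLCKSZ && ++seg->writes_since_sync >= SYNC_AFTER_WRITES)
        {
            fsync(seg->fd);         /* bound the backlog a checkpoint inherits */
            seg->writes_since_sync = 0;
        }
        return written;
    }

    int
    main(void)
    {
        char    block[BLCKSZ];
        SegmentWriteState seg;
        int     i;

        memset(block, 'x', sizeof(block));
        seg.fd = open("segment.demo", O_CREAT | O_WRONLY, 0600);
        seg.writes_since_sync = 0;
        if (seg.fd < 0)
            return 1;

        /* Dirty many blocks; interim fsyncs fire every N writes. */
        for (i = 0; i < 10000; i++)
            counted_write(&seg, block, (off_t) i * BLCKSZ);

        fsync(seg.fd);              /* final flush, as the checkpointer would */
        close(seg.fd);
        return 0;
    }

The threshold caps the worst case: instead of one fsync covering up to
131,072 dirty pages, no fsync ever covers more than SYNC_AFTER_WRITES of them.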
So, I don't think this is a bad idea; in fact, I think it'd be a good
thing to explore. The hard part is likely to be convincing ourselves of
anything about how well or poorly it works on arbitrary hardware under
arbitrary workloads, but we've got to keep trying things until we find
something that works well, so why not this?

One general observation is that there are two bad things that happen when
we checkpoint. One is that we force all of the data in RAM out to disk,
and the other is that we start doing lots of FPIs. Both of these things
harm throughput. Your proposal allows the user to make the first of those
behaviors more frequent without making the second one more frequent. That
idea seems promising, and it also seems to admit of many variations. For
example, instead of issuing an fsync after N OS writes to a particular
file, we could fsync the file with the most writes every K seconds. That
way, if the system has busy and idle periods, we'll effectively "catch up
on our fsyncs" when the system isn't that busy, and we won't bunch them
up too much if there's a sudden surge of activity.

Now, that's just a shot in the dark, and there might be reasons why it's
terrible, but I offer it as food for thought: the triggering event for
the extra fsyncs could be chosen via any of a multitude of algorithms,
and as you hack through this it might be worth trying a few different
possibilities (a sketch of this time-driven variant appears below).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
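As a concrete illustration of the time-driven variant above, here is a
minimal standalone C sketch: every K seconds, fsync the one file with the
most accumulated writes. Again, it is a sketch under stated assumptions:
SYNC_INTERVAL_SEC and pick_busiest_segment() are invented names, and the
write backlog is faked rather than fed from the buffer manager.

    /*
     * Sketch of "fsync the busiest file every K seconds": keep a per-file
     * write counter and periodically flush only the file with the largest
     * unflushed backlog.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_SEGMENTS      4
    #define SYNC_INTERVAL_SEC 5     /* hypothetical "every K seconds" knob */

    typedef struct SegCounter
    {
        int     fd;                 /* descriptor for one relation segment */
        int     writes_since_sync;  /* OS writes since the last fsync */
    } SegCounter;

    static SegCounter segments[NUM_SEGMENTS];

    /* Return the segment with the largest backlog, or NULL if all clean. */
    static SegCounter *
    pick_busiest_segment(void)
    {
        SegCounter *busiest = NULL;
        int         i;

        for (i = 0; i < NUM_SEGMENTS; i++)
        {
            if (segments[i].writes_since_sync > 0 &&
                (busiest == NULL ||
                 segments[i].writes_since_sync > busiest->writes_since_sync))
                busiest = &segments[i];
        }
        return busiest;
    }

    int
    main(void)
    {
        char    name[32];
        int     i;

        /* Open demo segments and fake an accumulated write count for each. */
        for (i = 0; i < NUM_SEGMENTS; i++)
        {
            snprintf(name, sizeof(name), "seg.%d", i);
            segments[i].fd = open(name, O_CREAT | O_WRONLY, 0600);
            segments[i].writes_since_sync = (i + 1) * 100;  /* pretend backlog */
        }

        /* Every K seconds, flush only the single busiest segment. */
        for (;;)
        {
            SegCounter *seg = pick_busiest_segment();

            if (seg == NULL)
                break;              /* nothing dirty left to drain */
            fsync(seg->fd);
            seg->writes_since_sync = 0;
            printf("flushed fd %d\n", seg->fd);
            sleep(SYNC_INTERVAL_SEC);
        }

        for (i = 0; i < NUM_SEGMENTS; i++)
            close(segments[i].fd);
        return 0;
    }

During idle periods a loop like this steadily drains the backlog, so a
later checkpoint finds far fewer piled-up writes to flush all at once.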