On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith <g...@2ndquadrant.com> wrote:
> Once upon a time we got a patch from Itagaki Takahiro whose purpose was to
> sort writes before sending them out:
>
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

Ah, a fine idea!

> Which has very low odds of the sync on "a" finishing quickly, we'd get this
> one:
>
> table block
> a 1
> a 2
> b 1
> b 2
> c 1
> c 2
> sync a
> sync b
> sync c
>
> Which sure seems like a reasonable way to improve the odds data has been
> written before the associated sync comes along.

I'll believe it when I see it.  How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a
certain number of seconds after writing the last block, or when all
the writes are done, whichever comes first.  It seems to me that it's
going to be a bear to figure out what fraction of the checkpoint
you've completed if you put all of the syncs at the end, and this
whole problem appears to be predicated on the assumption that the OS
*isn't* writing out in a timely fashion.  Are we sure that postponing
the fsync relative to the writes is anything more than wishful
thinking?

> Also, I could just traverse the sorted list with some simple logic to count
> the number of unique files, and then set the delay between fsync writes
> based on it.  In the above, once the list was sorted, easy to just see how
> many times the table name changes on a linear scan of the sorted data.  3
> files, so if the checkpoint target gives me, say, a minute of time to sync
> them, I can delay 20 seconds between.  Simple math, and exactly the sort I

How does the checkpoint target give you any time to sync them?  Unless
you squeeze the writes together more tightly, but that seems sketchy.

> So I fixed the bitrot on the old sorted patch, which was fun as it came from
> before the 8.3 changes.  It seemed to work.  I then moved the structure it
> uses to hold the list of buffers to write, the thing that's sorted, into
> shared memory.  It's got a predictable maximum size, relying on palloc in
> the middle of the checkpoint code seems bad, and there's some potential gain
> from not reallocating it every time through.

Well you don't have to put it in shared memory on account of any of
that.  You can just hang it on a global variable.

> There's good bits in the patch I submitted for the last CF and in the patch
> you wrote earlier this week.  This unfinished patch may be a valuable idea
> to fit in there too once I fix it, or maybe it's fundamentally flawed and
> one of the other ideas you suggested (or I have sitting on the potential
> design list) will work better.  There's a patch integration problem that
> needs to be solved here, but I think almost all the individual pieces are
> available.  I'd hate to see this fail to get integrated now just for lack of
> time, considering the problem is so serious when you run into it.

Likewise, but committing something half-baked is no good either.  I
think we're in a position to crush the full-fsync-queue problem flat
(my patch should do that, and there are several other obvious things
we can do for extra certainty) but the problem of spreading out the
fsyncs looks to me like something we don't completely know how to
solve.  If we can find something that's a modest improvement on the
status quo and we can be confident in quickly, good, but I'd rather
have 9.1 go out the door on time without fully fixing this than delay
the release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
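
For illustration only, the interleaved ordering proposed in this message
(sync each file right after its last write) and the quoted delay arithmetic
(count file changes on a linear scan, then divide the sync window) can be
sketched in Python. This is not code from either patch. The function names,
the (file, block) tuple layout, and the 60-second window are all assumptions
made up for the example, and nothing here touches real buffers or calls
fsync.

```python
from itertools import groupby

# Hypothetical sorted write list: (file name, block number) pairs,
# matching the a/b/c example in the thread.
writes = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 1), ("c", 2)]


def interleaved_schedule(sorted_writes):
    """Return the operations in order, syncing each file right after
    its last write instead of batching every sync at the end."""
    ops = []
    for name, group in groupby(sorted_writes, key=lambda w: w[0]):
        for _, block in group:
            ops.append(("write", name, block))
        ops.append(("sync", name))
    return ops


def count_unique_files(sorted_writes):
    """Count how often the file name changes on one linear scan.
    Only correct when the input is already sorted by file name."""
    count = 0
    previous = None
    for name, _ in sorted_writes:
        if name != previous:
            count += 1
            previous = name
    return count


def sync_delay(sorted_writes, sync_window_seconds):
    """Spread the syncs evenly across an assumed time window."""
    files = count_unique_files(sorted_writes)
    return sync_window_seconds / files if files else 0.0


if __name__ == "__main__":
    for op in interleaved_schedule(writes):
        print(op)
    print(sync_delay(writes, 60))
```

With three files and an assumed one-minute window this reproduces the
20-second spacing from the quoted text. Whether the checkpoint target
actually leaves such a window is exactly the open question raised above,
and the sketch does not answer it.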