On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
> > On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> >> A possible compromise is to sort a limited number of buffers --- say,
> >> collect a few thousand dirty buffers then sort, dump and fsync them,
> >> repeat as needed.
>
> > Yea, that's what I suggested nearby. But I don't really like it, because
> > it robs us of the chance to fsync() a relfilenode immediately after
> > having synced all its buffers.
>
> Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
> once per sort-group and not just once per checkpoint, but it's not clear
> that that's a loss as long as the group size isn't tiny.
Because it wouldn't have the benefit of syncing the minimal amount of data
anymore. If lots of other relfilenodes have been synced in between, the
amount of newly dirtied pages in the OS's buffer cache (written by backends,
bgwriter) for an individual relfilenode is much higher. An fsync() on a file
with dirty data often causes *serious* latency spikes - we should try hard
to avoid superfluous calls.

As an example: Calling fsync() on pgbench_accounts's underlying files, from
outside postgres, *before* postgres even started its first checkpoint, does
this:

progress: 72.0 s, 4324.9 tps, lat 41.481 ms stddev 40.567
progress: 73.0 s, 4704.9 tps, lat 38.465 ms stddev 35.436
progress: 74.0 s, 4448.5 tps, lat 40.058 ms stddev 32.634
progress: 75.0 s, 4634.5 tps, lat 39.229 ms stddev 33.463
progress: 76.8 s, 2753.1 tps, lat 48.693 ms stddev 75.309
progress: 77.1 s, 126.6 tps, lat 773.433 ms stddev 222.667
progress: 78.0 s, 183.7 tps, lat 786.401 ms stddev 395.954
progress: 79.1 s, 170.3 tps, lat 975.949 ms stddev 596.751
progress: 80.0 s, 2116.6 tps, lat 168.608 ms stddev 398.933
progress: 81.0 s, 4436.1 tps, lat 40.313 ms stddev 34.198
progress: 82.0 s, 4383.9 tps, lat 41.811 ms stddev 37.241

Note the dip from 4k tps to 130 tps. We can get a handle on that (on some
platforms at least) for writes issued during the buffer sync by forcing the
kernel to write out the pages in small increments (see the sketch below);
but I doubt we want to do that for writes by backends themselves.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
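The combination discussed above - sort dirty buffers, write each relfilenode's
pages in order, nudge the kernel into writeback in small increments, and
fsync() each file right after its last page is written - could look roughly
like the following. This is a minimal illustrative sketch only, not code from
this thread or from PostgreSQL: it assumes Linux (sync_file_range(2)), and
names such as DirtyBuf, flush_sorted_buffers and WRITEBACK_BATCH are made up
for the example.

/*
 * Illustrative sketch, assuming Linux's sync_file_range(2) is available.
 * DirtyBuf, flush_sorted_buffers() and WRITEBACK_BATCH are hypothetical
 * names invented for this example.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ          8192
#define WRITEBACK_BATCH 32      /* pages written before hinting writeback */

typedef struct DirtyBuf
{
    int     fd;                 /* stand-in for a relfilenode */
    long    blockno;
    char   *page;               /* BLCKSZ bytes of data to write */
} DirtyBuf;

static int
dirtybuf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = a, *y = b;

    if (x->fd != y->fd)
        return (x->fd < y->fd) ? -1 : 1;
    if (x->blockno != y->blockno)
        return (x->blockno < y->blockno) ? -1 : 1;
    return 0;
}

/* Write all dirty buffers, one file at a time, fsyncing each file once. */
static int
flush_sorted_buffers(DirtyBuf *bufs, size_t nbufs)
{
    size_t  i;
    size_t  since_hint = 0;

    /* Sort by (file, block) so each file's pages are written contiguously. */
    qsort(bufs, nbufs, sizeof(DirtyBuf), dirtybuf_cmp);

    for (i = 0; i < nbufs; i++)
    {
        off_t   offset = (off_t) bufs[i].blockno * BLCKSZ;

        if (pwrite(bufs[i].fd, bufs[i].page, BLCKSZ, offset) != BLCKSZ)
            return -1;

        /* Push writeback in small increments instead of one huge burst. */
        if (++since_hint >= WRITEBACK_BATCH)
        {
            sync_file_range(bufs[i].fd, 0, 0, SYNC_FILE_RANGE_WRITE);
            since_hint = 0;
        }

        /* Last page of this file? fsync it now, while little else is dirty. */
        if (i + 1 == nbufs || bufs[i + 1].fd != bufs[i].fd)
        {
            if (fsync(bufs[i].fd) != 0)
                return -1;
            since_hint = 0;
        }
    }
    return 0;
}

The point of the per-file fsync() placement is the one Andres makes above: the
fewer unrelated pages the kernel has accumulated since a file's buffers were
written, the less data that fsync() has to force out, and the smaller the
latency spike.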