On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and
> write them out in order
> (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
> The performance impact of that was inconclusive, but one thing it allows
> nicely is to interleave the fsyncs, so that you write all the buffers for
> one file, then fsync it, then the next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly
> large amount of memory, and making a large allocation in the checkpointer
> might cause an out-of-memory, which would be bad.
>
> I don't think anyone's seriously worked on this area since. If the impact
> on responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.
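For reference, the core of that approach as a rough standalone sketch, not
the actual patch; all types and helpers below are simplified stand-ins for
the real BufferTag machinery:

/*
 * Sketch of the sort-then-sync idea: remember the tag of every dirty
 * buffer at checkpoint start, sort by file and block number so writes
 * hit each file sequentially, and fsync each file as soon as its last
 * buffer has been written, interleaving the writes and the fsyncs.
 */
#include <stdlib.h>

typedef struct SortItem
{
	unsigned int relNode;		/* which relation file */
	unsigned int blockNum;		/* block within that file */
	int			buf_id;			/* dirty buffer to write */
} SortItem;

static int
item_cmp(const void *pa, const void *pb)
{
	const SortItem *a = (const SortItem *) pa;
	const SortItem *b = (const SortItem *) pb;

	if (a->relNode != b->relNode)
		return a->relNode < b->relNode ? -1 : 1;
	if (a->blockNum != b->blockNum)
		return a->blockNum < b->blockNum ? -1 : 1;
	return 0;
}

/* hypothetical I/O helpers, standing in for smgr/md machinery */
extern void write_buffer(int buf_id);
extern void fsync_rel(unsigned int relNode);

static void
write_sorted_and_sync(SortItem *items, int n)
{
	int			i;

	qsort(items, n, sizeof(SortItem), item_cmp);

	for (i = 0; i < n; i++)
	{
		write_buffer(items[i].buf_id);

		/* last dirty buffer of this file?  then fsync it right away */
		if (i + 1 == n || items[i + 1].relNode != items[i].relNode)
			fsync_rel(items[i].relNode);
	}
}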
I've dug up that patch (after a bit of fighting with the archives) and
refreshed it. It's *clearly* beneficial:

master:

andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219
...
progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720
progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228
progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546
progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740
progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934
..
progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685
progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342
progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916
progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448
progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339
progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340
progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648
progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297
progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224
progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901
progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576
progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520
progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346
progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670
progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411
progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096
progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690
progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366
progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900
progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774
progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501
progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921
progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426
progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247
progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082
progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577
progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611
progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498
progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970
progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848
progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592
progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851
progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976
progress: 122.0 s, 3398.1 tps, lat 51.322 ms stddev 48.057
progress: 123.0 s, 3721.9 tps, lat 50.355 ms stddev 46.994
progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553
progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508
progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053
...
progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210
progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411
progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650
progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596
progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174
progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444
progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611
progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121
progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377
progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927
progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472
progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491
progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754
progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857
progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488
progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162
progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952
progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539
progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094
progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200
progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev 552.133
progress: 165.5 s, 9950.2 tps, lat 1424.984 ms stddev 240.249
progress: 166.4 s, 137.3 tps, lat 1056.184 ms stddev 765.356
progress: 167.2 s, 249.7 tps, lat 793.423 ms stddev 792.666
progress: 168.6 s, 194.0 tps, lat 851.712 ms stddev 737.653
progress: 169.2 s, 136.0 tps, lat 1229.394 ms stddev 1227.723
progress: 170.6 s, 92.3 tps, lat 1569.107 ms stddev 1517.027
progress: 171.2 s, 276.5 tps, lat 1325.410 ms stddev 1252.237
progress: 172.0 s, 3881.2 tps, lat 91.122 ms stddev 366.052
progress: 174.4 s, 995.6 tps, lat 120.200 ms stddev 242.005
progress: 174.4 s, 16227.2 tps, lat 651.361 ms stddev 527.946
progress: 175.1 s, 200.0 tps, lat 808.114 ms stddev 542.989
progress: 176.1 s, 179.4 tps, lat 1024.893 ms stddev 724.887
progress: 177.4 s, 367.0 tps, lat 648.519 ms stddev 790.413
progress: 178.4 s, 2236.7 tps, lat 114.064 ms stddev 271.525
progress: 179.0 s, 3643.5 tps, lat 48.404 ms stddev 47.793
progress: 180.0 s, 3309.0 tps, lat 55.598 ms stddev 49.211
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 496443
latency average: 65.245 ms
latency stddev: 134.292 ms
tps = 2756.518300 (including connections establishing)
tps = 2756.718588 (excluding connections establishing)
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 139127 buffers (21.2%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=53.660 s, sync=6.739 s, total=60.611 s; sync files=18, longest=4.382 s, average=0.374 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 70472 buffers (10.8%); 0 transaction log file(s) added, 0 removed, 70 recycled; write=53.107 s, sync=5.375 s, total=58.605 s; sync files=15, longest=3.243 s, average=0.358 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 80357 buffers (12.3%); 0 transaction log file(s) added, 0 removed, 49 recycled; write=53.526 s, sync=3.896 s, total=57.573 s; sync files=15, longest=1.206 s, average=0.259 s

sorted writes:

starting vacuum...end.
progress: 1.0 s, 3237.2 tps, lat 49.544 ms stddev 42.203
..
progress: 88.0 s, 3030.8 tps, lat 47.438 ms stddev 46.563
progress: 89.3 s, 919.0 tps, lat 204.152 ms stddev 172.776
progress: 90.1 s, 973.7 tps, lat 204.650 ms stddev 197.843
progress: 91.1 s, 875.5 tps, lat 200.472 ms stddev 157.295
progress: 92.0 s, 939.5 tps, lat 185.357 ms stddev 146.362
progress: 93.1 s, 975.8 tps, lat 184.445 ms stddev 167.520
progress: 94.9 s, 673.0 tps, lat 260.900 ms stddev 224.125
progress: 95.0 s, 1057.6 tps, lat 223.717 ms stddev 260.276
progress: 96.0 s, 917.7 tps, lat 201.704 ms stddev 220.395
progress: 97.0 s, 2573.0 tps, lat 78.913 ms stddev 119.458
progress: 98.0 s, 3921.5 tps, lat 44.160 ms stddev 39.147
progress: 99.0 s, 4033.3 tps, lat 43.972 ms stddev 45.921
..
progress: 110.0 s, 4057.8 tps, lat 44.283 ms stddev 47.003
progress: 111.0 s, 897.6 tps, lat 158.250 ms stddev 132.126
progress: 112.0 s, 751.6 tps, lat 249.282 ms stddev 232.964
progress: 113.0 s, 550.6 tps, lat 296.155 ms stddev 226.163
progress: 114.0 s, 1875.3 tps, lat 123.059 ms stddev 210.661
progress: 115.1 s, 4555.4 tps, lat 39.404 ms stddev 32.331
..
progress: 151.0 s, 4240.2 tps, lat 42.158 ms stddev 38.577
progress: 152.1 s, 2551.3 tps, lat 63.584 ms stddev 73.780
progress: 153.1 s, 912.5 tps, lat 189.030 ms stddev 146.658
progress: 154.2 s, 976.2 tps, lat 192.499 ms stddev 158.108
progress: 155.0 s, 1005.1 tps, lat 178.215 ms stddev 129.062
progress: 156.0 s, 1010.1 tps, lat 175.583 ms stddev 138.823
progress: 157.0 s, 1042.0 tps, lat 161.675 ms stddev 135.350
progress: 158.1 s, 766.8 tps, lat 216.367 ms stddev 234.374
progress: 159.5 s, 1025.7 tps, lat 199.366 ms stddev 228.530
progress: 160.0 s, 1178.6 tps, lat 159.156 ms stddev 130.660
progress: 161.0 s, 2904.1 tps, lat 65.751 ms stddev 79.928
progress: 162.0 s, 4189.6 tps, lat 43.492 ms stddev 38.490
..
progress: 170.0 s, 4150.9 tps, lat 41.998 ms stddev 40.931
progress: 171.1 s, 595.5 tps, lat 233.138 ms stddev 188.073
progress: 172.0 s, 622.6 tps, lat 245.023 ms stddev 279.715
progress: 173.0 s, 588.1 tps, lat 359.104 ms stddev 315.367
progress: 174.0 s, 3011.0 tps, lat 74.811 ms stddev 172.972
progress: 175.0 s, 4829.0 tps, lat 37.520 ms stddev 30.578
progress: 176.0 s, 4324.1 tps, lat 41.388 ms stddev 37.727
progress: 177.0 s, 3632.8 tps, lat 49.219 ms stddev 47.092
progress: 178.0 s, 3621.0 tps, lat 50.332 ms stddev 43.086
progress: 179.0 s, 3468.3 tps, lat 50.969 ms stddev 43.025
progress: 180.0 s, 3710.7 tps, lat 48.782 ms stddev 49.070
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 634014
latency average: 51.089 ms
latency stddev: 63.864 ms
number of transactions above the 100.0 ms latency limit: 66921
tps = 3520.574003 (excluding connections establishing)
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 135453 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=53.419 s, sync=3.665 s, total=57.390 s; sync files=17, longest=1.937 s, average=0.215 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 96907 buffers (14.8%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.147 s, sync=3.369 s, total=56.633 s; sync files=15, longest=1.920 s, average=0.224 s

sorted writes + flush in SyncOneBuffer():

andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 3044.7 tps, lat 49.030 ms stddev 45.608
progress: 2.0 s, 3701.3 tps, lat 48.231 ms stddev 44.762
...
progress: 55.0 s, 4157.1 tps, lat 43.216 ms stddev 40.821
progress: 56.0 s, 4736.3 tps, lat 38.326 ms stddev 35.699
progress: 57.0 s, 2604.9 tps, lat 66.072 ms stddev 65.534
progress: 58.0 s, 1659.8 tps, lat 113.848 ms stddev 121.625
progress: 59.0 s, 2625.2 tps, lat 66.112 ms stddev 54.349
..
progress: 106.0 s, 2571.8 tps, lat 72.806 ms stddev 83.308
progress: 107.0 s, 1818.7 tps, lat 93.315 ms stddev 103.284
progress: 108.1 s, 1876.6 tps, lat 97.447 ms stddev 103.151
progress: 109.1 s, 1768.8 tps, lat 110.406 ms stddev 121.347
progress: 110.0 s, 3129.4 tps, lat 56.100 ms stddev 54.835
progress: 111.0 s, 4402.0 tps, lat 40.581 ms stddev 42.269
progress: 112.0 s, 4379.3 tps, lat 42.625 ms stddev 39.507
progress: 113.0 s, 4551.6 tps, lat 39.303 ms stddev 30.563
progress: 114.0 s, 4383.3 tps, lat 41.169 ms stddev 37.459
progress: 115.1 s, 4652.3 tps, lat 38.769 ms stddev 33.415
progress: 116.1 s, 4015.5 tps, lat 41.939 ms stddev 34.122
progress: 117.1 s, 2159.0 tps, lat 84.653 ms stddev 81.445
progress: 118.0 s, 2971.3 tps, lat 62.874 ms stddev 62.229
progress: 119.0 s, 2825.1 tps, lat 60.875 ms stddev 54.767
progress: 120.0 s, 2542.9 tps, lat 75.276 ms stddev 68.709
progress: 121.1 s, 2650.4 tps, lat 63.833 ms stddev 56.419
progress: 122.0 s, 2766.8 tps, lat 68.196 ms stddev 67.042
..
progress: 166.0 s, 2412.4 tps, lat 71.307 ms stddev 74.961
progress: 167.1 s, 1742.5 tps, lat 99.519 ms stddev 102.500
progress: 168.1 s, 1324.2 tps, lat 122.625 ms stddev 135.177
progress: 169.0 s, 2002.5 tps, lat 108.435 ms stddev 125.576
progress: 170.0 s, 2796.6 tps, lat 64.867 ms stddev 64.665
progress: 171.0 s, 3969.1 tps, lat 45.520 ms stddev 41.656
...
progress: 180.0 s, 3003.9 tps, lat 59.134 ms stddev 51.508
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 584725
latency average: 55.375 ms
latency stddev: 56.202 ms
number of transactions above the 100.0 ms latency limit: 76644
tps = 3246.299654 (including connections establishing)
tps = 3246.607040 (excluding connections establishing)
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 135474 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=53.260 s, sync=0.079 s, total=53.577 s; sync files=19, longest=0.032 s, average=0.004 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 92770 buffers (14.2%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.677 s, sync=0.239 s, total=54.026 s; sync files=16, longest=0.093 s, average=0.014 s

So, *very* clearly, sorting is a benefit: the dips in performance are
shallower and the aggregate throughput is higher. Additionally, flushing
buffers as they are written out during the checkpoint is a clear win from
the perspective of consistent performance, although it costs a noticeable
amount of throughput. Much less than without sorting, though.

Now, I've used an absurdly small checkpoint_timeout here to have the time
to run these benchmarks, but I'm damn sure the picture is similar with
realistic settings.

So, what I think we need to do is:

1) Sort writes on checkpoints. It's a clear win. Personally I'm perfectly
happy waving away the concerns about OOM. It's not that much memory:

# SELECT pg_size_pretty(24 * ((64::int8 * 1024 * 1024 * 1024) / 8192::int8));
┌────────────────┐
│ pg_size_pretty │
├────────────────┤
│ 192 MB         │
└────────────────┘

If you can afford 64GB of shared buffers, 192MB allocated in the
checkpointer is hardly a problem. (24 is sizeof(BufferTag) + sizeof(bufid).)

2) At least optionally flush dirty buffers during the buffer sync scan.
The improvements in steady performance are too big to ignore. I think we
might be able to make it an unconditional win by flushing in batches
instead of doing so for each buffer like I've done; see the sketch after
this list.

3) Possibly do the mdsync() after, and for, each filenode (or segment?),
like Heikki basically just prototyped. That'll reduce the number of dirty
buffers each fsync() flushes out unnecessarily (buffers which otherwise
sit in the kernel's buffer cache and create stalls once they are finally
flushed).
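To illustrate the batching in 2), here's a rough sketch of what the write
loop could look like on Linux, using sync_file_range() to initiate
writeback without waiting for its completion. A sketch only:
write_one_buffer() is a made-up helper, FLUSH_EVERY is an arbitrary knob,
and it assumes sorted writes so that a batch covers adjacent blocks of a
single file:

#define _GNU_SOURCE
#include <fcntl.h>

#define BLCKSZ		8192
#define FLUSH_EVERY	32			/* buffers between writeback requests */

/*
 * Hypothetical helper: write out one dirty 8kB buffer, returning the fd
 * it was written to and, via *off, the file offset it landed at.
 */
extern int	write_one_buffer(int buf_id, off_t *off);

static void
write_with_batched_flushes(const int *bufs, int n)
{
	int			pending = 0;	/* written, writeback not yet requested */
	int			batch_fd = -1;
	off_t		batch_start = 0;
	int			i;

	for (i = 0; i < n; i++)
	{
		off_t		off;
		int			fd = write_one_buffer(bufs[i], &off);

		/* crossed into another file: flush what has accumulated so far */
		if (pending > 0 && fd != batch_fd)
		{
			sync_file_range(batch_fd, batch_start,
							(off_t) pending * BLCKSZ,
							SYNC_FILE_RANGE_WRITE);
			pending = 0;
		}

		if (pending == 0)
		{
			batch_fd = fd;
			batch_start = off;	/* sorted writes: blocks are adjacent */
		}

		if (++pending == FLUSH_EVERY)
		{
			/* ask the kernel to start writeback, without waiting on it */
			sync_file_range(batch_fd, batch_start,
							(off_t) pending * BLCKSZ,
							SYNC_FILE_RANGE_WRITE);
			pending = 0;
		}
	}

	if (pending > 0)
		sync_file_range(batch_fd, batch_start,
						(off_t) pending * BLCKSZ,
						SYNC_FILE_RANGE_WRITE);
}

Requesting writeback continuously like this keeps the amount of dirty data
the final fsync() has to write small, which is what the sync=0.079 s /
longest=0.032 s numbers above reflect for the per-buffer variant.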
> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I think that'll perform horribly with large shared_buffers and many
relations, since it rescans the entire buffer pool once per segment. Not
an uncommon scenario...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services