I have a large database, 1.3 TB, with quite a bit of write activity. The 
machine has 2 CPUs x 6 cores x 2 threads (2 x E5-2630 v2 @ 2.60GHz) and 4 x EVO 
Pro 2TB SSDs in a software RAID 1+0 configuration, on a SATA 3 controller.

The machine has a lot of memory, 384 GB, so it doesn’t do a lot of reads, as 
most of the frequently accessed data is cached. 

Reads are between 5 and 10 MB/s on each SSD, giving a total of about 20-30 MB/s 
for the entire RAID volume.

Each SSD writes 30-40 MB/s on average. Since the array is two mirrored pairs, 
each pair does 60-80 MB/s of physical writes, so 120-160 MB/s of physical 
writes in total (roughly 60-80 MB/s of logical writes to the volume).

According to AnandTech benchmarks, those SSDs should be able to do around 90 
MB/s of random reads and 250 MB/s of random writes at a 4 kB block size, so in 
theory I shouldn’t hit I/O limits even if all I/O were random and very small.

Based on that, I estimate the SSDs should be about 10% busy from the reads 
(5-10 MB/s out of ~90 MB/s) and about 15% busy from the writes (30-40 MB/s out 
of ~250 MB/s), so maybe 25% total.

However, sar -d shows %util on those SSDs between 70% and 80% under the normal 
daily load. With SSDs you can’t fully trust the %util reported by sar, because 
SSDs perform best at large queue sizes. The average queue size reported by sar 
is between 5 and 10, so I’m not entirely sure whether to trust the %util 
values. A queue size of 8 or larger should be enough to keep most SSDs busy, 
but in that case I would expect %util to be 100% all the time.

Still, based on what sar is showing, I am concerned about hitting an I/O 
performance wall as the activity on that database might increase in the future, 
and I was trying to determine the actual limits, especially for writes. While 
doing that, I started to investigate the checkpoint behavior.

The database is configured with:

        - the default checkpoint_timeout of 5 minutes (not set explicitly). 
It’s probably not long enough; I plan to increase it to something like 30 
minutes, or more.
        - checkpoint_warning at the default of 30 seconds. I’m not sure whether 
that is measured between the end of the previous checkpoint and the start of 
the next one, or between the start times (or end times) of two consecutive 
checkpoints.
        - checkpoint_completion_target of 0.85, so it should spread its writes 
over 4:15 of every 5-minute interval.
        - max_wal_size = 16 GB; I plan to increase it to something like 96 GB.
        - 96 GB of shared_buffers. I may increase them to something like 144 GB.
        - bgwriter with bgwriter_lru_maxpages = 400 and the default 200 ms 
delay. That should mean at most 400 x 5 x 8 kB pages per second, so about 16 
MB/s from the bgwriter. It’s probably too much and I should have left it at the 
default of 100 pages, or even lower. In fact I’m not even sure why I need the 
bgwriter process at all instead of just letting checkpoints do the writes. In 
any case, even at 400 pages it should still leave some work for the checkpoint 
process.
        - wal_buffers of 32 MB. I had it as high as 256 MB at some point, but 
values above 16 MB don’t seem to help, so I will reduce it to 16 MB.
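
Putting the planned changes together, postgresql.conf would end up looking 
roughly like this (exact values still to be decided):

shared_buffers = 96GB                  # maybe 144GB later
checkpoint_timeout = 30min             # currently at the default of 5min
checkpoint_completion_target = 0.85
checkpoint_warning = 30s               # default
max_wal_size = 96GB                    # currently 16GB
bgwriter_lru_maxpages = 100            # currently 400; maybe even lower
bgwriter_delay = 200ms                 # default
wal_buffers = 16MB                     # currently 32MB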

The log of the last checkpoints looks like this when the machine is idle in the 
morning:

2017-10-04 08:43:00.283 EDT [7831] LOG:  checkpoint starting: time
2017-10-04 08:47:15.887 EDT [7831] LOG:  checkpoint complete: wrote 19290 
buffers (0.2%); 0 transaction log file(s) added, 4 removed, 0 recycled; 
write=254.787 s, sync=0.794 s, total=255.604 s; sync files=274, longest=0.033 
s, average=0.002 s; distance=61864 kB, estimate=322398 kB

So it takes the 5-minute interval and the 0.85 completion target into account 
correctly: 300 s x 0.85 = 255 seconds, and it did indeed spread the writes over 
about 255 seconds.

However, later during the day, as the activity increases, it starts to look 
like this:

2017-10-04 08:48:00.932 EDT [7831] LOG:  checkpoint starting: time
2017-10-04 08:55:27.179 EDT [7831] LOG:  checkpoint complete: wrote 635481 
buffers (5.1%); 0 transaction log file(s) added, 4 removed, 0 recycled; 
write=442.133 s, sync=4.003 s, total=446.246 s; sync files=445, longest=0.118 
s, average=0.008 s; distance=5227434 kB, estimate=5227434 kB

First, a checkpoint finished in 446 seconds instead of 255. 635481 buffers in 
446 seconds is about 1425 buffers/second, around 12 MB/second. I would say 
that’s not a huge amount of I/O for this hardware. Then the next one starts 
immediately:

2017-10-04 08:55:27.180 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 09:05:23.391 EDT [7831] LOG:  checkpoint complete: wrote 2090424 
buffers (16.6%); 0 transaction log file(s) added, 319 removed, 0 recycled; 
write=595.690 s, sync=0.468 s, total=596.211 s; sync files=454, longest=0.020 
s, average=0.001 s; distance=11415997 kB, estimate=11415997 kB

This one takes even longer: 596 seconds. 2090424 buffers, about 3500 
buffers/second, around 29 MB/second. That should still not be a big deal; it 
would mean writing at around 15 MB/s on each SSD just for checkpoints.

So I’m not sure what is going on here. I could understand the checkpoint 
process being surprised by the sudden increase in writes at the beginning of 
the day, with the first 2-3 checkpoints taking more than the configured 
duration. But even the subsequent checkpoints follow the same pattern, taking 
more than 255 seconds, and sometimes more than 300 seconds:

2017-10-04 09:05:23.392 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 09:12:54.799 EDT [7831] LOG:  checkpoint complete: wrote 1342046 
buffers (10.7%); 0 transaction log file(s) added, 697 removed, 0 recycled; 
write=450.554 s, sync=0.791 s, total=451.406 s; sync files=706, longest=0.014 
s, average=0.001 s; distance=8478044 kB, estimate=11122202 kB
2017-10-04 09:12:54.801 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 09:17:49.897 EDT [7831] LOG:  checkpoint complete: wrote 1220559 
buffers (9.7%); 0 transaction log file(s) added, 376 removed, 141 recycled; 
write=292.734 s, sync=0.677 s, total=295.096 s; sync files=690, longest=0.026 
s, average=0.000 s; distance=8551073 kB, estimate=10865089 kB
2017-10-04 09:17:54.903 EDT [7831] LOG:  checkpoint starting: time
2017-10-04 09:22:19.092 EDT [7831] LOG:  checkpoint complete: wrote 864467 
buffers (6.9%); 0 transaction log file(s) added, 213 removed, 309 recycled; 
write=258.673 s, sync=1.379 s, total=264.188 s; sync files=672, longest=0.034 
s, average=0.002 s; distance=5986866 kB, estimate=10377266 kB
2017-10-04 09:22:54.127 EDT [7831] LOG:  checkpoint starting: time
2017-10-04 09:27:22.678 EDT [7831] LOG:  checkpoint complete: wrote 962442 
buffers (7.6%); 0 transaction log file(s) added, 176 removed, 190 recycled; 
write=265.684 s, sync=0.603 s, total=268.550 s; sync files=689, longest=0.018 
s, average=0.000 s; distance=6278971 kB, estimate=9967437 kB
2017-10-04 09:27:22.679 EDT [7831] LOG:  checkpoint starting: xlog
2017-10-04 09:32:08.082 EDT [7831] LOG:  checkpoint complete: wrote 1206510 
buffers (9.6%); 0 transaction log file(s) added, 200 removed, 183 recycled; 
write=282.610 s, sync=0.622 s, total=285.402 s; sync files=683, longest=0.013 
s, average=0.000 s; distance=7412886 kB, estimate=9711982 kB
2017-10-04 09:32:22.097 EDT [7831] LOG:  checkpoint starting: time
2017-10-04 09:36:51.049 EDT [7831] LOG:  checkpoint complete: wrote 977391 
buffers (7.8%); 0 transaction log file(s) added, 218 removed, 234 recycled; 
write=264.946 s, sync=0.939 s, total=268.952 s; sync files=671, longest=0.032 
s, average=0.001 s; distance=6576530 kB, estimate=9398437 kB
2017-10-04 09:37:09.023 EDT [7831] LOG:  checkpoint starting: xlog
2017-10-04 09:41:41.421 EDT [7831] LOG:  checkpoint complete: wrote 1051219 
buffers (8.4%); 0 transaction log file(s) added, 245 removed, 157 recycled; 
write=269.147 s, sync=0.996 s, total=272.398 s; sync files=699, longest=0.026 
s, average=0.001 s; distance=6699634 kB, estimate=9128556 kB
2017-10-04 09:41:41.423 EDT [7831] LOG:  checkpoint starting: xlog
2017-10-04 09:46:36.460 EDT [7831] LOG:  checkpoint complete: wrote 1238858 
buffers (9.8%); 0 transaction log file(s) added, 260 removed, 149 recycled; 
write=292.544 s, sync=0.833 s, total=295.037 s; sync files=716, longest=0.033 
s, average=0.001 s; distance=7551236 kB, estimate=8970824 kB
2017-10-04 09:46:36.462 EDT [7831] LOG:  checkpoint starting: xlog
2017-10-04 09:51:03.839 EDT [7831] LOG:  checkpoint complete: wrote 1038125 
buffers (8.3%); 0 transaction log file(s) added, 177 removed, 283 recycled; 
write=263.307 s, sync=0.718 s, total=267.377 s; sync files=708, longest=0.027 
s, average=0.001 s; distance=6799399 kB, estimate=8753682 kB
2017-10-04 09:51:36.871 EDT [7831] LOG:  checkpoint starting: time
2017-10-04 09:56:30.606 EDT [7831] LOG:  checkpoint complete: wrote 913761 
buffers (7.3%); 0 transaction log file(s) added, 152 removed, 263 recycled; 
write=288.352 s, sync=0.933 s, total=293.735 s; sync files=697, longest=0.027 
s, average=0.001 s; distance=5756934 kB, estimate=8454007 kB
2017-10-04 09:56:30.607 EDT [7831] LOG:  checkpoint starting: xlog
2017-10-04 10:02:44.340 EDT [7831] LOG:  checkpoint complete: wrote 1080696 
buffers (8.6%); 0 transaction log file(s) added, 189 removed, 163 recycled; 
write=371.377 s, sync=0.569 s, total=373.732 s; sync files=712, longest=0.014 
s, average=0.000 s; distance=6734628 kB, estimate=8282069 kB
2017-10-04 10:02:44.341 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 10:10:13.903 EDT [7831] LOG:  checkpoint complete: wrote 1200040 
buffers (9.5%); 0 transaction log file(s) added, 360 removed, 51 recycled; 
write=448.221 s, sync=0.760 s, total=449.561 s; sync files=696, longest=0.021 
s, average=0.001 s; distance=7387465 kB, estimate=8192609 kB
2017-10-04 10:10:13.905 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 10:16:33.351 EDT [7831] LOG:  checkpoint complete: wrote 1472230 
buffers (11.7%); 0 transaction log file(s) added, 401 removed, 50 recycled; 
write=377.834 s, sync=0.979 s, total=379.445 s; sync files=700, longest=0.028 
s, average=0.001 s; distance=8555349 kB, estimate=8555349 kB
2017-10-04 10:16:33.352 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 10:21:58.810 EDT [7831] LOG:  checkpoint complete: wrote 1250086 
buffers (9.9%); 0 transaction log file(s) added, 378 removed, 144 recycled; 
write=322.737 s, sync=0.888 s, total=325.457 s; sync files=680, longest=0.031 
s, average=0.001 s; distance=7402273 kB, estimate=8440041 kB
2017-10-04 10:21:58.811 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 10:27:09.544 EDT [7831] LOG:  checkpoint complete: wrote 1171682 
buffers (9.3%); 0 transaction log file(s) added, 264 removed, 188 recycled; 
write=307.983 s, sync=0.796 s, total=310.732 s; sync files=688, longest=0.026 
s, average=0.001 s; distance=7018008 kB, estimate=8297838 kB
2017-10-04 10:27:09.546 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 10:34:16.507 EDT [7831] LOG:  checkpoint complete: wrote 1088387 
buffers (8.6%); 0 transaction log file(s) added, 343 removed, 85 recycled; 
write=425.445 s, sync=0.480 s, total=426.961 s; sync files=670, longest=0.012 
s, average=0.000 s; distance=6695430 kB, estimate=8137597 kB

Then, as the activity on the database increases even more, checkpoints start to 
take more than 10 minutes:

2017-10-04 10:34:16.509 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 10:45:29.233 EDT [7831] LOG:  checkpoint complete: wrote 1445250 
buffers (11.5%); 1 transaction log file(s) added, 409 removed, 0 recycled; 
write=671.755 s, sync=0.832 s, total=672.724 s; sync files=687, longest=0.039 
s, average=0.001 s; distance=8688405 kB, estimate=8688405 kB
2017-10-04 10:45:29.235 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 11:04:48.081 EDT [7831] LOG:  checkpoint complete: wrote 1929827 
buffers (15.3%); 1 transaction log file(s) added, 530 removed, 0 recycled; 
write=1156.147 s, sync=2.487 s, total=1158.846 s; sync files=704, longest=0.046 
s, average=0.003 s; distance=12851319 kB, estimate=12851319 kB

Then more than 20 minutes:

2017-10-04 11:04:48.082 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 11:25:55.806 EDT [7831] LOG:  checkpoint complete: wrote 2695524 
buffers (21.4%); 0 transaction log file(s) added, 784 removed, 0 recycled; 
write=1257.586 s, sync=10.070 s, total=1267.723 s; sync files=709, 
longest=0.117 s, average=0.014 s; distance=20710817 kB, estimate=20710817 kB
2017-10-04 11:25:55.807 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 11:48:08.366 EDT [7831] LOG:  checkpoint complete: wrote 2857107 
buffers (22.7%); 0 transaction log file(s) added, 1264 removed, 0 recycled; 
write=1326.199 s, sync=6.233 s, total=1332.558 s; sync files=726, longest=0.076 
s, average=0.008 s; distance=24439493 kB, estimate=24439493 kB
2017-10-04 11:48:08.368 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 12:11:39.550 EDT [7831] LOG:  checkpoint complete: wrote 3127276 
buffers (24.9%); 0 transaction log file(s) added, 1492 removed, 0 recycled; 
write=1403.735 s, sync=7.293 s, total=1411.182 s; sync files=732, longest=0.071 
s, average=0.009 s; distance=27967471 kB, estimate=27967471 kB
2017-10-04 12:11:39.552 EDT [7831] LOG:  checkpoint starting: xlog time
2017-10-04 12:32:28.853 EDT [7831] LOG:  checkpoint complete: wrote 2823207 
buffers (22.4%); 0 transaction log file(s) added, 1707 removed, 0 recycled; 
write=1236.280 s, sync=12.828 s, total=1249.300 s; sync files=739, 
longest=0.133 s, average=0.017 s; distance=28321252 kB, estimate=28321252 kB

There are no “checkpoints are occurring too frequently” messages in the logs. 
Since each checkpoint starts as soon as the previous one ends, i.e. well under 
the 30 seconds I configured for the warning, I would have expected warnings, 
but there aren’t any. Probably the warning measures start to start, or end to 
end, rather than end to start; the checkpoints themselves take far longer than 
30 seconds, so measured that way the interval never drops below the threshold. 
The lack of a warning confused me even more and cost me a lot of time trying to 
figure this out, until I noticed the long durations.

In any case, I still don’t fully understand what is happening here. It looks 
like I’m hitting some hardware limits, but it’s not obvious. The CPU usage is 
around 20% user, 4% system, and Linux also reports 10-15% iowait, so the 
machine looks a bit stressed on the I/O side and a bit less on the CPU side, 
but again, I can’t be sure. 

The longest checkpoint today, the one at 11:48 AM, took 1411 seconds and wrote 
3127276 buffers, about 2216 buffers/second. That’s just 18 MB/second, so about 
9 MB/second on each SSD from checkpoints. In fact, earlier in the day a shorter 
checkpoint managed 29 MB/second, so I think this larger one could have 
completed faster.

The sar disk I/O report looked like this during that 1411-second checkpoint:

11:55:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:05:01 PM   dev8-16   2478.72  17412.49  64107.22     32.89      7.02      2.83      0.31     77.79
12:05:01 PM    dev8-0   2139.80  10546.02  64153.14     34.91      7.16      3.34      0.35     75.27
12:05:01 PM   dev8-32   2155.46  10881.22  64075.18     34.78      6.84      3.17      0.35     75.08
12:05:01 PM   dev8-48   2497.76  17701.13  64153.14     32.77      7.90      3.16      0.31     77.52
12:05:01 PM    dev9-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:05:01 PM    dev9-1   5873.69  60126.95 126219.54     31.73      0.00      0.00      0.00      0.00

dev9-1 is the RAID array, and dev8-0, dev8-16, dev8-32, dev8-48 are the SSDs. 
Since rd_sec/s and wr_sec/s are in 512-byte sectors, that is around 32 MB/s of 
writes and 7 MB/s of reads on each SSD. In theory the SSDs should easily handle 
that level of I/O, so I suspect something else might be slowing the checkpoints 
down.
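
For what it’s worth, I assume pg_stat_bgwriter should show how the writes are 
split between the checkpointer, the bgwriter and the backends themselves, with 
something like:

select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, maxwritten_clean,
       buffers_backend, buffers_backend_fsync
from pg_stat_bgwriter;

If buffers_backend turned out to be large compared to buffers_checkpoint, a 
good part of the write traffic sar sees would not be coming from the 
checkpointer at all.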

Could there be some inefficiency or bug in the checkpoint process that keeps it 
from writing as fast as it should? Or is it something I may have configured 
improperly that triggers this behavior?

Thanks,
Vlad
