Further updates:

Yesterday checkpoints were finishing more or less on time with the 
configuration for 25 minutes out of 30 minutes, taking 26 minutes at most.

So for today I reduced the time reserved for checkpoint writes to 20 minutes 
out of 30 minutes, by setting checkpoint_completion_target to 0.66 instead of 
0.83, and I disabled the bgwriter completely by setting bgwriter_lru_maxpages 
from 100 to 0. 

I expected this might cause the checkpoints to take more time, and not finish 
in 20 minutes, but that didn’t happen.

For the first part of the day, when the activity was lower, checkpoints 
finished almost on time, less than 30 seconds more than configured:

2017-10-06 10:30:56.248 EDT [4927] LOG:  checkpoint starting: time
2017-10-06 10:51:26.227 EDT [4927] LOG:  checkpoint complete: wrote 4364979 
buffers (23.1%); 0 transaction log file(s) added, 0 removed, 1936 recycled; 
write=1187.940 s, sync=1.888 s, total=1229.979 s; sync files=744, longest=0.041 
s, average=0.002 s; distance=31073898 kB, estimate=32879405 kB

Then later, as the activity increased, something I didn’t expect happened. The 
checkpoints started to finish much sooner than configured, in 16:38 minutes 
instead of 20 minutes, so 3:22 minutes earlier than configured:

2017-10-06 11:00:56.347 EDT [4927] LOG:  checkpoint starting: time
2017-10-06 11:17:34.778 EDT [4927] LOG:  checkpoint complete: wrote 5376422 
buffers (28.5%); 0 transaction log file(s) added, 0 removed, 1897 recycled; 
write=959.908 s, sync=1.578 s, total=998.431 s; sync files=772, longest=0.045 
s, average=0.002 s; distance=37712624 kB, estimate=37712624 kB

Then, even stranger, the next checkpoint started sooner than configured, by 
2:11 minutes, with the reason “xlog” instead of “time”:

2017-10-06 11:27:48.754 EDT [4927] LOG:  checkpoint starting: xlog
2017-10-06 11:42:00.131 EDT [4927] LOG:  checkpoint complete: wrote 6123787 
buffers (32.4%); 0 transaction log file(s) added, 239 removed, 2063 recycled; 
write=806.887 s, sync=1.032 s, total=851.377 s; sync files=767, longest=0.052 
s, average=0.001 s; distance=40254253 kB, estimate=40254253 kB

All subsequent checkpoints for the rest of the day look similar, they start 
with the reason “xlog”, and finish in around 15 minutes each instead of 20, and 
the time between two checkpoint starts is around 25 minutes instead of 30.

I’m not sure what causes these early finishes and early starts. The checkpoint 
start message is not very helpful in determining why it started early, in my 
opinion. It’s saying “xlog”, but a more verbose description, maybe even 
including some numbers it based its decision to start the checkpoint on, would 
be more helpful.

I’m guessing the reason for starting a checkpoint is that it estimated the 96 
GB reserved with max_wal_size would be filled unless it starts a checkpoint 
earlier than configured. I’ll increase the max_wal_size from 96 GB to 144 GB 
for Monday, to see if it changes this behavior.

Thanks,
Vlad

Reply via email to