No, it didn’t. The delete was done in a single transaction.
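
For illustration, it was a single statement of roughly this shape (the table
name and condition below are made up, not the actual code):

    BEGIN;
    -- hypothetical table and predicate, just to show the single-transaction shape
    DELETE FROM big_table WHERE event_date < DATE '2017-01-01';
    COMMIT;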

From: Achilleas Mantzios
Sent: Tuesday, October 10, 2017 17:18
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

Hello Vladimir,

Maybe your delete triggered autovacuum on those tables? The default
autovacuum_freeze_max_age is set at exactly 200,000,000. Did you check your
vacuum stats afterwards (pg_stat_*_tables)?
Can you show the code which performed the deletes?
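
For the vacuum stats, a query along these lines would show whether autovacuum
ran on the affected table ('your_table' below is just a placeholder):

    SELECT relname, last_autovacuum, last_autoanalyze,
           autovacuum_count, n_dead_tup
      FROM pg_stat_user_tables
     WHERE relname = 'your_table';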

On 10/10/2017 16:56, Vladimir Nicolici wrote:
I experimented some more with the settings this weekend, while doing some large 
write operations (deleting 200 million records from a table), and I realized 
that the database is capable of generating much more WAL than I estimated.
 
And it seems that spikes in write activity, when they last longer than a few 
minutes, can cause the checkpoint process to “panic”: it starts a checkpoint 
earlier and tries to complete it as soon as possible, estimating, correctly, 
that if that level of activity continues it will hit the max_wal_size limit.
 
Based on that, I reduced checkpoint_timeout from 30 minutes to 20 minutes, 
while keeping max_wal_size at 144GB. Alternatively, I could have increased 
max_wal_size further, but I’m not sure it’s a good idea to set it higher than 
shared_buffers, which is also set to 144GB. After this change, on Monday all 
checkpoints were triggered by “time”; I didn’t have any more checkpoints 
triggered by “xlog”.
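
The split between time-triggered and forced checkpoints can also be
cross-checked in pg_stat_bgwriter, with something like:

    SELECT checkpoints_timed,  -- checkpoints started by checkpoint_timeout
           checkpoints_req     -- checkpoints forced early (e.g. by WAL volume)
      FROM pg_stat_bgwriter;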
 
I also set checkpoint_completion_target to 0.5 to see if our hardware could 
handle concentrating 20 minutes’ worth of write activity into just 10 minutes, 
and that worked very well too: checkpoints finished on time. The %util (busy%) 
for the SSDs as reported by sar was around 20% outside of checkpoints and 60% 
during a checkpoint, so it seems the hardware will be able to handle future 
increases in activity just fine.
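
For reference, the settings described above would be applied roughly like this
(all three can be changed with a reload, no restart needed):

    ALTER SYSTEM SET checkpoint_timeout = '20min';
    ALTER SYSTEM SET max_wal_size = '144GB';
    ALTER SYSTEM SET checkpoint_completion_target = 0.5;
    SELECT pg_reload_conf();  -- apply the new settings without a restart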
 
The lesson I learned here is that max_wal_size needs to be configured based on 
the *maximum* volume of WAL the database can generate in the checkpoint_timeout 
interval. Initially I had set it based on the *average* volume of WAL generated 
in that interval, setting it to 3 times that average, but that was not enough 
and triggered the unexpected behavior.
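
To size it that way, the WAL volume over a busy interval can be measured by
capturing the WAL position at the start and end of a checkpoint_timeout window
and diffing the two (function names below are the PostgreSQL 10 ones; on 9.x
they are pg_current_xlog_location() and pg_xlog_location_diff()):

    SELECT pg_current_wal_lsn();  -- run at the start of the interval
    -- ... wait one checkpoint_timeout interval (e.g. 20 minutes) ...
    SELECT pg_current_wal_lsn();  -- run again at the end
    -- diff the two captured positions (the LSNs below are placeholders):
    SELECT pg_size_pretty(pg_wal_lsn_diff('2E1/7A000000', '2D0/14000000'));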
 
Thanks,
Vlad

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
