Hello, hackers.
When the checkpointer process is busy, backend processes waiting in SyncRep
are not released until the checkpoint completes, even if
synchronous_standby_names has been reset.
This prevents the prompt resumption of application processing when a problem
occurs on the standby server in a synchronous replication system.
I confirmed this in PostgreSQL 12.18.
This issue has actually become a major problem for our customer.
When a problem occurred in the replication network, even after resetting
synchronous_standby_names, the backend processes did not respond, resulting in
timeout errors in many client applications.
The customer has set checkpoint_completion_target to 0.9, and this appears to
have worked fine under normal conditions.
However, at one point VACUUM activity was concentrated on a huge table, and the
WAL generated during a single checkpoint exceeded five times max_wal_size.
Unfortunately, communication with the synchronous standby was lost during that
checkpoint, and although synchronous_standby_names was reset, multiple client
applications received no response because their backends were still waiting in
SyncRep.
I wrote a script (reset-synchronous_standby_names-during-checkpoint.sh) to
illustrate the issue.
The script stops the synchronous standby during a transaction, and then resets
synchronous_standby_names while a checkpoint is in progress.
When I run this on my 1-core RHEL7 machine, I see that COMMIT does wait until
the CHECKPOINT finishes, even though synchronous_standby_names has been reset.
I am attaching a patch (against REL_12_STABLE) for what seems to be the
simplest solution.
It moves the checkpointer's SIGHUP handling out of its sleep processing.
However, I am concerned that this change could affect checkpoint performance
when the checkpoint is behind schedule.
Can PostgreSQL tolerate this overhead?
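To make the proposed change easier to discuss, here is a toy model in C of the
timing as I understand it. It is not the checkpointer source and not the patch
itself: ToyCheckpoint, toy_reload_processed_at and every field name are
hypothetical, and one loop iteration stands in for one buffer write. The model
assumes that the unpatched code looks at a pending reload only on the branch
where the checkpoint is on schedule and naps, while the patched code looks at
it on every write.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; all names here are hypothetical, not PostgreSQL symbols. */
typedef struct ToyCheckpoint
{
    int nbuffers;     /* buffers written by this checkpoint */
    int sighup_at;    /* write index at which the reload request arrives */
    int behind_from;  /* from this write index on, we are behind schedule */
} ToyCheckpoint;

/*
 * Return the write index at which the reload is processed, or nbuffers
 * if it is processed only after the checkpoint has finished.
 */
int
toy_reload_processed_at(const ToyCheckpoint *cp, bool reload_outside_sleep)
{
    bool got_sighup = false;

    assert(cp->nbuffers >= 0 && cp->sighup_at >= 0 && cp->behind_from >= 0);

    for (int i = 0; i < cp->nbuffers; i++)
    {
        bool on_schedule = (i < cp->behind_from);

        if (i >= cp->sighup_at)
            got_sighup = true;      /* the signal handler has set the flag */

        /* Patched variant: test the flag on every write. */
        if (reload_outside_sleep && got_sighup)
            return i;

        if (on_schedule)
        {
            /* Unpatched variant: the flag is tested only before napping. */
            if (got_sighup)
                return i;
            /* ... the nap would happen here ... */
        }
        /* Behind schedule: no nap, keep writing to catch up. */
    }

    return cp->nbuffers;
}
```

With nbuffers = 1000, sighup_at = 300 and behind_from = 100, the unpatched
variant returns 1000 (the reload waits for the end of the checkpoint) and the
patched variant returns 300. When the checkpoint never falls behind schedule,
both variants return 300. In this model the extra work in the patched variant
is one flag test per write; the model says nothing about the cost of the
reload itself.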
Regards,
Yusuke Egashira.
Attachments:
reset-synchronous_standby_names-during-checkpoint.sh
v1-reset-synchronous_standby_names-timing.patch