RE: Resetting synchronous_standby_names can wait for CHECKPOINT to finish

2024-05-13 Thread Yusuke Egashira (Fujitsu)
Hello,

> When the checkpointer process is busy, backend processes waiting in SyncRep 
> are not released until the checkpoint completes, even if 
> synchronous_standby_names has been reset.
> This prevents the prompt resumption of application processing when a problem 
> occurs on the standby server in a synchronous replication system.
> I confirmed this in PostgreSQL 12.18.

I have tested this issue on Postgres built from the master branch (17devel) and 
observed the same behavior where the backend SyncRep release is blocked until 
CHECKPOINT completion.

When a synchronous standby instance fails and has to be detached, I believe 
the current behavior is inappropriate: backends keep waiting in SyncRep even 
after synchronous_standby_names has been reset.
I don't think moving the SIGHUP processing in the checkpointer process carries 
much risk. Am I overlooking something?


Regards, 
Yusuke Egashira.





Resetting synchronous_standby_names can wait for CHECKPOINT to finish

2024-04-14 Thread Yusuke Egashira (Fujitsu)
Hello, hackers.

When the checkpointer process is busy, backend processes waiting in SyncRep 
are not released until the checkpoint completes, even if 
synchronous_standby_names has been reset.
This prevents the prompt resumption of application processing when a problem 
occurs on the standby server in a synchronous replication system.
I confirmed this in PostgreSQL 12.18.

This issue has actually become a major problem for our customer. 
When a problem occurred in the replication network, even after resetting 
synchronous_standby_names, the backend processes did not respond, resulting in 
timeout errors in many client applications. 
The customer has also set the checkpoint_completion_target parameter to 0.9, 
and it seems to have been working fine under normal conditions.
However, at one point VACUUM activity was concentrated on a huge table, and 
WAL amounting to more than five times max_wal_size was generated during 
checkpoint processing. 
Unfortunately, communication with the synchronous standby was lost during that 
checkpoint, and despite synchronous_standby_names being reset, multiple client 
applications remained stuck waiting in SyncRep and could not return a 
response.
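
To illustrate why such a WAL burst can leave the checkpoint permanently behind 
schedule, here is a toy Python model of the pacing check. It is only a sketch, 
not PostgreSQL source: the function name, its parameters and the simplified 
"WAL budget" are assumptions made for illustration. The idea, as far as the 
checkpointer's pacing logic is understood here, is that the fraction of 
checkpoint work done (scaled by checkpoint_completion_target) is compared 
with the fraction of the WAL and time budgets already consumed.

```python
def is_on_schedule(progress, completion_target, wal_written_mb, wal_budget_mb,
                   elapsed_s, checkpoint_timeout_s):
    """Toy model (hypothetical helper, not a PostgreSQL function).

    Returns True if the checkpoint is at least as far along as the share of
    its WAL budget and time budget that has already been used up.
    """
    scaled = progress * completion_target          # work done, scaled by the target
    wal_fraction = wal_written_mb / wal_budget_mb  # share of WAL budget consumed
    time_fraction = elapsed_s / checkpoint_timeout_s
    return scaled >= wal_fraction and scaled >= time_fraction


# Quiet system: half the work done, little WAL written, little time elapsed.
print(is_on_schedule(0.5, 0.9, 100, 1024, 60, 300))

# WAL burst: five times the budget has been written since the checkpoint began.
# Even a finished checkpoint (progress 1.0) counts as behind schedule here.
print(is_on_schedule(1.0, 0.9, 5 * 1024, 1024, 60, 300))
```

In this model the first call reports the checkpoint as on schedule and the 
second as behind, so under the burst there is never a moment at which the 
checkpoint is considered to be ahead.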


I wrote a script (reset-synchronous_standby_names-during-checkpoint.sh) to 
illustrate the issue. 
The script stops the synchronous standby during a transaction, and then resets 
synchronous_standby_names during checkpoint.
When I run this on my 1-core RHEL7 machine, I see that COMMIT does wait until 
the CHECKPOINT finishes, even though synchronous_standby_names has been reset.

I am attaching a patch (against REL_12_STABLE) for the simplest-seeming 
solution. 
This moves the checkpointer's handling of a received SIGHUP outside of the 
sleep path. 
However, I am concerned that this change could affect the performance of 
checkpoint execution when there is a delay in the checkpoint schedule.
Can PostgreSQL tolerate this overhead?
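
The effect of the proposed change can be sketched with a toy Python model of a 
checkpoint's write loop. This is not the attached patch and not PostgreSQL 
code; the function, its parameters and the per-write "on schedule" flags are 
all hypothetical. It only contrasts two placements of the reload check: on the 
sleep path only, or after every write.

```python
def run_checkpoint(on_schedule_per_write, reload_requested_at, reload_outside_sleep):
    """Toy model of a checkpoint write loop (hypothetical, for illustration).

    on_schedule_per_write: one bool per buffer write; True means the
        checkpoint is ahead of schedule and would nap after that write.
    reload_requested_at: index of the write during which SIGHUP arrives.
    reload_outside_sleep: False checks the reload flag only on the nap path;
        True checks it after every write.
    Returns the index of the write after which the reload is processed, or
    len(on_schedule_per_write) if it has to wait for the checkpoint to end.
    """
    reload_pending = False
    for i, on_schedule in enumerate(on_schedule_per_write):
        if i == reload_requested_at:
            reload_pending = True      # the signal handler only sets a flag
        if reload_outside_sleep and reload_pending:
            return i                   # reload handled right after this write
        if on_schedule and reload_pending:
            return i                   # reload handled just before the nap
    return len(on_schedule_per_write)


behind = [False] * 1000                # behind schedule for the whole checkpoint
print(run_checkpoint(behind, 10, reload_outside_sleep=False))  # prints 1000
print(run_checkpoint(behind, 10, reload_outside_sleep=True))   # prints 10
```

In the model, the added cost is one flag test per write. Whether that holds 
for the real checkpointer is exactly the question raised above.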

Regards, 
Yusuke Egashira.



reset-synchronous_standby_names-during-checkpoint.sh


v1-reset-synchronous_standby_names-timing.patch