Hi,

Previously, the slotsync worker used SIGINT to receive a graceful shutdown signal from the startup process on promotion. However, SIGINT is also used by the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the slotsync worker can access and lock catalog tables while parsing libpq tuples, this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT signals and consequently waiting indefinitely on locks.
I can reproduce the issue as follows:

1) Create a failover replication slot for slotsync on the primary.
2) Start the slotsync worker on the standby and use gdb to make it block just before it accesses the pg_type catalog via walrcv_exec -> libpqrcv_exec -> libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) Take an ACCESS EXCLUSIVE lock on pg_type on the primary.
4) Log a standby snapshot so that the lock is replicated to the standby.
5) Release the slotsync worker; it starts waiting for the lock on pg_type to be released.

On HEAD, this wait is not canceled by the lock_timeout setting.

Here is a patch to resolve this by replacing the current SIGINT handler in the slotsync worker with the appropriate StatementCancelHandler. Furthermore, the patch updates the startup process to send SIGUSR1 to notify the slotsync worker that it needs to stop during promotion. The slotsync worker now stops upon detecting that the shared memory flag (stopSignaled) is set to true.

I did not add a TAP test to the patch for now. Although feasible, it would require a strong lock on a catalog and an injection point to control the process.

Best Regards,
Hou zj
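To illustrate the signal split described above, here is a rough standalone sketch. It is not code from the patch and does not use PostgreSQL internals. It only shows the idea: SIGINT is reserved for query cancel, while a stop request is carried by a flag plus a SIGUSR1 wakeup. All names here (stop_signaled, query_cancel_pending, worker_install_handlers, request_worker_stop, worker_check_interrupts) are made up for illustration, and the stop flag is a plain static variable standing in for the shared-memory stopSignaled flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the shared-memory stopSignaled flag. */
static volatile sig_atomic_t stop_signaled = 0;
/* Hypothetical stand-in for a pending query-cancel interrupt. */
static volatile sig_atomic_t query_cancel_pending = 0;
/* Set by SIGUSR1; only tells the worker to re-check its state. */
static volatile sig_atomic_t wakeup_pending = 0;

typedef enum
{
    WORKER_CONTINUE,
    WORKER_CANCEL_QUERY,
    WORKER_STOP
} WorkerAction;

/* SIGINT now means "cancel the current query" and nothing else. */
static void
cancel_handler(int signo)
{
    (void) signo;
    query_cancel_pending = 1;
}

/* SIGUSR1 carries no meaning by itself; it just wakes the worker. */
static void
wakeup_handler(int signo)
{
    (void) signo;
    wakeup_pending = 1;
}

/* Worker side: install separate handlers for cancel and wakeup. */
void
worker_install_handlers(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    sa.sa_handler = cancel_handler;
    rc = sigaction(SIGINT, &sa, NULL);
    assert(rc == 0);

    sa.sa_handler = wakeup_handler;
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void) rc;
}

/* Startup-process side: set the flag first, then wake the worker. */
void
request_worker_stop(pid_t worker_pid)
{
    stop_signaled = 1;
    kill(worker_pid, SIGUSR1);
}

/* Worker side: decide what to do at a safe point in the main loop. */
WorkerAction
worker_check_interrupts(void)
{
    wakeup_pending = 0;

    if (stop_signaled)
        return WORKER_STOP;

    if (query_cancel_pending)
    {
        query_cancel_pending = 0;
        return WORKER_CANCEL_QUERY;
    }

    return WORKER_CONTINUE;
}
```

Because the stop request lives in a flag rather than in the signal itself, a SIGINT sent on lock timeout can no longer be mistaken for a shutdown request, and vice versa.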
v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch
