On Mon, Dec 8, 2025 at 7:34 AM Zhijie Hou (Fujitsu) <[email protected]> wrote:
>
> Hi,
>
> Previously, the slotsync worker used SIGINT to receive a graceful shutdown
> signal from the startup process on promotion. However, SIGINT is also used by
> the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
> slotsync worker can access and lock catalog tables while parsing libpq tuples,
> this overlapping use of SIGINT led to the slotsync worker ignoring
> LOCK_TIMEOUT signals and consequently waiting indefinitely on locks.
>
> I can reproduce the issue by:
>
> 1) create a failover replication slot for slotsync on primary.
> 2) start slotsync worker on standby and uses gdb to make the slotsync
>    worker block before accessing pg_type catalog via walrcv_exec ->
>    libpqrcv_exec -> libpqrcv_processTuples -> TupleDescInitEntry ->
>    SearchSysCache1.
> 3) take ACCESS EXCLUSIVE lock on pg_type on primary.
> 4) log standby snapshot to replicate the lock to standby.
> 5) release the slotsync worker, it will start waiting for the lock on pg_type
>    to be released. And on HEAD, it would not be canceled by the lock_timeout
>    setting.
>
> Here is a patch to resolve this by replacing the current signal handler with
> the appropriate StatementCancelHandler for SIGINT within the slotsync worker.
> Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
> slotsync of the need to stop during promotion. The slotsync worker now stops
> upon detecting that the shared memory flag (stopSignaled) is set to true.
>
> I did not add a tap-test in the patch for now. Although feasible, it requires
> a strong lock on a catalog and an injection point to control the
> process.
>

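For anyone following the thread, the separation described above can be
illustrated with a small standalone POSIX sketch. This is not the patch's code
and does not use PostgreSQL's signal or latch machinery. The names
stop_signaled, cancel_pending, wakeup_pending, request_stop and
check_interrupts are all hypothetical, and a plain static variable stands in
for the shared memory flag. The only point is that SIGINT is left free to mean
"cancel the current wait", while the stop request travels as a flag plus a
SIGUSR1 wakeup.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>             /* only needed for the usage checks */
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for the shared-memory flag (stopSignaled in the mail). */
static volatile sig_atomic_t stop_signaled = 0;
/* Set by SIGINT: models a query-cancel request, e.g. from lock_timeout. */
static volatile sig_atomic_t cancel_pending = 0;
/* Set by SIGUSR1: models "wake up and re-check shared state". */
static volatile sig_atomic_t wakeup_pending = 0;

static void on_sigint(int signo)  { (void) signo; cancel_pending = 1; }
static void on_sigusr1(int signo) { (void) signo; wakeup_pending = 1; }

/* Install both handlers; returns 0 on success, -1 on failure. */
static int
install_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    sa.sa_handler = on_sigint;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;

    sa.sa_handler = on_sigusr1;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    return 0;
}

/* What the promoting side would do: set the flag first, then send the wakeup. */
static void
request_stop(void)
{
    stop_signaled = 1;
    raise(SIGUSR1);             /* same process here; kill(pid, ...) otherwise */
}

typedef enum
{
    WORKER_CONTINUE,
    WORKER_CANCEL_WAIT,
    WORKER_STOP
} WorkerAction;

/* Called from the worker loop: a stop request takes priority over a cancel. */
static WorkerAction
check_interrupts(void)
{
    if (wakeup_pending)
    {
        wakeup_pending = 0;
        if (stop_signaled)
            return WORKER_STOP;
    }
    if (cancel_pending)
    {
        cancel_pending = 0;
        return WORKER_CANCEL_WAIT;
    }
    return WORKER_CONTINUE;
}
```

In this sketch, calling raise(SIGINT) after install_handlers() makes the next
check_interrupts() return WORKER_CANCEL_WAIT without stopping the worker, and
request_stop() makes it return WORKER_STOP. With a single signal serving both
purposes, those two outcomes cannot be told apart.
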
Thanks for the patch. I agree with the issue mentioned and can reproduce it on
HEAD; verified that the patch fixes it. The patch looks good to me.

thanks
Shveta
