On Wed, Apr 26, 2017 at 3:17 AM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 4/21/17 00:11, Michael Paquier wrote:
>> Hmm. I have been actually looking at this solution and I am having
>> doubts regarding its robustness. In short this would need to be
>> roughly a two-step process:
>> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to
>> make it call ShutdownXLOG(). Prior doing that, a first signal should
>> be sent to all the WAL senders with
>> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
>> used.
>> - At reception of this signal, all WAL senders switch to a stopping
>> state, refusing commands that can generate WAL.
>> - Checkpointer looks at the state of all WAL senders, looping with a
>> sleep call of a couple of ms, refusing to launch the shutdown
>> checkpoint as long as all WAL senders have not switched to the
>> stopping state.
>> - In reaper(), once checkpointer is confirmed as stopped, signal again
>> the WAL senders, and tell them to perform the last loop.
>
> Yeah that looks like a reasonable approach.
>
> I'm not sure why in your patch you process got_SIGUSR2 in
> WalSndErrorCleanup() instead of in the main loop.

Yes, I was hesitating about this one when hacking on it. Thinking about
it some more, the similar check in StartReplication() should not use
got_SIGUSR2 either, so that the WAL sender gets a chance to do more work
while the shutdown checkpoint is running, as that could take minutes.
Attached is an updated patch to reflect that.
-- 
Michael
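
For readers following the thread, the shutdown ordering quoted above can
be modeled as a small standalone state machine. This is only a rough
sketch and not the code in the attached patch: every name in it (Cluster,
SenderState, postmaster_signal_stopping() and so on) is hypothetical,
signals are modeled as plain flags, and the checkpointer's sleep-and-retry
loop is reduced to a function that returns false when the caller would
have to wait and try again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SENDERS 8

/* Hypothetical WAL sender states, for illustration only. */
typedef enum {
    WS_RUNNING,   /* normal operation, WAL-generating commands allowed */
    WS_STOPPING,  /* first signal seen, WAL-generating commands refused */
    WS_EXITED     /* second signal seen, last loop done */
} SenderState;

typedef struct {
    SenderState state[MAX_SENDERS];
    bool stop_requested[MAX_SENDERS];      /* models the first signal */
    bool last_loop_requested[MAX_SENDERS]; /* models the second signal */
    size_t n_senders;
    bool checkpoint_done;
} Cluster;

void cluster_init(Cluster *c, size_t n)
{
    assert(n <= MAX_SENDERS);
    c->n_senders = n;
    c->checkpoint_done = false;
    for (size_t i = 0; i < n; i++) {
        c->state[i] = WS_RUNNING;
        c->stop_requested[i] = false;
        c->last_loop_requested[i] = false;
    }
}

/* Step 1: postmaster signals every WAL sender before the checkpointer. */
void postmaster_signal_stopping(Cluster *c)
{
    for (size_t i = 0; i < c->n_senders; i++)
        c->stop_requested[i] = true;
}

/* Each WAL sender reacts to pending signals in its own main loop. */
void walsender_handle_signals(Cluster *c, size_t i)
{
    if (c->stop_requested[i] && c->state[i] == WS_RUNNING)
        c->state[i] = WS_STOPPING;
    if (c->last_loop_requested[i] && c->state[i] == WS_STOPPING)
        c->state[i] = WS_EXITED; /* remaining WAL sent, then exit */
}

/* A stopping WAL sender refuses commands that can generate WAL. */
bool walsender_accepts_wal_command(const Cluster *c, size_t i)
{
    return c->state[i] == WS_RUNNING;
}

/*
 * Step 2: the checkpointer refuses to start the shutdown checkpoint
 * while any WAL sender is still running. A false return means
 * "sleep a few ms and retry".
 */
bool checkpointer_try_shutdown_checkpoint(Cluster *c)
{
    for (size_t i = 0; i < c->n_senders; i++)
        if (c->state[i] == WS_RUNNING)
            return false;
    c->checkpoint_done = true;
    return true;
}

/* Step 3: only after the checkpoint, ask senders for their last loop. */
bool postmaster_signal_last_loop(Cluster *c)
{
    if (!c->checkpoint_done)
        return false;
    for (size_t i = 0; i < c->n_senders; i++)
        c->last_loop_requested[i] = true;
    return true;
}
```

Driving this from a main() with two senders shows the ordering: the
checkpoint attempt keeps failing until both senders have handled the
first signal, and the last-loop request is rejected until the checkpoint
has completed.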
walsender-chkpt-v2.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers