Re: [HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Simon Riggs
On 26 June 2017 at 19:06, Tom Lane wrote:
> I wrote:
>> So this looks like a pretty obvious race condition in the postmaster,
>> which should be resolved by having it set a flag on receipt of
>> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
>> new walreceiver.
>
> Concretely,

Re: [HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Tom Lane
I wrote:
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver.
Concretely, I propose the attached patch. Together with reduci
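For readers following the thread, the flag-based fix described above can be sketched roughly as follows. This is only an illustration, not the attached patch: the variable and function names (walreceiver_requested, walreceiver_pid, maybe_start_walreceiver, and so on) are hypothetical stand-ins for postmaster state, and the scenario it models (a start request arriving while the old walreceiver has not yet been reaped) is an assumption about how the race plays out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for postmaster state; not PostgreSQL's real variables. */
static bool walreceiver_requested = false;  /* set on signal, cleared only on launch */
static int  walreceiver_pid = 0;            /* 0 means no walreceiver is running */
static int  next_pid = 100;                 /* fake PID source for this sketch */

/* Models receipt of PMSIGNAL_START_WALRECEIVER: just remember the request. */
static void on_start_walreceiver_signal(void)
{
    walreceiver_requested = true;
}

/* Models the postmaster reaping the old walreceiver's exit. */
static void on_walreceiver_exit(void)
{
    walreceiver_pid = 0;
}

/*
 * Called from the main loop and after reaping a child.  Launches a new
 * walreceiver only if one was requested and none is running.  The flag is
 * cleared only when a launch actually happens, so a request that arrives
 * while the old process is still around is deferred rather than dropped.
 */
static bool maybe_start_walreceiver(void)
{
    if (!walreceiver_requested || walreceiver_pid != 0)
        return false;

    assert(walreceiver_pid == 0);
    walreceiver_pid = next_pid++;   /* pretend to fork a new walreceiver */
    walreceiver_requested = false;
    return true;
}
```

The point of the design is that the request survives until it can be honored: without the flag, a signal that arrives before the old walreceiver's exit is processed would simply be forgotten, and the startup process would have to time out and ask again.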

Re: [HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Andres Freund
On 2017-06-26 13:42:52 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> >> ... But I wonder whether it's intentional that the old
> >> walreceiver dies in the first place. That FATAL exit looks suspiciously
> >> like it wasn't originally-designed-

Re: [HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Tom Lane
Andres Freund writes:
> On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
>> ... But I wonder whether it's intentional that the old
>> walreceiver dies in the first place. That FATAL exit looks suspiciously
>> like it wasn't originally-designed-in behavior.
> It's quite intentional afaik - I've comp

Re: [HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Andres Freund
Hi,
On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs. It goes like this:
>
> * While streaming from master to slave, test script shuts down master
> while slave is left running. We soon restart the

[HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Tom Lane
I've found another edge-case bug through investigation of unexpectedly slow recovery test runs. It goes like this:
* While streaming from master to slave, test script shuts down master while slave is left running. We soon restart the master, but meanwhile:
* slave's walreceiver process fails, r