Re: Possible crash on standby

2022-09-09 Thread Nathan Bossart
On Fri, Sep 09, 2022 at 10:51:10PM +0530, Bharath Rupireddy wrote:
> I think it is a duplicate of [1]. I have tested the above use-case
> with the patch at [1] and it fixes the issue.

I added this thread to the existing commitfest entry.  Thanks for pointing
this out.

https://commitfest.postgresql.org/39/3814

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Possible crash on standby

2022-09-09 Thread Bharath Rupireddy
On Fri, Sep 9, 2022 at 2:00 PM Kyotaro Horiguchi wrote:
>
> Hello.
>
> While I was playing with a patch, I hit an assertion failure.
>
> #2  0x00b350e0 in ExceptionalCondition (
> conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()",
> errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c",
> lineNumber=4190) at assert.c:69
> #3  0x00586f9c in XLogFileRead (segno=61, emode=13, tli=1,
> source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190
> #4  0x005871d2 in XLogFileReadAnyTLI (segno=61, emode=13,
> source=XLOG_FROM_ANY) at xlogrecovery.c:4296
> #5  0x0058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360,
> randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1,
> replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727
>
> This is reproducible with the following steps.
>
> 1. insert a sleep(1) in WaitForWALToBecomeAvailable().
> >* WAL that we restore from archive.
> >*/
> > + sleep(1);
> >   if (WalRcvStreaming())
> >   XLogShutdownWalRcv();
>
> 2. create a primary with archiving enabled.
>
> 3. create a standby that recovers from the primary's archive and has
>   an unreachable primary_conninfo.
>
> 4. start the primary.
>
> 5. switch WAL on the primary.
>
> 6. Kaboom.
>
> This is because WaitForWALToBecomeAvailable() doesn't call
> XLogShutdownWalRcv() when the walreceiver has already stopped before we
> reach the WalRcvStreaming() call cited above. But we need to set
> InstallXLogFileSegmentActive to false even in that case, since no one
> other than the startup process does that.
>
> Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might
> need to correct the dependencies between the flag and walreceiver
> state, but it is not mandatory because XLogShutdownWalRcv() is designed
> so that it can be called even after the walreceiver has stopped.  I don't
> clearly remember why we did that at the time, but the
> recovery check runs successfully with this.
>
> This code was introduced at PG12.
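
The failure sequence described above can be sketched as a toy model. This
is only an illustration under stated assumptions, not PostgreSQL source:
the ToyState struct and every toy_* function are hypothetical stand-ins,
and the only behavior modeled is what the report states (the flag is
cleared by the startup process alone, and only when the receiver is still
streaming at the time of the check).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant shared state (not the real layout). */
typedef struct ToyState
{
    bool walrcv_streaming;       /* models what WalRcvStreaming() reports */
    bool install_segment_active; /* models InstallXLogFileSegmentActive */
} ToyState;

/* Models XLogShutdownWalRcv(): stop the receiver and clear the flag. */
void
toy_shutdown_walrcv(ToyState *s)
{
    assert(s != NULL);
    s->walrcv_streaming = false;
    s->install_segment_active = false;
}

/* Models the walreceiver stopping by itself; it does not touch the flag. */
void
toy_walrcv_exits_by_itself(ToyState *s)
{
    s->walrcv_streaming = false;
}

/* Models the conditional call quoted in step 1 of the report. */
void
toy_switch_to_archive_conditional(ToyState *s)
{
    if (s->walrcv_streaming)
        toy_shutdown_walrcv(s);
}

/* Models the assertion hit in the backtrace: true means it would pass. */
bool
toy_archive_read_assertion_holds(const ToyState *s)
{
    return !s->install_segment_active;
}

/* Runs one scenario and reports whether the modeled assertion holds. */
bool
toy_scenario(bool receiver_exits_first)
{
    ToyState s = { true, true };

    if (receiver_exits_first)
        toy_walrcv_exits_by_itself(&s);
    toy_switch_to_archive_conditional(&s);
    return toy_archive_read_assertion_holds(&s);
}
```

In this model, toy_scenario(false) keeps the assertion satisfied, while
toy_scenario(true) leaves the flag set, which corresponds to the reported
failure when the receiver stops before the WalRcvStreaming() check.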

I think it is a duplicate of [1]. I have tested the above use-case
with the patch at [1] and it fixes the issue.

[1] https://www.postgresql.org/message-id/CALj2ACXPn_xePphnh88qmoQqqW%2BE2KEOdxGL%2BD-o9o7_XNGkkw%40mail.gmail.com

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Possible crash on standby

2022-09-09 Thread Nathan Bossart
On Fri, Sep 09, 2022 at 05:29:49PM +0900, Kyotaro Horiguchi wrote:
> This is because WaitForWALToBecomeAvailable() doesn't call
> XLogShutdownWalRcv() when the walreceiver has already stopped before we
> reach the WalRcvStreaming() call cited above. But we need to set
> InstallXLogFileSegmentActive to false even in that case, since no one
> other than the startup process does that.

Nice find.

> Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might
> need to correct the dependencies between the flag and walreceiver
> state, but it is not mandatory because XLogShutdownWalRcv() is designed
> so that it can be called even after the walreceiver has stopped.  I don't
> clearly remember why we did that at the time, but the
> recovery check runs successfully with this.

I suppose the alternative would be to set InstallXLogFileSegmentActive to
false in an 'else' block, but that doesn't seem necessary if
XLogShutdownWalRcv() is safe to call unconditionally.  So, unless there is
a bigger problem that I'm not seeing, +1 for your patch.
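
To make the comparison concrete, here is a rough sketch of the three
variants side by side. It is a toy model and not the actual patch:
ToyRecovery, ToyPolicy and the toy_* functions are made-up names, and the
model assumes only what is stated in this thread (the shutdown routine
clears the flag and is safe to call when the receiver is already stopped).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical variants of the code path under discussion. */
typedef enum ToyPolicy
{
    TOY_CONDITIONAL,   /* current behavior: shut down only if streaming */
    TOY_UNCONDITIONAL, /* proposed: always call the shutdown routine */
    TOY_ELSE_BRANCH    /* alternative: clear the flag in an 'else' block */
} ToyPolicy;

/* Hypothetical stand-in for the relevant state (not the real layout). */
typedef struct ToyRecovery
{
    bool walrcv_streaming;       /* models what WalRcvStreaming() reports */
    bool install_segment_active; /* models InstallXLogFileSegmentActive */
} ToyRecovery;

/* Models XLogShutdownWalRcv(); harmless if the receiver already stopped. */
void
toy_shutdown(ToyRecovery *r)
{
    assert(r != NULL);
    r->walrcv_streaming = false;
    r->install_segment_active = false;
}

/* Models the step taken just before reading WAL from the archive. */
void
toy_before_archive_read(ToyRecovery *r, ToyPolicy policy)
{
    switch (policy)
    {
        case TOY_CONDITIONAL:
            if (r->walrcv_streaming)
                toy_shutdown(r);
            break;
        case TOY_UNCONDITIONAL:
            toy_shutdown(r);
            break;
        case TOY_ELSE_BRANCH:
            if (r->walrcv_streaming)
                toy_shutdown(r);
            else
                r->install_segment_active = false;
            break;
    }
}

/* Returns true if the flag ends up cleared under the given policy. */
bool
toy_flag_cleared(ToyPolicy policy, bool receiver_already_stopped)
{
    ToyRecovery r = { !receiver_already_stopped, true };

    toy_before_archive_read(&r, policy);
    return !r.install_segment_active;
}
```

In this model only TOY_CONDITIONAL with an already-stopped receiver leaves
the flag set; the unconditional call and the 'else' variant both clear it,
and the unconditional call does so with less code.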

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com