On Mon, Feb 14, 2022 at 1:44 PM Kyotaro Horiguchi <horikyota....@gmail.com> wrote:
>
> At Fri, 11 Feb 2022 22:25:49 +0530, Bharath Rupireddy
> <bharath.rupireddyforpostg...@gmail.com> wrote in
> > > I don't think
> > > > just making InstallXLogFileSegmentActive false is enough. By looking
> > > > at the comment [1], it doesn't make sense to move ahead for restoring
> > > > from the archive location without the WAL receiver fully stopped.
> > > > IMO, the real fix is to just remove WalRcvStreaming() and call
> > > > XLogShutdownWalRcv() unconditionally. Anyways, we have the
> > > > Assert(!WalRcvStreaming()); down below. I don't think it will create
> > > > any problem.
> > >
> > > If WalRcvStreaming() is returning false that means walreceiver is
> > > already stopped so we don't need to shutdown it externally. I think
> > > like we are setting this flag outside start streaming we can reset it
> > > also outside XLogShutdownWalRcv. Or I am fine even if we call
> > > XLogShutdownWalRcv() because if walreceiver is stopped it will just
> > > reset the flag we want it to reset and it will do nothing else.
> >
> > As I said, I'm okay with just calling XLogShutdownWalRcv()
> > unconditionally as it just returns in case walreceiver has already
> > stopped and updates the InstallXLogFileSegmentActive flag to false.
> >
> > Let's also hear what other hackers have to say about this.
>
> Firstly, good catch:) And the direction seems right.
>
> It seems like an overlook of cc2c7d65fc. We cannot install new wal
> segments only while we're in archive recovery. Conversely, we must
> turn off it when entering archive recovery (not exiting streaming
> recovery). So, *I* feel like to do that at the beginning of
> XLOG_FROM_ARCHIVE/PG_WAL rather than the end of XLOG_FROM_STREAM.
>
> (And I would like to remove XLogShutDownWalRcv() and turn off the flag
> in StartupXLOG explicitly, but it would be overdone.)
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -12800,6 +12800,16 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>                          */
>                         Assert(!WalRcvStreaming());
>
> +                       /*
> +                        * WAL segment installation conflicts with archive
> +                        * recovery. Make sure it is turned off. XLogShutdownWalRcv()
> +                        * does that but it is not done when the process has voluntary
> +                        * exited for example for replication timeout.
> +                        */
> +                       LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +                       XLogCtl->InstallXLogFileSegmentActive = false;
> +                       LWLockRelease(ControlFileLock);
> +
>                         /* Close any old file we might have open. */
>                         if (readFile >= 0)
Today I encountered the assertion failure [2] twice while working on another patch [1]. The pattern seems to be that the walreceiver got killed or crashed and set its state to WALRCV_STOPPING or WALRCV_STOPPED by the time the WAL state machine moved to the archive source, and hence the following XLogShutdownWalRcv() code will not get hit:

                    /*
                     * Before we leave XLOG_FROM_STREAM state, make sure that
                     * walreceiver is not active, so that it won't overwrite
                     * WAL that we restore from archive.
                     */
                    if (WalRcvStreaming())
                        XLogShutdownWalRcv();

I agree with Kyotaro-san that InstallXLogFileSegmentActive should be reset before entering archive recovery. Hence I tweaked the code introduced by the following commit a bit; the resulting v1 patch is attached herewith. Please review it.

commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
Author: Noah Misch <n...@leadboat.com>
Date:   Mon Jun 28 18:34:56 2021 -0700

    Skip WAL recycling and preallocation during archive recovery.

[1] https://www.postgresql.org/message-id/CALj2ACUYz1z6QPduGn5gguCkfd-ko44j4hKcOMtp6fzv9xEWgw%40mail.gmail.com

[2]
2022-08-11 05:10:29.051 UTC [1487086] FATAL:  terminating walreceiver due to timeout
cp: cannot stat '/home/ubuntu/archived_wal/00000002.history': No such file or directory
2022-08-11 05:10:29.054 UTC [1487075] LOG:  switched WAL source from stream to archive after failure
2022-08-11 05:10:29.067 UTC [1487075] LOG:  restored log file "000000010000000000000034" from archive
TRAP: FailedAssertion("!IsInstallXLogFileSegmentActive()", File: "xlogrecovery.c", Line: 4149, PID: 1487075)
postgres: startup waiting for 000000010000000000000034(ExceptionalCondition+0xd6)[0x559d20e6deb1]
postgres: startup waiting for 000000010000000000000034(+0x1e4e14)[0x559d20813e14]
postgres: startup waiting for 000000010000000000000034(+0x1e50c0)[0x559d208140c0]
postgres: startup waiting for 000000010000000000000034(+0x1e4354)[0x559d20813354]
postgres: startup waiting for 000000010000000000000034(+0x1e3946)[0x559d20812946]
postgres: startup waiting for 000000010000000000000034(+0x1dbc58)[0x559d2080ac58]
postgres: startup waiting for 000000010000000000000034(+0x1db3ce)[0x559d2080a3ce]
postgres: startup waiting for 000000010000000000000034(XLogReadAhead+0x3d)[0x559d2080aa23]
postgres: startup waiting for 000000010000000000000034(+0x1d9713)[0x559d20808713]
postgres: startup waiting for 000000010000000000000034(+0x1d909c)[0x559d2080809c]
postgres: startup waiting for 000000010000000000000034(XLogPrefetcherReadRecord+0xf9)[0x559d208092a2]
postgres: startup waiting for 000000010000000000000034(+0x1e3427)[0x559d20812427]
postgres: startup waiting for 000000010000000000000034(PerformWalRecovery+0x465)[0x559d2080fd0f]
postgres: startup waiting for 000000010000000000000034(StartupXLOG+0x987)[0x559d207fc732]
postgres: startup waiting for 000000010000000000000034(StartupProcessMain+0xf8)[0x559d20bc3990]
postgres: startup waiting for 000000010000000000000034(AuxiliaryProcessMain+0x1ae)[0x559d20bb62cf]
postgres: startup waiting for 000000010000000000000034(+0x593020)[0x559d20bc2020]
postgres: startup waiting for 000000010000000000000034(+0x591ebb)[0x559d20bc0ebb]
postgres: startup waiting for 000000010000000000000034(+0x590907)[0x559d20bbf907]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe518b26520]
/lib/x86_64-linux-gnu/libc.so.6(__select+0xbd)[0x7fe518bff74d]
postgres: startup waiting for 000000010000000000000034(+0x58e1bd)[0x559d20bbd1bd]
postgres: startup waiting for 000000010000000000000034(PostmasterMain+0x14cc)[0x559d20bbcaa2]
postgres: startup waiting for 000000010000000000000034(+0x44a752)[0x559d20a79752]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe518b0dd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe518b0de40]
postgres: startup waiting for 000000010000000000000034(_start+0x25)[0x559d206fe5b5]
2022-08-11 05:10:29.161 UTC [1486413] LOG:  startup process (PID 1487075) was terminated by signal 6: Aborted
2022-08-11 05:10:29.161 UTC [1486413] LOG:  terminating any other active server processes
2022-08-11 05:10:29.169 UTC [1486413] LOG:  shutting down due to startup process failure
2022-08-11 05:10:29.181 UTC [1486413] LOG:  database system is shut down

2022-08-11 05:34:54.234 UTC [1509079] LOG:  switched WAL source from stream to archive after failure
2022-08-11 05:34:54.234 UTC [1509079] LOG:  current WAL source is archive and last WAL source is stream, standby mode 1
2022-08-11 05:34:54.248 UTC [1509079] LOG:  restored log file "000000010000000000000003" from archive
TRAP: FailedAssertion("!IsInstallXLogFileSegmentActive()", File: "xlogrecovery.c", Line: 4155, PID: 1509079)
postgres: startup waiting for 000000010000000000000003(ExceptionalCondition+0xd6)[0x55e9cb1cbf2a]
postgres: startup waiting for 000000010000000000000003(+0x1e4e8d)[0x55e9cab71e8d]
postgres: startup waiting for 000000010000000000000003(+0x1e5139)[0x55e9cab72139]
postgres: startup waiting for 000000010000000000000003(+0x1e43cd)[0x55e9cab713cd]
postgres: startup waiting for 000000010000000000000003(+0x1e3946)[0x55e9cab70946]
postgres: startup waiting for 000000010000000000000003(+0x1dbc58)[0x55e9cab68c58]
postgres: startup waiting for 000000010000000000000003(+0x1db0b7)[0x55e9cab680b7]
postgres: startup waiting for 000000010000000000000003(XLogReadAhead+0x3d)[0x55e9cab68a23]
postgres: startup waiting for 000000010000000000000003(+0x1d9713)[0x55e9cab66713]
postgres: startup waiting for 000000010000000000000003(+0x1d909c)[0x55e9cab6609c]
postgres: startup waiting for 000000010000000000000003(XLogPrefetcherReadRecord+0xf9)[0x55e9cab672a2]
postgres: startup waiting for 000000010000000000000003(+0x1e3427)[0x55e9cab70427]
postgres: startup waiting for 000000010000000000000003(+0x1e4911)[0x55e9cab71911]
postgres: startup waiting for 000000010000000000000003(InitWalRecovery+0xaf4)[0x55e9cab6bc4e]
postgres: startup waiting for 000000010000000000000003(StartupXLOG+0x4af)[0x55e9cab5a25a]
postgres: startup waiting for 000000010000000000000003(StartupProcessMain+0xf8)[0x55e9caf21a09]
postgres: startup waiting for 000000010000000000000003(AuxiliaryProcessMain+0x1ae)[0x55e9caf14348]
postgres: startup waiting for 000000010000000000000003(+0x593099)[0x55e9caf20099]
postgres: startup waiting for 000000010000000000000003(PostmasterMain+0x1476)[0x55e9caf1aac5]
postgres: startup waiting for 000000010000000000000003(+0x44a7cb)[0x55e9cadd77cb]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f754c287d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f754c287e40]
postgres: startup waiting for 000000010000000000000003(_start+0x25)[0x55e9caa5c5b5]
2022-08-11 05:34:54.336 UTC [1509076] LOG:  startup process (PID 1509079) was terminated by signal 6: Aborted
2022-08-11 05:34:54.336 UTC [1509076] LOG:  aborting startup due to startup process failure
2022-08-11 05:34:54.337 UTC [1509076] LOG:  database system is shut down

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
v1-0001-Avoid-race-condition-in-resetting-XLogCtl-Install.patch
Description: Binary data
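Since the attached v1 patch appears in the archive only as binary data, here is a minimal, illustrative sketch of the simplest variant mentioned upthread: dropping the WalRcvStreaming() check so that XLogShutdownWalRcv() always runs when leaving XLOG_FROM_STREAM. This is not the attached patch (which, per the message above, instead resets the flag when entering the archive source, following Kyotaro-san's suggestion); the context lines are taken from the snippet quoted earlier, and the hunk line numbers are omitted.

--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ ... @@ WaitForWALToBecomeAvailable
                     /*
                      * Before we leave XLOG_FROM_STREAM state, make sure that
                      * walreceiver is not active, so that it won't overwrite
                      * WAL that we restore from archive.
+                     *
+                     * Call XLogShutdownWalRcv() even if the walreceiver has
+                     * already exited on its own (e.g. after a replication
+                     * timeout): the walreceiver shutdown is then a no-op, but
+                     * the call still resets InstallXLogFileSegmentActive,
+                     * which is what the Assert in the archive path checks.
                      */
-                    if (WalRcvStreaming())
-                        XLogShutdownWalRcv();
+                    XLogShutdownWalRcv();

The rationale for the unconditional call is quoted upthread: when the walreceiver is already stopped, XLogShutdownWalRcv() "will just reset the flag we want it to reset and it will do nothing else".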