Hi, On 2023-02-01 16:21:16 +1300, Thomas Munro wrote: > My database off assertion failures has four like that, all 15 and master: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02 > > It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM > handler which is allowed to call that while in_restore_command is > true.
Ugh, no wonder we're getting crashes. This whole business seems bogus as hell. RestoreArchivedFile(): ... /* * Check signals before restore command and reset afterwards. */ PreRestoreCommand(); /* * Copy xlog from archival storage to XLOGDIR */ ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname); PostRestoreCommand(); /* SIGTERM: set flag to abort redo and exit */ static void StartupProcShutdownHandler(SIGNAL_ARGS) { int save_errno = errno; if (in_restore_command) proc_exit(1); else shutdown_requested = true; WakeupRecovery(); errno = save_errno; } Where PreRestoreCommand()/PostRestoreCommand() set in_restore_command. There's *a lot* of stuff happening inside shell_restore() that's not compatible with doing proc_exit() inside a signal handler. We're allocating memory! Interact with stdout. There's also the fact that system() isn't signal safe, but that's a much less likely problematic issue. This appears to have gotten worse over a sequence of commits. The following commits each added something betwen PreRestoreCommand() and PostRestoreCommand(). commit 1b06d7bac901e5fd20bba597188bae2882bf954b Author: Fujii Masao <fu...@postgresql.org> Date: 2021-11-22 10:28:21 +0900 Report wait events for local shell commands like archive_command. added pgstat_report_wait_start/end. Unlikely to cause big issues, but not good. commit 7fed801135bae14d63b11ee4a10f6083767046d8 Author: Tom Lane <t...@sss.pgh.pa.us> Date: 2022-08-29 13:55:38 -0400 Clean up inconsistent use of fflush(). Made it a bit worse by adding an fflush(). That certainly seems like it could cause hangs. commit 9a740f81eb02e04179d78f3df2ce671276c27b07 Author: Michael Paquier <mich...@paquier.xyz> Date: 2023-01-16 16:31:43 +0900 Refactor code in charge of running shell-based recovery commands which completely broke the mechanism. We suddenly run the entirety of shell_restore(), which does pallocs etc to build the string passed to system, and raises errors, all within a section in which a signal handler can invoke proc_exit(). That's just completely broken. Sorry, but particularly in this area, you got to be a heck of a lot more careful. I don't see a choice but to revert the recent changes. They need a fairly large rewrite. Greetings, Andres Freund