Re: Strange failure on mamba

2022-11-30 Thread Tom Lane
Andres Freund writes: > On 2022-11-30 18:33:06 -0500, Tom Lane wrote: >> Even if somebody comes up with a rewrite to avoid doing interesting stuff in >> the postmaster's signal handlers, we surely wouldn't risk back-patching it. > Would that actually fix anything, given netbsd's brokenness? If

Re: Strange failure on mamba

2022-11-30 Thread Andres Freund
Hi, On 2022-11-30 18:33:06 -0500, Tom Lane wrote: > Also, I dug into my stuck processes some more, and I have to take > back the claim that this is happening later than postmaster startup. > All the stuck children are ones that either are launched on request > from the startup process, or are

Re: Strange failure on mamba

2022-11-30 Thread Tom Lane
Andres Freund writes: > On 2022-11-30 00:55:42 -0500, Tom Lane wrote: >> Googling LD_BIND_NOW suggests that that's a Linux thing; do you know that >> it should have an effect on NetBSD? > I'm not at all sure it does, but I did see it listed in > https://man.netbsd.org/ld.elf_so.1 >

Re: Strange failure on mamba

2022-11-29 Thread Andres Freund
Hi, On 2022-11-29 22:31:50 -0800, Andres Freund wrote: > On 2022-11-30 00:55:42 -0500, Tom Lane wrote: > > Andres Freund writes: > > > What libraries is postgres linked against? I don't know whether -z now > > > only > > > affects the "top-level" dependencies of postgres, or also the > > >

Re: Strange failure on mamba

2022-11-29 Thread Andres Freund
Hi, On 2022-11-30 00:55:42 -0500, Tom Lane wrote: > Andres Freund writes: > > What libraries is postgres linked against? I don't know whether -z now only > > affects the "top-level" dependencies of postgres, or also the dependencies > > of > > shared libraries that haven't been built with -z

Re: Strange failure on mamba

2022-11-29 Thread Tom Lane
Andres Freund writes: > On 2022-11-29 20:44:34 -0500, Tom Lane wrote: >> It's also strange that we're apparently running with signals enabled >> wherever it is that rtld_bind is getting called from. Could it be that >> sigaction is failing to install the requested signal mask, so that one >>

Re: Strange failure on mamba

2022-11-29 Thread Andres Freund
Hi, On 2022-11-29 20:44:34 -0500, Tom Lane wrote: > It's also strange that we're apparently running with signals enabled > wherever it is that rtld_bind is getting called from. Could it be that > sigaction is failing to install the requested signal mask, so that one > postmaster signal handler

Re: Strange failure on mamba

2022-11-29 Thread Tom Lane
Andres Freund writes: > On 2022-11-29 20:44:34 -0500, Tom Lane wrote: >> Backtrace stopped: frame did not save the PC > Do you have any idea why the stack can't be unwound further here? Is it > possibly indicative of a corrupted stack? I guess we'd need to dig into > the netbsd libc code :( I

Re: Strange failure on mamba

2022-11-29 Thread Andres Freund
Hi, On 2022-11-29 20:44:34 -0500, Tom Lane wrote: > Thanks to commit 51b5834cd I've now been able to capture some info > from mamba's last couple of failures [1][2]. Sure enough, what is > happening is that postmaster children are getting stuck in recursive > rtld symbol resolution. A couple of

Re: Strange failure on mamba

2022-11-29 Thread Thomas Munro
On Wed, Nov 30, 2022 at 3:43 PM Thomas Munro wrote: > sigaction(0, NULL, ) s/sigaction/sigprocmask/
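
For readers unfamiliar with the idiom in that correction: calling sigprocmask with a NULL new-set pointer changes nothing and only reports the current signal mask. Below is a minimal sketch of the same query, written in Python rather than C for brevity. It uses signal.pthread_sigmask (the per-thread equivalent) with an empty set, which likewise leaves the mask untouched. The helper names current_blocked_signals and is_blocked are made up for illustration and do not come from the thread or from PostgreSQL. Unix only.

```python
import signal


def current_blocked_signals():
    """Return the set of signals currently blocked in the calling thread.

    Blocking an empty set modifies nothing, so this call only reports the
    existing mask (analogous to passing a NULL set to sigprocmask in C).
    """
    return set(signal.pthread_sigmask(signal.SIG_BLOCK, []))


def is_blocked(signo):
    """Return True if signo is in the calling thread's blocked mask."""
    return signo in current_blocked_signals()


if __name__ == "__main__":
    saved = current_blocked_signals()

    # Block SIGUSR1, then confirm that the query reports it.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    print("SIGUSR1 blocked:", is_blocked(signal.SIGUSR1))

    # Restore the mask we started with.
    signal.pthread_sigmask(signal.SIG_SETMASK, saved)
```

A check like this, run from inside a signal handler or just before a suspect call, is one way to see whether the mask a handler was supposed to install is actually in effect.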

Re: Strange failure on mamba

2022-11-29 Thread Thomas Munro
On Wed, Nov 30, 2022 at 2:44 PM Tom Lane wrote: > Now, we certainly cannot think that these are occurring early in > postmaster startup. In the wake of 8acd8f869, we should expect > that there's no further need to call rtld_bind at all in the > postmaster, but seemingly that's not so. It's very

Re: Strange failure on mamba

2022-11-29 Thread Tom Lane
I wrote: > Thomas Munro writes: >> On Fri, Nov 18, 2022 at 11:08 AM Tom Lane wrote: >>> mamba has been showing intermittent failures in various replication >>> tests since day one. >> I wonder if it's a runtime variant of the other problem. We do >> load_file("libpqwalreceiver", false) before

Re: Sending SIGABRT to child processes (was Re: Strange failure on mamba)

2022-11-21 Thread Tom Lane
I wrote: > Andres Freund writes: >> I suspect that having a GUC would be a good idea. I needed something similar >> recently, debugging an occasional hang in the AIO patchset. I first tried >> something like your #define approach and it did cause a problematic flood of >> core files. > Yeah, the

Sending SIGABRT to child processes (was Re: Strange failure on mamba)

2022-11-18 Thread Tom Lane
Andres Freund writes: > On 2022-11-17 17:47:50 -0500, Tom Lane wrote: >> So I'd like to have some way to make the postmaster send SIGABRT instead >> of SIGKILL in the buildfarm environment. The lowest-tech way would be >> to drive that off some #define or other. We could scale it up to a GUC >>

Re: Strange failure on mamba

2022-11-17 Thread Andres Freund
Hi, On 2022-11-17 17:47:50 -0500, Tom Lane wrote: > Yeah, that or some other NetBSD bug could be the explanation, too. > Without a stack trace it's hard to have any confidence about it, > but I've been unable to reproduce the problem outside the buildfarm. > (Which is a familiar refrain. I

Re: Strange failure on mamba

2022-11-17 Thread Thomas Munro
On Fri, Nov 18, 2022 at 11:35 AM Thomas Munro wrote: > I wonder if it's a runtime variant of the other problem. We do > load_file("libpqwalreceiver", false) before unblocking signals but > maybe don't resolve the symbols until calling them, or something like > that... Hmm, no, I take that back.

Re: Strange failure on mamba

2022-11-17 Thread Tom Lane
Thomas Munro writes: > On Fri, Nov 18, 2022 at 11:08 AM Tom Lane wrote: >> mamba has been showing intermittent failures in various replication >> tests since day one. > I wonder if it's a runtime variant of the other problem. We do > load_file("libpqwalreceiver", false) before unblocking

Re: Strange failure on mamba

2022-11-17 Thread Thomas Munro
On Fri, Nov 18, 2022 at 11:08 AM Tom Lane wrote: > Thomas Munro writes: > > I wonder why the walreceiver didn't start in > > 008_min_recovery_point_node_3.log here: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38 > > mamba has been showing intermittent

Re: Strange failure on mamba

2022-11-17 Thread Tom Lane
Thomas Munro writes: > I wonder why the walreceiver didn't start in > 008_min_recovery_point_node_3.log here: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38 mamba has been showing intermittent failures in various replication tests since day one. My

Strange failure on mamba

2022-11-17 Thread Thomas Munro
Hi, I wonder why the walreceiver didn't start in 008_min_recovery_point_node_3.log here: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38 There was the case of commit 8acd8f86, but that involved a deadlocked postmaster whereas this one still handled a