Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Petr Jelinek
On 03/10/17 22:01, Thomas Munro wrote: > On Wed, Oct 4, 2017 at 7:32 AM, Petr Jelinek > wrote: >> On 03/10/17 19:44, Tom Lane wrote: >>> I wrote: > So that's trouble waiting to happen, for sure. At the very least we > need to do a single fetch of

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Thomas Munro
On Wed, Oct 4, 2017 at 7:32 AM, Petr Jelinek wrote: > On 03/10/17 19:44, Tom Lane wrote: >> I wrote: So that's trouble waiting to happen, for sure. At the very least we need to do a single fetch of WalRcv->latch, not two. I wonder whether even that

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Petr Jelinek
On 03/10/17 19:44, Tom Lane wrote: > I wrote: >>> So that's trouble waiting to happen, for sure. At the very least we >>> need to do a single fetch of WalRcv->latch, not two. I wonder whether >>> even that is sufficient, though: this coding requires an atomic fetch of >>> a pointer, which is

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Tom Lane
I wrote: >> So that's trouble waiting to happen, for sure. At the very least we >> need to do a single fetch of WalRcv->latch, not two. I wonder whether >> even that is sufficient, though: this coding requires an atomic fetch of >> a pointer, which is something we generally don't assume to be

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Tom Lane
I wrote: > So that's trouble waiting to happen, for sure. At the very least we > need to do a single fetch of WalRcv->latch, not two. I wonder whether > even that is sufficient, though: this coding requires an atomic fetch of > a pointer, which is something we generally don't assume to be safe.

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Tom Lane
Alvaro Herrera writes: > Tom Lane wrote: >> Alvaro Herrera writes: >>> I think the latch is only used locally. Seems that it was only put in >>> shmem to avoid a separate variable ... >> Hm, I'm strongly tempted to move it to a separate static

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Alvaro Herrera
Tom Lane wrote: > Alvaro Herrera writes: > > Tom Lane wrote: > >> I don't especially like the Asserts inside spinlocks, either. > > > I didn't change these. It doesn't look to me that these asserts are > > worth very much as production code. > > OK. If we ever see

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Tom Lane
Alvaro Herrera writes: > Fixed the pstrdup problem by replacing with strlcpy() to stack-allocated > variables (rather than palloc + memcpy as proposed in Michael's patch). +1 > Tom Lane wrote: >> I don't especially like the Asserts inside spinlocks, either. > I didn't

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Alvaro Herrera
Fixed the pstrdup problem by replacing with strlcpy() to stack-allocated variables (rather than palloc + memcpy as proposed in Michael's patch). About the other problems: Tom Lane wrote: > I took a quick look through walreceiver.c, and there are a couple of > obvious problems of the same ilk in

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Alvaro Herrera
Andreas Seltenreich wrote: > Michael Paquier writes: > > > I am attaching a patch that addresses the bugs for the spin lock sections. > > [2. text/x-diff; walreceiver-spin-calls.patch] > > I haven't seen a spinlock PANIC since testing with the patch applied. > They occured five times with the

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Andreas Seltenreich
Michael Paquier writes: > I am attaching a patch that addresses the bugs for the spin lock sections. > [2. text/x-diff; walreceiver-spin-calls.patch] I haven't seen a spinlock PANIC since testing with the patch applied. They occured five times with the same amount of testing done earlier.

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Michael Paquier
On Tue, Oct 3, 2017 at 6:54 AM, Tom Lane wrote: > I wrote: >> If this is the only problem then I'd agree we should stick to a spinlock >> (I assume the strings in question can't be very long). I was thinking >> more about what to do if we find other violations that are harder

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Tom Lane
I wrote: > If this is the only problem then I'd agree we should stick to a spinlock > (I assume the strings in question can't be very long). I was thinking > more about what to do if we find other violations that are harder to fix. I took a quick look through walreceiver.c, and there are a

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Tom Lane
Andres Freund writes: > On 2017-10-02 17:30:25 -0400, Tom Lane wrote: >> Or replace the spinlock with an LWLock? > That'd probably be a good idea, but I'm loathe to do so in the back > branches. Not at this callsite, but some others, there's some potential > for contention.

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Andres Freund
On 2017-10-02 17:30:25 -0400, Tom Lane wrote: > Andres Freund writes: > > Yes, that'd be a bad idea. It's not great to have memcpys in a critical > > section, but it's way better than pallocs. So we need to use some local > > buffers that this get copied to. > > Or replace

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Tom Lane
Andres Freund writes: > On 2017-10-02 22:56:49 +0200, Andreas Seltenreich wrote: >> low-memory testing with REL_10_STABLE at 1f19550a87 produced the >> following PANIC: >> stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397 > Ugh. Egad. > Yes, that'd be

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Andres Freund
On 2017-10-02 22:56:49 +0200, Andreas Seltenreich wrote: > Hi, > > low-memory testing with REL_10_STABLE at 1f19550a87 produced the > following PANIC: > > stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397 Ugh. > I was about to wrap the pstrdup()s with a PG_TRY block,

Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Thomas Munro
On Tue, Oct 3, 2017 at 9:56 AM, Andreas Seltenreich wrote: > low-memory testing with REL_10_STABLE at 1f19550a87 produced the > following PANIC: > > stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397 > > I was about to wrap the pstrdup()s with a PG_TRY

[HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-02 Thread Andreas Seltenreich
Hi, low-memory testing with REL_10_STABLE at 1f19550a87 produced the following PANIC: stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397 I was about to wrap the pstrdup()s with a PG_TRY block, but I can't find a spinlock being released in a PG_CATCH block anywhere, so