Re: subscriptionCheck failures on nightjar

2019-09-30 Thread Tom Lane
Andrew Dunstan writes: > On 9/20/19 6:17 PM, Tom Lane wrote: >> Dromedary is running the last release of macOS that supports 32-bit >> hardware, so if we decide to kick that to the curb, I'd either shut >> down the box or put some newer Linux or BSD variant on it. > Well, nightjar is on FBSD 9.0

Re: subscriptionCheck failures on nightjar

2019-09-23 Thread Andrew Dunstan
On 9/20/19 6:17 PM, Tom Lane wrote: > Alvaro Herrera writes: >> Uh .. I didn't think it was possible that we would build the same >> snapshot file more than once. Isn't that a waste of time anyway? Maybe >> we can fix the symptom by just not doing that in the first place? >> I don't have a str

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Michael Paquier
On Fri, Sep 20, 2019 at 05:30:48PM +0200, Tomas Vondra wrote: > But even with that change you haven't managed to reproduce the issue, > right? Or am I misunderstanding? No, I was not able to see it on my laptop running Debian. -- Michael signature.asc Description: PGP signature

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Alvaro Herrera writes: > Uh .. I didn't think it was possible that we would build the same > snapshot file more than once. Isn't that a waste of time anyway? Maybe > we can fix the symptom by just not doing that in the first place? > I don't have a strategy to do that, but seems worth considerin

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi, On September 20, 2019 3:06:20 PM PDT, Alvaro Herrera wrote: >On 2019-Sep-20, Tom Lane wrote: > >> Actually, what I did was as attached [1], and I am getting traces >like >> [2]. The problem seems to occur only when there are two or three >> processes concurrently creating the same snapshot

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Alvaro Herrera
On 2019-Sep-20, Tom Lane wrote: > Actually, what I did was as attached [1], and I am getting traces like > [2]. The problem seems to occur only when there are two or three > processes concurrently creating the same snapshot file. It's not > obvious from the debug trace, but the snapshot file *do

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi, On 2019-09-20 17:49:27 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2019-09-20 16:25:21 -0400, Tom Lane wrote: > >> I recreated my freebsd-9-under-qemu setup and I can still reproduce > >> the problem, though not with high reliability (order of 1 time in 10). > >> Anything particular

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Sigh, forgot about attaching the attachments ... regards, tom lane diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index 0bd1d0f..53fd33c 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/repli

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Andres Freund writes: > On 2019-09-20 16:25:21 -0400, Tom Lane wrote: >> I recreated my freebsd-9-under-qemu setup and I can still reproduce >> the problem, though not with high reliability (order of 1 time in 10). >> Anything particular you want logged? > A DEBUG2 log would help a fair bit, beca

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi, On 2019-09-20 16:25:21 -0400, Tom Lane wrote: > Andres Freund writes: > > Since now a number of people (I tried as well), failed to reproduce this > > locally, I propose that we increase the log-level during this test on > > master. And perhaps expand the set of debugging information. With th

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tom Lane
Andres Freund writes: > Since now a number of people (I tried as well), failed to reproduce this > locally, I propose that we increase the log-level during this test on > master. And perhaps expand the set of debugging information. With the > hope that the additional information on the cases encou

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Andres Freund
Hi, On 2019-09-19 17:20:15 +0530, Kuntal Ghosh wrote: > It seems there is a pattern how the error is occurring in different > systems. Following are the relevant log snippets: > > nightjar: > sub3 LOG: received replication command: CREATE_REPLICATION_SLOT > "sub3_16414_sync_16394" TEMPORARY LOGI

Re: subscriptionCheck failures on nightjar

2019-09-20 Thread Tomas Vondra
On Thu, Sep 19, 2019 at 01:23:05PM +0900, Michael Paquier wrote: On Wed, Sep 18, 2019 at 11:58:08PM +0200, Tomas Vondra wrote: I kinda suspect it might be just a coincidence that it fails during that particular test. What likely plays a role here is a checkpoint timing (AFAICS that's the thing r

Re: subscriptionCheck failures on nightjar

2019-09-19 Thread Michael Paquier
On Thu, Sep 19, 2019 at 05:20:15PM +0530, Kuntal Ghosh wrote: > While subscription 3 is created, it eventually reaches to a consistent > snapshot point and prints the WAL location corresponding to it. It > seems sub1/sub2 immediately fails to serialize the snapshot to the > .snap file having the sa

Re: subscriptionCheck failures on nightjar

2019-09-19 Thread Kuntal Ghosh
Hello hackers, It seems there is a pattern how the error is occurring in different systems. Following are the relevant log snippets: nightjar: sub3 LOG: received replication command: CREATE_REPLICATION_SLOT "sub3_16414_sync_16394" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT sub3 LOG: logical decodi

Re: subscriptionCheck failures on nightjar

2019-09-18 Thread Michael Paquier
On Wed, Sep 18, 2019 at 11:58:08PM +0200, Tomas Vondra wrote: > I kinda suspect it might be just a coincidence that it fails during that > particular test. What likely plays a role here is a checkpoint timing > (AFAICS that's the thing removing the file). On most systems the tests > complete befor

Re: subscriptionCheck failures on nightjar

2019-09-18 Thread Tomas Vondra
On Wed, Sep 18, 2019 at 04:25:14PM +0530, Kuntal Ghosh wrote: Hello Michael, On Wed, Sep 18, 2019 at 6:28 AM Michael Paquier wrote: On my side, I have let this thing run for a couple of hours with a patched version to include a sleep between the rename and the sync but I could not reproduce i

Re: subscriptionCheck failures on nightjar

2019-09-18 Thread Kuntal Ghosh
Hello Michael, On Wed, Sep 18, 2019 at 6:28 AM Michael Paquier wrote: > > On my side, I have let this thing run for a couple of hours with a > patched version to include a sleep between the rename and the sync but > I could not reproduce it either: > #!/bin/bash > attempt=0 > while true; do >

Re: subscriptionCheck failures on nightjar

2019-09-17 Thread Michael Paquier
On Tue, Sep 17, 2019 at 09:45:10PM +0200, Tomas Vondra wrote: > FWIW I agree with Andres that there probably is an actual bug. The file > should not just disappear like this, it's clearly unexpected so the > PANIC does not seem entirely inappropriate. Agreed. > I've tried reproducing the issue o

Re: subscriptionCheck failures on nightjar

2019-09-17 Thread Tomas Vondra
On Tue, Sep 17, 2019 at 12:39:33PM -0400, Tom Lane wrote: Robert Haas writes: On Mon, Aug 26, 2019 at 9:29 AM Tomas Vondra wrote: This is one of the remaining open items, and we don't seem to be moving forward with it :-( Why exactly is this an open item, anyway? The reason it's still he

Re: subscriptionCheck failures on nightjar

2019-09-17 Thread Tom Lane
Robert Haas writes: > On Mon, Aug 26, 2019 at 9:29 AM Tomas Vondra > wrote: >> This is one of the remaining open items, and we don't seem to be moving >> forward with it :-( > Why exactly is this an open item, anyway? The reason it's still here is that Andres expressed a concern that there migh

Re: subscriptionCheck failures on nightjar

2019-09-17 Thread Robert Haas
On Mon, Aug 26, 2019 at 9:29 AM Tomas Vondra wrote: > This is one of the remaining open items, and we don't seem to be moving > forward with it :-( Why exactly is this an open item, anyway? I don't find any discussion on the thread which makes a clear argument that this problem originated with v

Re: subscriptionCheck failures on nightjar

2019-08-26 Thread Tomas Vondra
On Mon, Aug 26, 2019 at 11:01:20AM -0400, Tom Lane wrote: Tomas Vondra writes: I'm willing to take a stab at it, but to do that I need a way to reproduce it. Tom, you mentioned you've managed to reproduce it in a qemu instance, but that it took some fiddling with qemu parmeters or something. Ca

Re: subscriptionCheck failures on nightjar

2019-08-26 Thread Tom Lane
Tomas Vondra writes: > I'm willing to take a stab at it, but to do that I need a way to > reproduce it. Tom, you mentioned you've managed to reproduce it in a > qemu instance, but that it took some fiddling with qemu parmeters or > something. Can you share what exactly was necessary? I don't reca

Re: subscriptionCheck failures on nightjar

2019-08-26 Thread Tomas Vondra
On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote: On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote: I'm not yet sure that that's actually something that's supposed to happen, I got to spend some time analysing how this actually happens. Normally the contents of the sl

Re: subscriptionCheck failures on nightjar

2019-08-13 Thread Michael Paquier
On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote: > I'm not yet sure that that's actually something that's supposed to > happen, I got to spend some time analysing how this actually > happens. Normally the contents of the slot should actually prevent it > from being removed (as they're

Re: subscriptionCheck failures on nightjar

2019-06-28 Thread Tom Lane
I wrote: > My animal dromedary just reproduced this failure, which we've previously > only seen on nightjar. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-06-26%2023%3A57%3A45 Twice: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-06-28%2006

Re: subscriptionCheck failures on nightjar

2019-06-26 Thread Tom Lane
Andres Freund writes: > On 2019-02-14 09:52:33 +1300, Thomas Munro wrote: >> Just to make sure I understand: it's OK for the file not to be there >> when we try to fsync it by name, because a concurrent checkpoint can >> remove it, having determined that we don't need it anymore? In other >> word

Re: subscriptionCheck failures on nightjar

2019-03-10 Thread Andrew Dunstan
On 2/13/19 1:12 PM, Andres Freund wrote: > Hi, > > On 2019-02-13 12:59:19 -0500, Tom Lane wrote: >> Andres Freund writes: >>> On 2019-02-13 12:37:35 -0500, Tom Lane wrote: Bleah. But in any case, the rename should not create a situation in which we need to fsync the file data again. >

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Andres Freund
Hi, On 2019-02-14 09:52:33 +1300, Thomas Munro wrote: > On Thu, Feb 14, 2019 at 8:11 AM Tom Lane wrote: > > Andres Freund writes: > > > I was kinda pondering just open coding it. I am not yet convinced that > > > my idea of just using an open FD isn't the least bad approach for the > > > issue

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Tom Lane
Thomas Munro writes: > I found 3 examples of this failing with an ERROR (though not turning > the BF red, so nobody noticed) before the PANIC patch went in: Yeah, I suspected that had happened before with less-obvious consequences. Now that we know where the problem is, you could probably make it

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Thomas Munro
On Thu, Feb 14, 2019 at 8:11 AM Tom Lane wrote: > Andres Freund writes: > > I was kinda pondering just open coding it. I am not yet convinced that > > my idea of just using an open FD isn't the least bad approach for the > > issue at hand. What precisely is the NFS issue you're concerned about?

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Tom Lane
Andres Freund writes: > I was kinda pondering just open coding it. I am not yet convinced that > my idea of just using an open FD isn't the least bad approach for the > issue at hand. What precisely is the NFS issue you're concerned about? I'm not sure that fsync-on-FD after the rename will wor

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Andres Freund
Hi, On 2019-02-13 13:24:03 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2019-02-13 12:59:19 -0500, Tom Lane wrote: > >> Perhaps more to the point, the way this was coded, the PANIC applies > >> to open() failures in fsync_fname_ext() not just fsync() failures; > >> that's painting with t

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Tom Lane
Andres Freund writes: > On 2019-02-13 12:59:19 -0500, Tom Lane wrote: >> Perhaps more to the point, the way this was coded, the PANIC applies >> to open() failures in fsync_fname_ext() not just fsync() failures; >> that's painting with too broad a brush isn't it? > That indeed seems wrong. Thomas

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Andres Freund
Hi, On 2019-02-13 12:59:19 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2019-02-13 12:37:35 -0500, Tom Lane wrote: > >> Bleah. But in any case, the rename should not create a situation > >> in which we need to fsync the file data again. > > > Well, it's not super well defined which of

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Tom Lane
Andres Freund writes: > On 2019-02-13 12:37:35 -0500, Tom Lane wrote: >> Bleah. But in any case, the rename should not create a situation >> in which we need to fsync the file data again. > Well, it's not super well defined which of either you need to make the > rename durable, and it appears to

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Andres Freund
Hi, On 2019-02-13 12:37:35 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2019-02-13 11:57:32 -0500, Tom Lane wrote: > >> I've managed to reproduce this locally, and obtained this PANIC: > > > Cool. How exactly? > > Andrew told me that nightjar is actually running in a qemu VM, > so I se

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Tom Lane
Andres Freund writes: > On 2019-02-13 11:57:32 -0500, Tom Lane wrote: >> I've managed to reproduce this locally, and obtained this PANIC: > Cool. How exactly? Andrew told me that nightjar is actually running in a qemu VM, so I set up freebsd 9.0 in a qemu VM, and boom. It took a bit of fiddling

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Andres Freund
Hi, On 2019-02-13 11:57:32 -0500, Tom Lane wrote: > I've managed to reproduce this locally, and obtained this PANIC: Cool. How exactly? Nice catch. > Anyway, I think we might be able to fix this along the lines of > > CloseTransientFile(fd); > > + /* ensure snapshot file is down to stab

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Tom Lane
Thomas Munro writes: > On Mon, Feb 11, 2019 at 7:31 PM Tom Lane wrote: >> 2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file >> "pg_logical/snapshots/0-160B578.snap": No such file or directory > > They get atomically renamed into place, which seems kosher even if > snapshots

Re: subscriptionCheck failures on nightjar

2019-02-13 Thread Thomas Munro
On Mon, Feb 11, 2019 at 7:31 PM Tom Lane wrote: > 2019-02-10 23:55:58.798 EST [40728] sub1 PANIC: could not open file > "pg_logical/snapshots/0-160B578.snap": No such file or directory They get atomically renamed into place, which seems kosher even if snapshots for the same LSN are created co

subscriptionCheck failures on nightjar

2019-02-10 Thread Tom Lane
nightjar just did this: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2019-02-11%2004%3A33%3A07 The critical bit seems to be that the publisher side of the 010_truncate.pl test failed like so: 2019-02-10 23:55:58.765 EST [40771] sub3 LOG: statement: BEGIN READ ONLY ISOLAT