Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-27 Thread Thomas Munro
On Mon, Apr 24, 2017 at 2:53 PM, Tom Lane wrote: > Thomas Munro writes: >> Every machine sees the LSN moving backwards, but the code path that >> had the assertion is only reached if it decides to interpolate, which is >> timing dependent: there

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-24 Thread Andres Freund
On 2017-04-24 15:41:25 -0700, Mark Dilger wrote: > The recent fix in 546c13e11b29a5408b9d6a6e3cca301380b47f7f has local variable > overwriteOK > assigned but not used in twophase.c RecoverPreparedTransactions(void). I'm > not sure if that's > future-proofing or an oversight. It seems to be

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-24 Thread Mark Dilger
> On Apr 23, 2017, at 7:53 PM, Tom Lane wrote: > > Thomas Munro writes: >> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane wrote: >>> Fair enough. But I'd still like an explanation of why only about >>> half of the population

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-23 Thread Tom Lane
Thomas Munro writes: > On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane wrote: >> Fair enough. But I'd still like an explanation of why only about >> half of the population is showing a failure here. Seems like every >> machine should be seeing the

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-23 Thread Thomas Munro
On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane wrote: > Thomas Munro writes: >> On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane wrote: >>> As for this patch itself, is it reasonable to try to assert that the >>> timeline has in fact

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-23 Thread Simon Riggs
On 22 April 2017 at 16:41, Tom Lane wrote: > Thomas Munro writes: >> The assertion fails reliably for me, because standby2's reported write >> LSN jumps backwards after the timeline changes: for example I see >> 302 then 3028470 then 302

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-23 Thread Tom Lane
Thomas Munro writes: > On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane wrote: >> As for this patch itself, is it reasonable to try to assert that the >> timeline has in fact changed? > The protocol doesn't include the timeline in reply messages, so

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Tom Lane
Mark Dilger writes: >> On Apr 22, 2017, at 11:40 AM, Tom Lane wrote: >> In short then, I propose the attached patch to make these cases fail >> more reliably. We might extend this later to allow the old behaviors >> to be explicitly opted-into, but

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Mark Dilger
> On Apr 22, 2017, at 11:40 AM, Tom Lane wrote: > > I wrote: >> Whoa. This just turned into a much larger can of worms than I expected. >> How can it be that processes are getting assertion crashes and yet the >> test framework reports success anyway? That's impossibly >>

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Thomas Munro
On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane wrote: > Thomas Munro writes: >> The assertion fails reliably for me, because standby2's reported write >> LSN jumps backwards after the timeline changes: for example I see >> 302 then 3028470 then

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Tom Lane
I wrote: > Whoa. This just turned into a much larger can of worms than I expected. > How can it be that processes are getting assertion crashes and yet the > test framework reports success anyway? That's impossibly > broken/unacceptable. I poked into this on my laptop, where I'm able to

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Tom Lane
I wrote: > So 6 of 15 critters are getting the walsender.c assertion, > and those six plus six more are seeing the subtrans.c one, > and three are seeing neither one. There's probably a pattern > to that, don't know what it is. Ah, got it: skink *is* seeing the subtrans.c assertion, but not the

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Tom Lane
I wrote: > Taking a quick census of other buildfarm machines that are known to be > running the recovery test, it appears that most (not all) are seeing > one or both traps. But the test is reporting success anyway, everywhere > except on Noah's 32-bit AIX critters. Or, to be a bit more

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Tom Lane
I wrote: > Whoa. This just turned into a much larger can of worms than I expected. > How can it be that processes are getting assertion crashes and yet the > test framework reports success anyway? That's impossibly > broken/unacceptable. Taking a quick census of other buildfarm machines that

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Tom Lane
Thomas Munro writes: > The assertion fails reliably for me, because standby2's reported write > LSN jumps backwards after the timeline changes: for example I see > 302 then 3028470 then 302 followed by a normal progression. > Surprisingly,

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Thomas Munro
On Sat, Apr 22, 2017 at 9:13 PM, Simon Riggs wrote: > On 22 April 2017 at 06:45, Thomas Munro wrote: > >> Thanks. I'm away from my computer right now but will investigate this >> and send a fix later today. > > Thanks. I'll review later

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-22 Thread Simon Riggs
On 22 April 2017 at 06:45, Thomas Munro wrote: > Thanks. I'm away from my computer right now but will investigate this > and send a fix later today. Thanks. I'll review later today. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-21 Thread Thomas Munro
On Sat, Apr 22, 2017 at 9:04 AM, Tom Lane wrote: > Alvaro Herrera writes: >> Simon Riggs wrote: >>> Replication lag tracking for walsenders >>> >>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication. > >> Did anyone notice that this

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-21 Thread Andres Freund
On 2017-04-21 17:04:08 -0400, Tom Lane wrote: > Some excavation in the buildfarm database says that the coverage for > the recovery-check test has been mighty darn thin up until just > recently. Hm, not good. Just enabled it for most of my animals (there seems to be no point in having it on the

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-21 Thread Tom Lane
Alvaro Herrera writes: > Simon Riggs wrote: >> Replication lag tracking for walsenders >> >> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication. > Did anyone notice that this seems to be causing buildfarm member 'tern' > to fail the recovery check?

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

2017-04-21 Thread Alvaro Herrera
Simon Riggs wrote: > Replication lag tracking for walsenders > > Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication. Did anyone notice that this seems to be causing buildfarm member 'tern' to fail the recovery check? See here: