Re: Why is parula failing?

2024-05-16 Thread Robins Tharakan
On Tue, 14 May 2024 at 08:55, David Rowley wrote: > I've not seen any recent failures from Parula that relate to this > issue. The last one seems to have been about 4 weeks ago. > > I'm now wondering if it's time to revert the debugging code added in > 1db689715. Does anyone think differently?

Re: Why is parula failing?

2024-05-13 Thread Tom Lane
David Rowley writes: > I've not seen any recent failures from Parula that relate to this > issue. The last one seems to have been about 4 weeks ago. > I'm now wondering if it's time to revert the debugging code added in > 1db689715. Does anyone think differently? +1. It seems like we wrote

Re: Why is parula failing?

2024-05-13 Thread David Rowley
On Thu, 21 Mar 2024 at 13:53, David Rowley wrote: > > On Thu, 21 Mar 2024 at 12:36, Tom Lane wrote: > > So yeah, if we could have log_autovacuum_min_duration = 0 perhaps > > that would yield a clue. > > FWIW, I agree with your earlier statement about it looking very much > like auto-vacuum has

Re: Why is parula failing?

2024-04-16 Thread David Rowley
On Tue, 16 Apr 2024 at 18:58, Robins Tharakan wrote: > The last 25 consecutive runs have passed [1] after switching > REL_12_STABLE to -O0 ! So I am wondering whether that confirms that > the compiler version is to blame, and while we're still here, > is there anything else I could try? I don't

Re: Why is parula failing?

2024-04-16 Thread Robins Tharakan
On Mon, 15 Apr 2024 at 16:02, Tom Lane wrote: > David Rowley writes: > > If GetNowFloat() somehow was returning a negative number then we could > > end up with a large delay. But if gettimeofday() was so badly broken > > then wouldn't there be some evidence of this in the log timestamps on > >

Re: Why is parula failing?

2024-04-15 Thread Tom Lane
David Rowley writes: > #4 0x0090b7b4 in pg_sleep (fcinfo=) at misc.c:406 > delay = > delay_ms = > endtime = 0 > This endtime looks like a problem. It seems unlikely to be caused by > gettimeofday's timeval fields being zeroed given that the number of > seconds

Re: Why is parula failing?

2024-04-14 Thread Robins Tharakan
On Mon, 15 Apr 2024 at 14:55, David Rowley wrote: > If GetNowFloat() somehow was returning a negative number then we could > end up with a large delay. But if gettimeofday() was so badly broken > then wouldn't there be some evidence of this in the log timestamps on > failing runs? 3 things

Re: Why is parula failing?

2024-04-14 Thread David Rowley
On Mon, 15 Apr 2024 at 16:10, Robins Tharakan wrote: > - I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE > - I'll keep them (stuck) for this week, in case there's more we can get > from them (and to see how long they take) > - Attached are 'bt full' outputs for both (b.txt -

Re: Why is parula failing?

2024-04-14 Thread Robins Tharakan
On Sun, 14 Apr 2024 at 00:12, Tom Lane wrote: > If we were only supposed to sleep 0.1 seconds, how is it waiting > for 60 ms (and, presumably, repeating that)? The logic in > pg_sleep is pretty simple, and it's hard to think of anything except > the system clock jumping (far) backwards that
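[Editorial sketch, not part of the original message.] The point quoted above, that a simple deadline-based sleep loop can only overshoot if the clock moves backwards, can be illustrated with a small simulation. This is a toy model in Python, not PostgreSQL source. The loop shape (compute a deadline once, then repeatedly wait for the remaining time, capped per iteration) and the 600-second cap are assumptions based on a reading of pg_sleep. `FakeClock` and `simulate_pg_sleep` are hypothetical names introduced only for this illustration.

```python
import math


class FakeClock:
    """Hypothetical test clock; optionally jumps backwards once, right after the first read."""

    def __init__(self, start=0.0, jump_back=0.0):
        self.t = start
        self.jump_back = jump_back
        self._reads = 0

    def now(self):
        value = self.t
        self._reads += 1
        if self._reads == 1:
            # Simulate the system clock being stepped backwards after the
            # deadline has already been computed from the first reading.
            self.t -= self.jump_back
        return value

    def advance(self, seconds):
        self.t += seconds


def simulate_pg_sleep(secs, clock, max_wait_s=600.0):
    """Return the per-iteration wait timeouts (ms) of a deadline-based sleep loop.

    max_wait_s is an assumed per-iteration cap, not a value taken from the thread.
    """
    endtime = clock.now() + secs  # deadline computed once, up front
    waits_ms = []
    while True:
        delay = endtime - clock.now()
        if delay >= max_wait_s:
            delay_ms = int(max_wait_s * 1000)
        elif delay > 0.0:
            delay_ms = math.ceil(delay * 1000.0)
        else:
            break
        waits_ms.append(delay_ms)
        clock.advance(delay_ms / 1000.0)  # stand-in for the latch wait timing out
    return waits_ms


# Well-behaved clock: a single short wait.
normal = simulate_pg_sleep(0.1, FakeClock())
# Clock stepped back one hour after the deadline was computed:
# the loop keeps issuing maximum-length waits until the clock catches up.
jumped = simulate_pg_sleep(0.1, FakeClock(jump_back=3600.0))
print(normal, len(jumped))
```

Under these assumptions, a request for 0.1 seconds produces one short wait with a steady clock, but a long run of capped waits if the clock steps backwards after the deadline is computed. The same symptom would also appear if the deadline itself were computed from a bad clock reading, which the simulation does not distinguish.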

Re: Why is parula failing?

2024-04-13 Thread Tomas Vondra
On 4/13/24 15:02, Robins Tharakan wrote: > On Wed, 10 Apr 2024 at 10:24, David Rowley wrote: >> >> Master failed today for the first time since the compiler upgrade. >> Again reltuples == 48. > > Here's what I can add over the past few days: > - Almost all failures are either reltuples=48 or

Re: Why is parula failing?

2024-04-13 Thread Tomas Vondra
On 4/9/24 05:48, David Rowley wrote: > On Mon, 8 Apr 2024 at 23:56, Robins Tharakan wrote: >> #3 0x0083ed84 in WaitLatch (latch=, >> wakeEvents=wakeEvents@entry=41, timeout=60, >> wait_event_info=wait_event_info@entry=150994946) at latch.c:538 >> #4 0x00907404 in

Re: Why is parula failing?

2024-04-13 Thread Tom Lane
Robins Tharakan writes: > HEAD is stuck again on pg_sleep(), no CPU for the past hour or so. > Stack trace seems to be similar to last time. > #3 0x008437c4 in WaitLatch (latch=, > wakeEvents=wakeEvents@entry=41, timeout=60, > wait_event_info=wait_event_info@entry=150994946) at

Re: Why is parula failing?

2024-04-13 Thread Robins Tharakan
On Mon, 8 Apr 2024 at 21:25, Robins Tharakan wrote: > > > I'll keep an eye on this instance more often for the next few days. > (Let me know if I could capture more if a run gets stuck again) HEAD is stuck again on pg_sleep(), no CPU for the past hour or so. Stack trace seems to be similar to

Re: Why is parula failing?

2024-04-13 Thread Robins Tharakan
On Wed, 10 Apr 2024 at 10:24, David Rowley wrote: > > Master failed today for the first time since the compiler upgrade. > Again reltuples == 48. Here's what I can add over the past few days: - Almost all failures are either reltuples=48 or SIGABRTs - Almost all SIGABRTs are DDLs - CREATE INDEX

Re: Why is parula failing?

2024-04-09 Thread Robins Tharakan
On Wed, 10 Apr 2024 at 10:24, David Rowley wrote: > Master failed today for the first time since the compiler upgrade. > Again reltuples == 48. From the buildfarm members page, parula seems to be the only aarch64 + gcc 13.2 combination today, and then I suspect whether this is about gcc v13.2

Re: Why is parula failing?

2024-04-09 Thread David Rowley
On Tue, 9 Apr 2024 at 15:48, David Rowley wrote: > Still no partition_prune failures on master since the compiler version > change. There has been one [1] in REL_16_STABLE. I'm thinking it > might be worth backpatching the partition_prune debug to REL_16_STABLE > to see if we can learn anything

Re: Why is parula failing?

2024-04-08 Thread David Rowley
On Mon, 8 Apr 2024 at 23:56, Robins Tharakan wrote: > #3 0x0083ed84 in WaitLatch (latch=, > wakeEvents=wakeEvents@entry=41, timeout=60, > wait_event_info=wait_event_info@entry=150994946) at latch.c:538 > #4 0x00907404 in pg_sleep (fcinfo=) at misc.c:406 > #17

Re: Why is parula failing?

2024-04-08 Thread Robins Tharakan
On Tue, 2 Apr 2024 at 15:01, Tom Lane wrote: > "Tharakan, Robins" writes: > > So although HEAD ran fine, I saw multiple failures (v12, v13, v16) all of which passed on subsequent-tries, > > of which some were even "signal 6: Aborted". > > Ugh... parula didn't send any reports to buildfarm

Re: Why is parula failing?

2024-04-01 Thread Tom Lane
"Tharakan, Robins" writes: >> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests >> stabilize now. > So although HEAD ran fine, I saw multiple failures (v12, v13, v16) all of > which passed on subsequent-tries, > of which some were even "signal 6: Aborted". Ugh...

RE: Why is parula failing?

2024-04-01 Thread Tharakan, Robins
> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests > stabilize now. So although HEAD ran fine, I saw multiple failures (v12, v13, v16) all of which passed on subsequent-tries, of which some were even "signal 6: Aborted". FWIW, I compiled gcc v13.2 (default options)

RE: Why is parula failing?

2024-04-01 Thread Tharakan, Robins
> ... in connection with which, I can't help noticing that parula is using a > very old compiler: > > configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) > > From some quick checking around, that would have to be near the beginning of > aarch64 > support in RHEL (Fedora hadn't

Re: Why is parula failing?

2024-04-01 Thread Tom Lane
David Rowley writes: > On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote: >> ... in connection with which, I can't help noticing that parula >> is using a very old compiler: >> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17) >> I wonder why parula is using that when its

Re: Why is parula failing?

2024-04-01 Thread David Rowley
On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote: > > I wrote: > > I'd not looked closely enough at the previous failure, because > > now that I have, this is well out in WTFF territory: how can > > reltuples be greater than zero when relpages is zero? This can't > > be a state that autovacuum would

Re: Why is parula failing?

2024-03-29 Thread Tom Lane
I wrote: > I'd not looked closely enough at the previous failure, because > now that I have, this is well out in WTFF territory: how can > reltuples be greater than zero when relpages is zero? This can't > be a state that autovacuum would have left behind, unless it's > really seriously broken.
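[Editorial sketch, not part of the original message.] The inconsistency described above (reltuples greater than zero while relpages is zero) amounts to a simple invariant check over per-relation statistics. The snippet below is a minimal Python illustration over made-up rows. The function name `inconsistent_stats`, the dictionary layout, and the sample relation names are hypothetical; only the value reltuples = 48 echoes a figure mentioned elsewhere in this thread.

```python
def inconsistent_stats(rows):
    """Return names of relations reporting tuples but no pages.

    Each row is a dict with 'relname', 'relpages' and 'reltuples' keys,
    loosely mirroring the like-named pg_class columns (layout is assumed).
    """
    return [
        row["relname"]
        for row in rows
        if row["relpages"] == 0 and row["reltuples"] > 0
    ]


# Made-up sample data for illustration only.
sample = [
    {"relname": "part_ok", "relpages": 1, "reltuples": 48},
    {"relname": "part_empty", "relpages": 0, "reltuples": 0},
    {"relname": "part_odd", "relpages": 0, "reltuples": 48},
]
print(inconsistent_stats(sample))
```

A relation with zero pages and zero tuples is not flagged. Only the combination the message calls out is reported.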

Re: Why is parula failing?

2024-03-29 Thread Tom Lane
David Rowley writes: > On Wed, 27 Mar 2024 at 18:28, Tom Lane wrote: >> Let's wait a bit to see if it fails in HEAD ... but if not, would >> it be reasonable to back-patch the additional debugging output? > I think REL_16_STABLE has told us that it's not an auto-vacuum issue. > I'm uncertain

Re: Why is parula failing?

2024-03-27 Thread David Rowley
On Wed, 27 Mar 2024 at 18:28, Tom Lane wrote: > > David Rowley writes: > > Unfortunately, REL_16_STABLE does not have the additional debugging, > > so don't get to know what reltuples was set to. > > Let's wait a bit to see if it fails in HEAD ... but if not, would > it be reasonable to

Re: Why is parula failing?

2024-03-26 Thread Tom Lane
David Rowley writes: > Unfortunately, REL_16_STABLE does not have the additional debugging, > so don't get to know what reltuples was set to. Let's wait a bit to see if it fails in HEAD ... but if not, would it be reasonable to back-patch the additional debugging output?

Re: Why is parula failing?

2024-03-26 Thread David Rowley
On Tue, 26 Mar 2024 at 21:03, Tharakan, Robins wrote: > > > David Rowley writes: > > It would be good to have log_autovacuum_min_duration = 0 on this machine > > for a while. > > - Have set log_autovacuum_min_duration=0 on parula and a test run came out > okay. > - Also added REL_16_STABLE to

RE: Why is parula failing?

2024-03-26 Thread Tharakan, Robins
Hi David / Tom, > David Rowley writes: > It would be good to have log_autovacuum_min_duration = 0 on this machine for > a while. - Have set log_autovacuum_min_duration=0 on parula and a test run came out okay. - Also added REL_16_STABLE to the branches being tested (in case it matters here).

Re: Why is parula failing?

2024-03-25 Thread David Rowley
On Thu, 21 Mar 2024 at 14:19, Tom Lane wrote: > > David Rowley writes: > > We could also do something like the attached just in case we're > > barking up the wrong tree. > > Yeah, checking indisvalid isn't a bad idea. I'd put another > one further down, just before the DROP of table ab, so we >

Re: Why is parula failing?

2024-03-20 Thread Tom Lane
David Rowley writes: > We could also do something like the attached just in case we're > barking up the wrong tree. Yeah, checking indisvalid isn't a bad idea. I'd put another one further down, just before the DROP of table ab, so we can see the state both before and after the unstable tests.

Re: Why is parula failing?

2024-03-20 Thread David Rowley
On Thu, 21 Mar 2024 at 12:36, Tom Lane wrote: > So yeah, if we could have log_autovacuum_min_duration = 0 perhaps > that would yield a clue. FWIW, I agree with your earlier statement about it looking very much like auto-vacuum has run on that table, but equally, if something like the pg_index

Re: Why is parula failing?

2024-03-20 Thread Tom Lane
David Rowley writes: > Is it worth running that animal with log_autovacuum_min_duration = 0 > so we can see what's going on in terms of auto-vacuum auto-analyze in > the log? Maybe, but I'm not sure. I thought that if parula were somehow hitting an ill-timed autovac/autoanalyze, it should be

Re: Why is parula failing?

2024-03-20 Thread David Rowley
On Wed, 20 Mar 2024 at 08:58, Tom Lane wrote: > I suppose we could attach "autovacuum=off" settings to these tables, > but it doesn't seem to me that that should be necessary. These test > cases are several years old and haven't given trouble before. > Moreover, if that's necessary then there

Re: Why is parula failing?

2024-03-20 Thread Matthias van de Meent
On Wed, 20 Mar 2024 at 11:50, Matthias van de Meent wrote: > > On Tue, 19 Mar 2024 at 20:58, Tom Lane wrote: > > > > For the last few days, buildfarm member parula has been intermittently > > failing the partition_prune regression test, due to unexpected plan > > changes [1][2][3][4]. The

Re: Why is parula failing?

2024-03-20 Thread Matthias van de Meent
On Tue, 19 Mar 2024 at 20:58, Tom Lane wrote: > > For the last few days, buildfarm member parula has been intermittently > failing the partition_prune regression test, due to unexpected plan > changes [1][2][3][4]. The symptoms can be reproduced exactly by > inserting a "vacuum" of one or

Why is parula failing?

2024-03-19 Thread Tom Lane
For the last few days, buildfarm member parula has been intermittently failing the partition_prune regression test, due to unexpected plan changes [1][2][3][4]. The symptoms can be reproduced exactly by inserting a "vacuum" of one or another of the partitions of table "ab", so we can presume that