Re: Why is parula failing?

2024-05-16 Thread Robins Tharakan
On Tue, 14 May 2024 at 08:55, David Rowley  wrote:

> I've not seen any recent failures from Parula that relate to this
> issue.  The last one seems to have been about 4 weeks ago.
>
> I'm now wondering if it's time to revert the debugging code added in
> 1db689715.  Does anyone think differently?
>

Thanks for keeping an eye on this. Sadly, the older machine was decommissioned,
and thus parula hasn't been sending results to the buildfarm for the past few days.

I'll try to build a similar machine (but with a newer gcc etc.) and reopen this
thread in case I hit something similar.
-
robins


Re: Why is parula failing?

2024-05-13 Thread Tom Lane
David Rowley  writes:
> I've not seen any recent failures from Parula that relate to this
> issue.  The last one seems to have been about 4 weeks ago.

> I'm now wondering if it's time to revert the debugging code added in
> 1db689715.  Does anyone think differently?

+1.  It seems like we wrote off the other issue as a compiler bug,
so I don't have much trouble assuming that this one was too.

regards, tom lane




Re: Why is parula failing?

2024-05-13 Thread David Rowley
On Thu, 21 Mar 2024 at 13:53, David Rowley  wrote:
>
> On Thu, 21 Mar 2024 at 12:36, Tom Lane  wrote:
> > So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
> > that would yield a clue.
>
> FWIW, I agree with your earlier statement about it looking very much
> like auto-vacuum has run on that table, but equally, if something like
> the pg_index record was damaged we could get the same plan change.
>
> We could also do something like the attached just in case we're
> barking up the wrong tree.

I've not seen any recent failures from Parula that relate to this
issue.  The last one seems to have been about 4 weeks ago.

I'm now wondering if it's time to revert the debugging code added in
1db689715.  Does anyone think differently?

David




Re: Why is parula failing?

2024-04-16 Thread David Rowley
On Tue, 16 Apr 2024 at 18:58, Robins Tharakan  wrote:
> The last 25 consecutive runs have passed [1] after switching
> REL_12_STABLE to -O0! So I am wondering whether that confirms that
> the compiler version is to blame, and while we're still here,
> is there anything else I could try?

I don't think it's conclusive proof that it's a compiler bug.  -O0
could equally just have changed the timing sufficiently to not trigger
the issue.

It might be a long shot, but I wonder if it might be worth running a
workload such as:

psql -c "create table a (a int primary key, b text not null, c int not
null); insert into a values(1,'',0);" postgres
echo "update a set b = repeat('a', random(1,10)), c=c+1 where a = 1;"
> bench.sql
pgbench -n -t 12500 -c 8 -j 8 -f bench.sql postgres
psql -c "table a;" postgres

On a build with the original CFLAGS, I expect the value of "c" to be
100k (8 clients x 12,500 transactions) after running that. If it's not,
then something bad has happened.

That would exercise the locking code heavily and would show us if any
updates were missed due to locks not being correctly respected or seen
by other backends.

David




Re: Why is parula failing?

2024-04-16 Thread Robins Tharakan
On Mon, 15 Apr 2024 at 16:02, Tom Lane  wrote:
> David Rowley  writes:
> > If GetNowFloat() somehow was returning a negative number then we could
> > end up with a large delay.  But if gettimeofday() was so badly broken
> > then wouldn't there be some evidence of this in the log timestamps on
> > failing runs?
>
> And indeed that too.  I'm finding the "compiler bug" theory
> palatable.  Robins mentioned having built the compiler from
> source, which theoretically should work, but maybe something
> went wrong?  Or it's missing some important bug fix?
>
> It might be interesting to back the animal's CFLAGS down
> to -O0 and see if things get more stable.

The last 25 consecutive runs have passed [1] after switching
REL_12_STABLE to -O0! So I am wondering whether that confirms that
the compiler version is to blame, and while we're still here,
is there anything else I could try?

If not, by Sunday, I am considering switching parula to gcc v12 (or even
v14 experimental - given that massasauga [2] has been pretty stable since
its upgrade a few days back).

Reference:
1. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=parula&br=REL_12_STABLE
2. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE
-
robins


Re: Why is parula failing?

2024-04-15 Thread Tom Lane
David Rowley  writes:
> #4  0x0090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
> delay = <optimized out>
> delay_ms = <optimized out>
> endtime = 0

> This endtime looks like a problem. It seems unlikely to be caused by
> gettimeofday's timeval fields being zeroed given that the number of
> seconds should have been added to that.

Yes, that is very odd.

> I can't quite make sense of how we end up sleeping at all with a zero
> endtime. Assuming the subsequent GetNowFloats() worked, "delay =
> endtime - GetNowFloat();" would result in a negative sleep duration
> and we'd break out of the sleep loop.

If GetCurrentTimestamp() were returning assorted random values, it
wouldn't be hard to imagine this loop sleeping for a long time.
But it's very hard to see how that theory leads to an "endtime"
of exactly zero rather than some other number, and even harder
to credit two different runs getting "endtime" of exactly zero.

> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay.  But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

And indeed that too.  I'm finding the "compiler bug" theory
palatable.  Robins mentioned having built the compiler from
source, which theoretically should work, but maybe something
went wrong?  Or it's missing some important bug fix?

It might be interesting to back the animal's CFLAGS down
to -O0 and see if things get more stable.

regards, tom lane




Re: Why is parula failing?

2024-04-14 Thread Robins Tharakan
On Mon, 15 Apr 2024 at 14:55, David Rowley  wrote:
> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay.  But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

3 things stand out for me here, unsure if they're related somehow:

1. Issue where reltuples=48 (in essence, runs complete but a few tests fail)
2. SIGABRT - most of which are DDLs (runs complete, but the server crashes and
many tests fail)
3. pg_sleep() stuck (runs never complete, and IIUC these never get reported to
the buildfarm)

For #3, one thing I had done earlier (and then reverted) was to set
'wait_timeout' from its current undef to 2 hours. I'll set it to 2 hours again
in the hope that #3 starts getting reported to the buildfarm too.
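
(In build-farm.conf terms I assume that's something like the line below; I'm
taking the value to be in seconds, so treat this as a sketch rather than the
exact syntax:)

   'wait_timeout' => 2 * 60 * 60,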

> I'm not that familiar with the buildfarm config, but I do see some
> Valgrind related setting in there. Is PostgreSQL running under
> Valgrind on these runs?

Not yet. I was tempted, but valgrind has not yet been enabled on
this member. IIUC it's disabled by default:

   'use_valgrind' => undef,

-
robins


Re: Why is parula failing?

2024-04-14 Thread David Rowley
On Mon, 15 Apr 2024 at 16:10, Robins Tharakan  wrote:
> - I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
> - I'll keep them (stuck) for this week, in case there's more we can get
> from them (and to see how long they take)
> - Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt - 
> REL_16_STABLE)

Thanks for getting those.

#4  0x0090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
delay = <optimized out>
delay_ms = <optimized out>
endtime = 0

This endtime looks like a problem. It seems unlikely to be caused by
gettimeofday's timeval fields being zeroed given that the number of
seconds should have been added to that.

I can't quite make sense of how we end up sleeping at all with a zero
endtime. Assuming the subsequent GetNowFloat() calls worked, "delay =
endtime - GetNowFloat();" would result in a negative sleep duration
and we'd break out of the sleep loop.
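
For reference, the loop in question is roughly this shape (a from-memory
sketch, not the exact misc.c source, so details may differ):

    Datum
    pg_sleep(PG_FUNCTION_ARGS)
    {
        float8      secs = PG_GETARG_FLOAT8(0);
        float8      endtime;

        /* GetNowFloat() is roughly GetCurrentTimestamp() / 1e6, i.e. "now" in seconds */
        endtime = GetNowFloat() + secs;

        for (;;)
        {
            float8      delay;
            long        delay_ms;

            CHECK_FOR_INTERRUPTS();

            delay = endtime - GetNowFloat();
            if (delay >= 600.0)
                delay_ms = 600000;      /* wait in chunks of at most 10 minutes */
            else if (delay > 0.0)
                delay_ms = (long) ceil(delay * 1000.0);
            else
                break;                  /* zero/negative delay -> done sleeping */

            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             delay_ms,
                             WAIT_EVENT_PG_SLEEP);
            ResetLatch(MyLatch);
        }

        PG_RETURN_VOID();
    }

That 600000 ms cap is presumably where the timeout=600000 in the WaitLatch
frames comes from; for that branch to be taken, "delay" would have had to be
at least 600 seconds at some point.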

If GetNowFloat() somehow was returning a negative number then we could
end up with a large delay.  But if gettimeofday() was so badly broken
then wouldn't there be some evidence of this in the log timestamps on
failing runs?

I'm not that familiar with the buildfarm config, but I do see some
Valgrind related setting in there. Is PostgreSQL running under
Valgrind on these runs?

David




Re: Why is parula failing?

2024-04-14 Thread Robins Tharakan
On Sun, 14 Apr 2024 at 00:12, Tom Lane  wrote:
> If we were only supposed to sleep 0.1 seconds, how is it waiting
> for 60 ms (and, presumably, repeating that)?  The logic in
> pg_sleep is pretty simple, and it's hard to think of anything except
> the system clock jumping (far) backwards that would make this
> happen.  Any chance of extracting the local variables from the
> pg_sleep stack frame?

- I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
- I'll keep them (stuck) for this week, in case there's more we can get
from them (and to see how long they take)
- Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt -
REL_16_STABLE)

A few things to add:
- To reiterate, this instance has gcc v13.2 compiled without any
flags (my first time ever, TBH). IIRC 'make -k check' came out okay,
so at this point I don't think I did something obviously wrong when
building gcc from git.
- I installed gcc v14.0.1 experimental on massasauga (also an aarch64
and built from git) and despite multiple runs, it seems to be doing okay
[1].
- Next week (if I'm still scratching my head - and unless someone advises
otherwise), I'll upgrade parula to gcc 14 experimental to see if this is
about gcc maturity on graviton (for some reason). I don't expect much to
come out of it though (given Tomas's testing on rpi5), but it doesn't hurt.

Ref:
1. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE

-
robins
[postgres@ip-172-31-18-25 ~]$ pstack 26147
#0  0xadeda954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00842888 in WaitEventSetWaitBlock (nevents=1, 
occurred_events=, cur_timeout=600000, set=0x3148fac0) at 
latch.c:1570
#2  WaitEventSetWait (set=0x3148fac0, timeout=timeout@entry=600000, 
occurred_events=occurred_events@entry=0xd1194748, nevents=nevents@entry=1, 
wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x00842c44 in WaitLatch (latch=, 
wakeEvents=wakeEvents@entry=41, timeout=600000, 
wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090b7b4 in pg_sleep (fcinfo=) at misc.c:406
#5  0x00698430 in ExecInterpExpr (state=0x316a6040, 
econtext=0x316a5e38, isnull=) at execExprInterp.c:764
#6  0x006d0898 in ExecEvalExprSwitchContext (isNull=0xd11948bf, 
econtext=0x316a5e38, state=) at 
../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=) at 
../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=) at nodeResult.c:135
#9  0x006b92ec in ExecProcNode (node=0x316a5d28) at 
../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x316a5b38) at nodeGather.c:287
#11 ExecGather (pstate=0x316a5b38) at nodeGather.c:222
#12 0x0069c36c in ExecProcNode (node=0x316a5b38) at 
../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=, dest=0x31641e90, 
direction=, numberTuples=0, sendTuples=, 
operation=CMD_SELECT, use_parallel_mode=, planstate=0x316a5b38, 
estate=0x316a5910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x316459c0, direction=, 
count=0, execute_once=) at execMain.c:363
#15 0x00871564 in PortalRunSelect (portal=portal@entry=0x31512fb0, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x31641e90) at pquery.c:924
#16 0x00872d80 in PortalRun (portal=portal@entry=0x31512fb0, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=run_once@entry=true, dest=dest@entry=0x31641e90, 
altdest=altdest@entry=0x31641e90, qc=qc@entry=0xd1194c70) at pquery.c:768
#17 0x0086ea54 in exec_simple_query 
(query_string=query_string@entry=0x31493c90 "SELECT pg_sleep(0.1);") at 
postgres.c:1274
#18 0x0086f590 in PostgresMain (dbname=, 
username=) at postgres.c:4680
#19 0x0086ab20 in BackendMain (startup_data=, 
startup_data_len=) at backend_startup.c:105
#20 0x007c54d8 in postmaster_child_launch 
(child_type=child_type@entry=B_BACKEND, 
startup_data=startup_data@entry=0xd1195138 "", 
startup_data_len=startup_data_len@entry=4, 
client_sock=client_sock@entry=0xd1195140) at launch_backend.c:265
#21 0x007c8ec0 in BackendStartup (client_sock=0xd1195140) at 
postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cab68 in PostmasterMain (argc=argc@entry=8, 
argv=argv@entry=0x3148f320) at postmaster.c:1372
#24 0x00496cb8 in main (argc=8, argv=0x3148f320) at main.c:197
[postgres@ip-172-31-18-25 ~]$



Re: Why is parula failing?

2024-04-13 Thread Tomas Vondra
On 4/13/24 15:02, Robins Tharakan wrote:
> On Wed, 10 Apr 2024 at 10:24, David Rowley  wrote:
>>
>> Master failed today for the first time since the compiler upgrade.
>> Again reltuples == 48.
> 
> Here's what I can add over the past few days:
> - Almost all failures are either reltuples=48 or SIGABRTs
> - Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS
>   - A little too coincidental? Recent crashes have stack-trace if
> interested.
> 
> Barring the initial failures (during move to gcc 13.2), in the past week:
> - v15 somehow hasn't had a failure yet
> - v14 / v16 have got only 1 failure each
> - but v12 / v13 are lit up - failed multiple times.
> 

I happened to have an unused rpi5, so I installed Armbian aarch64 with
gcc 13.2.0, built with exactly the same configure options as parula, and
did ~300 loops of "make check" without a single failure.

So either parula is packaged in a different way, or maybe it's more of a
timing issue and the rpi5 is way slower than graviton3.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why is parula failing?

2024-04-13 Thread Tomas Vondra



On 4/9/24 05:48, David Rowley wrote:
> On Mon, 8 Apr 2024 at 23:56, Robins Tharakan  wrote:
>> #3  0x0083ed84 in WaitLatch (latch=, 
>> wakeEvents=wakeEvents@entry=41, timeout=600000, 
>> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
>> #4  0x00907404 in pg_sleep (fcinfo=) at misc.c:406
> 
>> #17 0x0086a944 in exec_simple_query 
>> (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at 
>> postgres.c:1274
> 
> I have no idea why WaitLatch has timeout=600000.  That should be no
> higher than timeout=100 for "SELECT pg_sleep(0.1);".  I have no
> theories aside from a failing RAM module, cosmic ray or a well-timed
> clock change between the first call to gettimeofday() in pg_sleep()
> and the next one.
> 
> I know this animal is running debug_parallel_query = regress, so that
> 0.1 Const did have to get serialized and copied to the worker, so
> there's another opportunity for the sleep duration to be stomped on,
> but that seems pretty unlikely.
> 

AFAIK that GUC is set only for HEAD, so it would not explain the
failures on the other branches.

> I can't think of a reason why the erroneous  reltuples=48 would be
> consistent over 2 failing runs if it were failing RAM or a cosmic ray.
> 

Yeah, that seems very unlikely.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why is parula failing?

2024-04-13 Thread Tom Lane
Robins Tharakan  writes:
> HEAD is stuck again on pg_sleep(), no CPU for the past hour or so.
> Stack trace seems to be similar to last time.

> #3  0x008437c4 in WaitLatch (latch=,
> wakeEvents=wakeEvents@entry=41, timeout=600000,
> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
> #4  0x0090c384 in pg_sleep (fcinfo=) at misc.c:406
> ...
> #17 0x0086f5d4 in exec_simple_query
> (query_string=query_string@entry=0x3b391c90
> "SELECT pg_sleep(0.1);") at postgres.c:1274

If we were only supposed to sleep 0.1 seconds, how is it waiting
for 600000 ms (and, presumably, repeating that)?  The logic in
pg_sleep is pretty simple, and it's hard to think of anything except
the system clock jumping (far) backwards that would make this
happen.  Any chance of extracting the local variables from the
pg_sleep stack frame?

regards, tom lane




Re: Why is parula failing?

2024-04-13 Thread Robins Tharakan
On Mon, 8 Apr 2024 at 21:25, Robins Tharakan  wrote:
>
>
> I'll keep an eye on this instance more often for the next few days.
> (Let me know if I could capture more if a run gets stuck again)


HEAD is stuck again on pg_sleep(), no CPU for the past hour or so.
Stack trace seems to be similar to last time.


$ pstack 24930
#0  0xb8280954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00843408 in WaitEventSetWaitBlock (nevents=1,
occurred_events=, cur_timeout=600000, set=0x3b38dac0) at
latch.c:1570
#2  WaitEventSetWait (set=0x3b38dac0, timeout=timeout@entry=600000,
occurred_events=occurred_events@entry=0xfd1d66c8, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x008437c4 in WaitLatch (latch=,
wakeEvents=wakeEvents@entry=41, timeout=600000,
wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090c384 in pg_sleep (fcinfo=) at misc.c:406
#5  0x00699350 in ExecInterpExpr (state=0x3b5a41a0,
econtext=0x3b5a3f98, isnull=) at execExprInterp.c:764
#6  0x006d1668 in ExecEvalExprSwitchContext (isNull=0xfd1d683f,
econtext=0x3b5a3f98, state=) at
../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=) at
../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=) at nodeResult.c:135
#9  0x006ba26c in ExecProcNode (node=0x3b5a3e88) at
../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x3b5a3c98) at nodeGather.c:287
#11 ExecGather (pstate=0x3b5a3c98) at nodeGather.c:222
#12 0x0069d28c in ExecProcNode (node=0x3b5a3c98) at
../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=, dest=0x3b5ae8e0,
direction=, numberTuples=0, sendTuples=,
operation=CMD_SELECT, use_parallel_mode=,
planstate=0x3b5a3c98, estate=0x3b5a3a70) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x3b59c250, direction=,
count=0, execute_once=) at execMain.c:363
#15 0x008720e4 in PortalRunSelect (portal=portal@entry=0x3b410fb0,
forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x3b5ae8e0) at pquery.c:924
#16 0x00873900 in PortalRun (portal=portal@entry=0x3b410fb0,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x3b5ae8e0,
altdest=altdest@entry=0x3b5ae8e0, qc=qc@entry=0xfd1d6bf0) at
pquery.c:768
#17 0x0086f5d4 in exec_simple_query
(query_string=query_string@entry=0x3b391c90
"SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x00870110 in PostgresMain (dbname=,
username=) at postgres.c:4680
#19 0x0086b6a0 in BackendMain (startup_data=,
startup_data_len=) at backend_startup.c:105
#20 0x007c6268 in postmaster_child_launch
(child_type=child_type@entry=B_BACKEND,
startup_data=startup_data@entry=0xfd1d70b8
"", startup_data_len=startup_data_len@entry=4,
client_sock=client_sock@entry=0xfd1d70c0)
at launch_backend.c:265
#21 0x007c9c50 in BackendStartup (client_sock=0xfd1d70c0) at
postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cb8f8 in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x3b38d320)
at postmaster.c:1372
#24 0x00496e18 in main (argc=8, argv=0x3b38d320) at main.c:197



 CPU% MEM%   TIME+  Command
.
.
  0.0  0.0  0:00.00 │ └─ /bin/sh -c cd /opt/postgres/build-farm-14 &&
PATH=/opt/gcc/home/ec2-user/proj/gcc/target/bin/
  0.0  0.1  0:00.07 │└─ /usr/bin/perl ./run_build.pl
--config=build-farm.conf HEAD --verbose
  0.0  0.0  0:00.00 │   └─ sh -c { cd pgsql.build/src/test/regress
&& make NO_LOCALE=1 check; echo $? > /opt/postg
  0.0  0.0  0:00.00 │  └─ make NO_LOCALE=1 check
  0.0  0.0  0:00.00 │ └─ /bin/sh -c echo "# +++ regress
check in src/test/regress +++" && PATH="/opt/postg
  0.0  0.0  0:00.10 │└─
../../../src/test/regress/pg_regress --temp-instance=./tmp_check
--inputdir=.
  0.0  0.0  0:00.01 │   ├─ psql -X -a -q -d regression
-v HIDE_TABLEAM=on -v HIDE_TOAST_COMPRESSION=on
  0.0  0.1  0:02.64 │   └─ postgres -D
/opt/postgres/build-farm-14/buildroot/HEAD/pgsql.build/src/test
  0.0  0.2  0:00.05 │  ├─ postgres: postgres
regression [local] SELECT
  0.0  0.0  0:00.06 │  ├─ postgres: logical
replication launcher
  0.0  0.1  0:00.36 │  ├─ postgres: autovacuum
launcher
  0.0  0.1  0:00.34 │  ├─ postgres: walwriter
  0.0  0.0  0:00.32 │  ├─ postgres: background
writer
  0.0  0.3  0:00.05 │  └─ postgres: checkpointer

-
robins



Re: Why is parula failing?

2024-04-13 Thread Robins Tharakan
On Wed, 10 Apr 2024 at 10:24, David Rowley  wrote:
>
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

Here's what I can add over the past few days:
- Almost all failures are either reltuples=48 or SIGABRTs
- Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS
  - A little too coincidental? Recent crashes have stack traces, if anyone's
interested.

Barring the initial failures (during move to gcc 13.2), in the past week:
- v15 somehow hasn't had a failure yet
- v14 / v16 have got only 1 failure each
- but v12 / v13 are lit up - failed multiple times.

-
robins


Re: Why is parula failing?

2024-04-09 Thread Robins Tharakan
On Wed, 10 Apr 2024 at 10:24, David Rowley  wrote:
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

From the buildfarm members page, parula seems to be the only aarch64 + gcc 13.2
combination today, so I'm wondering whether this is about gcc v13.2
maturity on aarch64.

I'll try to upgrade one of the other aarch64s I have (massasauga or snakefly)
and see if this is more about gcc 13.2 maturity on this architecture.
-
robins


Re: Why is parula failing?

2024-04-09 Thread David Rowley
On Tue, 9 Apr 2024 at 15:48, David Rowley  wrote:
> Still no partition_prune failures on master since the compiler version
> change.  There has been one [1] in REL_16_STABLE. I'm thinking it
> might be worth backpatching the partition_prune debug to REL_16_STABLE
> to see if we can learn anything new from it.

Master failed today for the first time since the compiler upgrade.
Again reltuples == 48.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-04-10%2000%3A27%3A02

David




Re: Why is parula failing?

2024-04-08 Thread David Rowley
On Mon, 8 Apr 2024 at 23:56, Robins Tharakan  wrote:
> #3  0x0083ed84 in WaitLatch (latch=, 
> wakeEvents=wakeEvents@entry=41, timeout=600000, 
> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
> #4  0x00907404 in pg_sleep (fcinfo=) at misc.c:406

> #17 0x0086a944 in exec_simple_query 
> (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at 
> postgres.c:1274

I have no idea why WaitLatch has timeout=600000.  That should be no
higher than timeout=100 for "SELECT pg_sleep(0.1);".  I have no
theories aside from a failing RAM module, cosmic ray or a well-timed
clock change between the first call to gettimeofday() in pg_sleep()
and the next one.

I know this animal is running debug_parallel_query = regress, so that
0.1 Const did have to get serialized and copied to the worker, so
there's another opportunity for the sleep duration to be stomped on,
but that seems pretty unlikely.

I can't think of a reason why the erroneous  reltuples=48 would be
consistent over 2 failing runs if it were failing RAM or a cosmic ray.

Still no partition_prune failures on master since the compiler version
change.  There has been one [1] in REL_16_STABLE. I'm thinking it
might be worth backpatching the partition_prune debug to REL_16_STABLE
to see if we can learn anything new from it.

David

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-04-08%2002%3A12%3A02




Re: Why is parula failing?

2024-04-08 Thread Robins Tharakan
On Tue, 2 Apr 2024 at 15:01, Tom Lane  wrote:
> "Tharakan, Robins"  writes:
> > So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of
> > which passed on subsequent tries, some of which were even "signal 6: Aborted".
>
> Ugh...


parula didn't send any reports to the buildfarm for the past 44 hours. I logged
in to see that postgres was stuck on pg_sleep(), which was quite odd! I captured
the backtrace and triggered another run on HEAD, which came out okay.

I'll keep an eye on this instance more often for the next few days.
(Let me know if I could capture more if a run gets stuck again)


(gdb) bt
#0  0x952ae954 in epoll_pwait () from /lib64/libc.so.6
#1  0x0083e9c8 in WaitEventSetWaitBlock (nevents=1,
occurred_events=, cur_timeout=297992, set=0x2816dac0) at
latch.c:1570
#2  WaitEventSetWait (set=0x2816dac0, timeout=timeout@entry=600000,
occurred_events=occurred_events@entry=0xc395ed28, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x0083ed84 in WaitLatch (latch=,
wakeEvents=wakeEvents@entry=41, timeout=600000,
wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x00907404 in pg_sleep (fcinfo=) at misc.c:406
#5  0x00696b10 in ExecInterpExpr (state=0x28384040,
econtext=0x28383e38, isnull=) at execExprInterp.c:764
#6  0x006ceef8 in ExecEvalExprSwitchContext (isNull=0xc395ee9f,
econtext=0x28383e38, state=) at
../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=) at
../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=) at nodeResult.c:135
#9  0x006b7aec in ExecProcNode (node=0x28383d28) at
../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x28383b38) at nodeGather.c:287
#11 ExecGather (pstate=0x28383b38) at nodeGather.c:222
#12 0x0069aa4c in ExecProcNode (node=0x28383b38) at
../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=, dest=0x2831ffb0,
direction=, numberTuples=0, sendTuples=,
operation=CMD_SELECT, use_parallel_mode=,
planstate=0x28383b38, estate=0x28383910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x283239c0, direction=,
count=0, execute_once=) at execMain.c:363
#15 0x0086d454 in PortalRunSelect (portal=portal@entry=0x281f0fb0,
forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x2831ffb0) at pquery.c:924
#16 0x0086ec70 in PortalRun (portal=portal@entry=0x281f0fb0,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x2831ffb0,
altdest=altdest@entry=0x2831ffb0, qc=qc@entry=0xc395f250) at
pquery.c:768
#17 0x0086a944 in exec_simple_query
(query_string=query_string@entry=0x28171c90
"SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x0086b480 in PostgresMain (dbname=,
username=) at postgres.c:4680
#19 0x00866a0c in BackendMain (startup_data=,
startup_data_len=) at backend_startup.c:101
#20 0x007c1738 in postmaster_child_launch
(child_type=child_type@entry=B_BACKEND,
startup_data=startup_data@entry=0xc395f718
"", startup_data_len=startup_data_len@entry=4,
client_sock=client_sock@entry=0xc395f720)
at launch_backend.c:265
#21 0x007c5120 in BackendStartup (client_sock=0xc395f720) at
postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007c6dc8 in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x2816d320)
at postmaster.c:1372
#24 0x00496bb8 in main (argc=8, argv=0x2816d320) at main.c:197


>
> The update_personality.pl script in the buildfarm client distro
> is what to use to adjust OS version or compiler version data.
>
Thanks. Fixed that.

-
robins


Re: Why is parula failing?

2024-04-01 Thread Tom Lane
"Tharakan, Robins"  writes:
>> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests 
>> stabilize now.

> So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of
> which passed on subsequent tries, some of which were even "signal 6: Aborted".

Ugh...

> Also, it'd be great if someone could point me to a way to update the 
> "Compiler" section in "System Detail" on
> the buildfarm page (it wrongly shows GCC as v7.3.1).

The update_personality.pl script in the buildfarm client distro
is what to use to adjust OS version or compiler version data.

regards, tom lane




RE: Why is parula failing?

2024-04-01 Thread Tharakan, Robins
> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests 
> stabilize now.

So although HEAD ran fine, I saw multiple failures (v12, v13, v16), all of
which passed on subsequent tries, some of which were even "signal 6: Aborted".

FWIW, I compiled gcc v13.2 (default options) from source, which IIUC shouldn't
be to blame. Two other possible reasons could be that the buildfarm doesn't have
an aarch64 + gcc 13.2 combination (from a quick check, I couldn't see any), or
that this hardware is flaky.

Either way, this instance is a throw-away so let me know if this isn't helping. 
I'll swap it out in case there's not
much benefit to be had.

Also, it'd be great if someone could point me to a way to update the "Compiler" 
section in "System Detail" on
the buildfarm page (it wrongly shows GCC as v7.3.1).

-
thanks
robins




RE: Why is parula failing?

2024-04-01 Thread Tharakan, Robins
> ... in connection with which, I can't help noticing that parula is using a 
> very old compiler:
>
> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>
> From some quick checking around, that would have to be near the beginning of 
> aarch64
> support in RHEL (Fedora hadn't promoted aarch64 to a primary architecture 
> until earlier
> that same year).  It's not exactly hard to believe that there were some 
> lingering compiler bugs.

> I wonder why parula is using that when its underlying system seems markedly 
> newer (the kernel at least has a recent build date).

Parula used GCC v7.3.1 since that's what came by default.
I've now switched to GCC v13.2 and triggered a run. Let's see if the tests 
stabilize now.
-
Robins




Re: Why is parula failing?

2024-04-01 Thread Tom Lane
David Rowley  writes:
> On Sat, 30 Mar 2024 at 09:17, Tom Lane  wrote:
>> ... in connection with which, I can't help noticing that parula
>> is using a very old compiler:
>> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>> I wonder why parula is using that when its underlying system seems
>> markedly newer (the kernel at least has a recent build date).

> It could be, but wouldn't the bug have to relate to the locking code
> and be caused by some other backend stomping on the memory?
> Otherwise, shouldn't it be failing consistently every run rather than
> sporadically?

Your guess is as good as mine ... but we still haven't seen this
class of failure on any other animal, so the idea that it's strictly
a chance timing issue is getting weaker all the time.

regards, tom lane




Re: Why is parula failing?

2024-04-01 Thread David Rowley
On Sat, 30 Mar 2024 at 09:17, Tom Lane  wrote:
>
> I wrote:
> > I'd not looked closely enough at the previous failure, because
> > now that I have, this is well out in WTFF territory: how can
> > reltuples be greater than zero when relpages is zero?  This can't
> > be a state that autovacuum would have left behind, unless it's
> > really seriously broken.  I think we need to be looking for
> > explanations like "memory stomp" or "compiler bug".
>
> ... in connection with which, I can't help noticing that parula
> is using a very old compiler:
>
> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>
> From some quick checking around, that would have to be near the
> beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
> to a primary architecture until earlier that same year).  It's not
> exactly hard to believe that there were some lingering compiler bugs.
> I wonder why parula is using that when its underlying system seems
> markedly newer (the kernel at least has a recent build date).

It could be, but wouldn't the bug have to relate to the locking code
and be caused by some other backend stomping on the memory?
Otherwise, shouldn't it be failing consistently every run rather than
sporadically?

David




Re: Why is parula failing?

2024-03-29 Thread Tom Lane
I wrote:
> I'd not looked closely enough at the previous failure, because
> now that I have, this is well out in WTFF territory: how can
> reltuples be greater than zero when relpages is zero?  This can't
> be a state that autovacuum would have left behind, unless it's
> really seriously broken.  I think we need to be looking for
> explanations like "memory stomp" or "compiler bug".

... in connection with which, I can't help noticing that parula
is using a very old compiler:

configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)

From some quick checking around, that would have to be near the
beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
to a primary architecture until earlier that same year).  It's not
exactly hard to believe that there were some lingering compiler bugs.
I wonder why parula is using that when its underlying system seems
markedly newer (the kernel at least has a recent build date).

regards, tom lane




Re: Why is parula failing?

2024-03-29 Thread Tom Lane
David Rowley  writes:
> On Wed, 27 Mar 2024 at 18:28, Tom Lane  wrote:
>> Let's wait a bit to see if it fails in HEAD ... but if not, would
>> it be reasonable to back-patch the additional debugging output?

> I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
> I'm uncertain what a few more failures in master will tell us aside
> from if reltuples == 48 is consistent or if that value is going to
> fluctuate.

> Let's give it a week and see if it fails a few more times.

We have another failure today [1] with the same symptom:

  ab_a2          |        0 |        -1 |            |                0 |                 0
- ab_a2_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b1       |        0 |        48 |            |                0 |                 0
  ab_a2_b1_a_idx |        1 |         0 | t          |                  |                  

Different table, same "48" reltuples.  But I have to confess that
I'd not looked closely enough at the previous failure, because
now that I have, this is well out in WTFF territory: how can
reltuples be greater than zero when relpages is zero?  This can't
be a state that autovacuum would have left behind, unless it's
really seriously broken.  I think we need to be looking for
explanations like "memory stomp" or "compiler bug".

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-29%2012%3A46%3A02




Re: Why is parula failing?

2024-03-27 Thread David Rowley
On Wed, 27 Mar 2024 at 18:28, Tom Lane  wrote:
>
> David Rowley  writes:
> > Unfortunately, REL_16_STABLE does not have the additional debugging,
> > so don't get to know what reltuples was set to.
>
> Let's wait a bit to see if it fails in HEAD ... but if not, would
> it be reasonable to back-patch the additional debugging output?

I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
I'm uncertain what a few more failures in master will tell us aside
from if reltuples == 48 is consistent or if that value is going to
fluctuate.

Let's give it a week and see if it fails a few more times.

David




Re: Why is parula failing?

2024-03-26 Thread Tom Lane
David Rowley  writes:
> Unfortunately, REL_16_STABLE does not have the additional debugging,
> so don't get to know what reltuples was set to.

Let's wait a bit to see if it fails in HEAD ... but if not, would
it be reasonable to back-patch the additional debugging output?

regards, tom lane




Re: Why is parula failing?

2024-03-26 Thread David Rowley
On Tue, 26 Mar 2024 at 21:03, Tharakan, Robins  wrote:
>
> > David Rowley  writes:
> > It would be good to have log_autovacuum_min_duration = 0 on this machine 
> > for a while.
>
> - Have set log_autovacuum_min_duration=0 on parula and a test run came out 
> okay.
> - Also added REL_16_STABLE to the branches being tested (in case it matters 
> here).

Thanks for doing that.

I see REL_16_STABLE has failed twice already with the same issue.  I
don't see any autovacuum / autoanalyze in the log, so I guess that
rules out auto vacuum activity causing this.

Unfortunately, REL_16_STABLE does not have the additional debugging,
so don't get to know what reltuples was set to.

David




RE: Why is parula failing?

2024-03-26 Thread Tharakan, Robins
Hi David / Tom,

> David Rowley  writes:
> It would be good to have log_autovacuum_min_duration = 0 on this machine for 
> a while.

- Have set log_autovacuum_min_duration=0 on parula and a test run came out okay.
- Also added REL_16_STABLE to the branches being tested (in case it matters 
here).

Let me know if I can help with any other changes here.
-
Robins | tharar@ | adelaide@australia


Re: Why is parula failing?

2024-03-25 Thread David Rowley
On Thu, 21 Mar 2024 at 14:19, Tom Lane  wrote:
>
> David Rowley  writes:
> > We could also do something like the attached just in case we're
> > barking up the wrong tree.
>
> Yeah, checking indisvalid isn't a bad idea.  I'd put another
> one further down, just before the DROP of table ab, so we
> can see the state both before and after the unstable tests.

So it's taken quite a while to finally fail again.

Effectively, we're getting:

    relname     | relpages | reltuples | indisvalid | autovacuum_count | autoanalyze_count
----------------+----------+-----------+------------+------------------+-------------------
- ab_a2_b2      |        0 |        -1 |            |                0 |                 0
+ ab_a2_b2      |        0 |        48 |            |                0 |                 0

I see AddNewRelationTuple() does set reltuples to -1, so I can't quite
figure out why 48 is in there.  Even if auto-analyze had somehow
mistakenly run and the autoanalyze_count stats just were not
up-to-date yet, the table has zero blocks, and I don't see how
acquire_sample_rows() would set *totalrows to anything other than 0.0
in this case.  For the vacuum case, I see that reltuples is set from:

/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
                                                 vacrel->scanned_pages,
                                                 vacrel->live_tuples);

Again, hard to see how that could come to anything other than zero
given that rel_pages and scanned_pages should be 0.

Looking at the binary representation of a float of -1 vs 48, they're
not nearly the same: 0xBF800000 vs 0x42400000, so it's not looking
like a flipped bit.
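
(For anyone wanting to double-check those bit patterns, a throwaway standalone
C snippet like the following prints them; nothing PostgreSQL-specific here:)

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        /* reltuples is a float4, so look at single-precision bit patterns */
        float       vals[] = {-1.0f, 48.0f};

        for (int i = 0; i < 2; i++)
        {
            uint32_t    bits;

            /* reinterpret the float's bytes as an unsigned 32-bit integer */
            memcpy(&bits, &vals[i], sizeof(bits));
            printf("%5.1f -> 0x%08X\n", vals[i], (unsigned) bits);
        }
        return 0;
    }

    /* prints:  -1.0 -> 0xBF800000
                48.0 -> 0x42400000 */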

It would be good to have log_autovacuum_min_duration = 0 on this
machine for a while.

David




Re: Why is parula failing?

2024-03-20 Thread Tom Lane
David Rowley  writes:
> We could also do something like the attached just in case we're
> barking up the wrong tree.

Yeah, checking indisvalid isn't a bad idea.  I'd put another
one further down, just before the DROP of table ab, so we
can see the state both before and after the unstable tests.

regards, tom lane




Re: Why is parula failing?

2024-03-20 Thread David Rowley
On Thu, 21 Mar 2024 at 12:36, Tom Lane  wrote:
> So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
> that would yield a clue.

FWIW, I agree with your earlier statement about it looking very much
like auto-vacuum has run on that table, but equally, if something like
the pg_index record was damaged we could get the same plan change.

We could also do something like the attached just in case we're
barking up the wrong tree.

David
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7ca98397ae..f6950e160d 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2449,6 +2449,37 @@ create index ab_a3_b3_a_idx on ab_a3_b3 (a);
 set enable_hashjoin = 0;
 set enable_mergejoin = 0;
 set enable_memoize = 0;
+-- Temporarily install some debugging to investigate plan instability.
+select c.relname,c.relpages,c.reltuples,i.indisvalid,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+left join pg_index i on c.oid = i.indexrelid
+where c.relname like 'ab\_%' order by c.relname;
+    relname     | relpages | reltuples | indisvalid | autovacuum_count | autoanalyze_count 
+----------------+----------+-----------+------------+------------------+-------------------
+ ab_a1          |        0 |        -1 |            |                0 |                 0
+ ab_a1_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a1_b1_a_idx |        1 |         0 | t          |                  |                  
+ ab_a1_b2       |        0 |        -1 |            |                0 |                 0
+ ab_a1_b2_a_idx |        1 |         0 | t          |                  |                  
+ ab_a1_b3       |        0 |        -1 |            |                0 |                 0
+ ab_a1_b3_a_idx |        1 |         0 | t          |                  |                  
+ ab_a2          |        0 |        -1 |            |                0 |                 0
+ ab_a2_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b1_a_idx |        1 |         0 | t          |                  |                  
+ ab_a2_b2       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b2_a_idx |        1 |         0 | t          |                  |                  
+ ab_a2_b3       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b3_a_idx |        1 |         0 | t          |                  |                  
+ ab_a3          |        0 |        -1 |            |                0 |                 0
+ ab_a3_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a3_b1_a_idx |        1 |         0 | t          |                  |                  
+ ab_a3_b2       |        0 |        -1 |            |                0 |                 0
+ ab_a3_b2_a_idx |        1 |         0 | t          |                  |                  
+ ab_a3_b3       |        0 |        -1 |            |                0 |                 0
+ ab_a3_b3_a_idx |        1 |         0 | t          |                  |                  
+(21 rows)
+
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(0, 0, 1)');
                         explain_parallel_append                         
 -------------------------------------------------------------------------

diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index a09b27d820..3ce90f2d89 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -607,6 +607,13 @@ set enable_hashjoin = 0;
 set enable_mergejoin = 0;
 set enable_memoize = 0;
 
+-- Temporarily install some debugging to investigate plan instability.
+select c.relname,c.relpages,c.reltuples,i.indisvalid,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+left join pg_index i on c.oid = i.indexrelid
+where c.relname like 'ab\_%' order by c.relname;
+
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(0, 0, 1)');
 
 -- Ensure the same partitions are pruned when we make the nested loop


Re: Why is parula failing?

2024-03-20 Thread Tom Lane
David Rowley  writes:
> Is it worth running that animal with log_autovacuum_min_duration = 0
> so we can see what's going on in terms of auto-vacuum auto-analyze in
> the log?

Maybe, but I'm not sure.  I thought that if parula were somehow
hitting an ill-timed autovac/autoanalyze, it should be possible to
make that reproducible by inserting "pg_sleep(60)" or so in the test
script, to give the autovac daemon plenty of time to come around and
do the dirty deed.  No luck though --- the results didn't change for
me.  So now I'm not sure what is going on.

Perhaps though it is autovacuum, and there's some environment-specific
enabling condition that parula has and my machine doesn't (which
could also help explain why no other animal is doing this).
So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
that would yield a clue.

regards, tom lane




Re: Why is parula failing?

2024-03-20 Thread David Rowley
On Wed, 20 Mar 2024 at 08:58, Tom Lane  wrote:
> I suppose we could attach "autovacuum=off" settings to these tables,
> but it doesn't seem to me that that should be necessary.  These test
> cases are several years old and haven't given trouble before.
> Moreover, if that's necessary then there are a lot of other regression
> tests that would presumably need the same treatment.

Is it worth running that animal with log_autovacuum_min_duration = 0
so we can see what's going on in terms of auto-vacuum auto-analyze in
the log?
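
(I assume that would just be a matter of adding it to the animal's
extra_config in build-farm.conf, along these lines -- a sketch only, exact
syntax per the sample config:)

   'extra_config' => {
       DEFAULT => [
           'log_autovacuum_min_duration = 0',
       ],
   },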

David




Re: Why is parula failing?

2024-03-20 Thread Matthias van de Meent
On Wed, 20 Mar 2024 at 11:50, Matthias van de Meent
 wrote:
>
> On Tue, 19 Mar 2024 at 20:58, Tom Lane  wrote:
> >
> > For the last few days, buildfarm member parula has been intermittently
> > failing the partition_prune regression test, due to unexpected plan
> > changes [1][2][3][4].  The symptoms can be reproduced exactly by
> > inserting a "vacuum" of one or another of the partitions of table
> > "ab", so we can presume that the underlying cause is an autovacuum run
> > against one of those tables.  But why is that happening?  None of
> > those tables receive any insertions during the test, so I don't
> > understand why autovacuum would trigger on them.

> This may be purely coincidental, but yesterday I also did have a
> seemingly random failure in the recovery test suite locally, in
> t/027_stream_regress.pl, where it changed the join order of exactly
> one of the queries (that uses the tenk table multiple times, iirc 3x
> or so).
[...]
> Sadly, I did not save the output of that test run, so this is just
> about all the information I have. The commit I was testing on was
> based on ca108be7, and system config is available if needed.

It looks like Cirrus CI reproduced my issue with recent commit
a0390f6c [0] and previously also with 4665cebc [1], 875e46a0 [2],
449e798c [3], and other older commits, on both the Windows Server 2019
build and Debian for a39f1a36 (with slightly different plan changes on
this Debian run).  That should rule out most of the environments.

After analyzing the logs produced by the various primaries, I can't
find a good explanation why they would have this issue.  The table is
vacuum analyzed before the regression tests start, and in this run
autovacuum/autoanalyze doesn't seem to kick in until (at least)
seconds after this query was run.


Kind regards,

Matthias van de Meent

[0] https://cirrus-ci.com/task/6295909005262848
[1] https://cirrus-ci.com/task/5229745516838912
[2] https://cirrus-ci.com/task/5098544567156736
[3] https://cirrus-ci.com/task/4783906419900416




Re: Why is parula failing?

2024-03-20 Thread Matthias van de Meent
On Tue, 19 Mar 2024 at 20:58, Tom Lane  wrote:
>
> For the last few days, buildfarm member parula has been intermittently
> failing the partition_prune regression test, due to unexpected plan
> changes [1][2][3][4].  The symptoms can be reproduced exactly by
> inserting a "vacuum" of one or another of the partitions of table
> "ab", so we can presume that the underlying cause is an autovacuum run
> against one of those tables.  But why is that happening?  None of
> those tables receive any insertions during the test, so I don't
> understand why autovacuum would trigger on them.
>
> I suppose we could attach "autovacuum=off" settings to these tables,
> but it doesn't seem to me that that should be necessary.  These test
> cases are several years old and haven't given trouble before.
> Moreover, if that's necessary then there are a lot of other regression
> tests that would presumably need the same treatment.
>
> I'm also baffled why this wasn't happening before.  I scraped the
> buildfarm logs for 3 months back and confirmed my impression that
> this is a new failure mode.  But one of these four runs is on
> REL_14_STABLE, eliminating the theory that the cause is a recent
> HEAD-only change.
>
> Any ideas?

This may be purely coincidental, but yesterday I also did have a
seemingly random failure in the recovery test suite locally, in
t/027_stream_regress.pl, where it changed the join order of exactly
one of the queries (that uses the tenk table multiple times, iirc 3x
or so). As the work I was doing wasn't related to join order-related
problems, this surprised me. After checking for test file changes
(none), I re-ran the tests without recompilation and the failure went
away, so I attributed this to an untimely autoanalyze. However, as
this was also an unexpected plan change in the tests this could be
related.

Sadly, I did not save the output of that test run, so this is just
about all the information I have. The commit I was testing on was
based on ca108be7, and system config is available if needed.

-Matthias




Why is parula failing?

2024-03-19 Thread Tom Lane
For the last few days, buildfarm member parula has been intermittently
failing the partition_prune regression test, due to unexpected plan
changes [1][2][3][4].  The symptoms can be reproduced exactly by
inserting a "vacuum" of one or another of the partitions of table
"ab", so we can presume that the underlying cause is an autovacuum run
against one of those tables.  But why is that happening?  None of
those tables receive any insertions during the test, so I don't
understand why autovacuum would trigger on them.

I suppose we could attach "autovacuum=off" settings to these tables,
but it doesn't seem to me that that should be necessary.  These test
cases are several years old and haven't given trouble before.
Moreover, if that's necessary then there are a lot of other regression
tests that would presumably need the same treatment.
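
(For reference, the kind of per-table setting I mean would be something like
the following, using the autovacuum_enabled reloption on each partition --
shown only for illustration, not something I'm proposing we actually commit:)

    alter table ab_a2_b1 set (autovacuum_enabled = off);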

I'm also baffled why this wasn't happening before.  I scraped the
buildfarm logs for 3 months back and confirmed my impression that
this is a new failure mode.  But one of these four runs is on
REL_14_STABLE, eliminating the theory that the cause is a recent
HEAD-only change.

Any ideas?

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-19%2016%3A09%3A02
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-18%2011%3A13%3A02
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-14%2011%3A40%3A02
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-14%2019%3A00%3A02