Re: Why is parula failing?
On Tue, 14 May 2024 at 08:55, David Rowley wrote:
> I've not seen any recent failures from Parula that relate to this
> issue. The last one seems to have been about 4 weeks ago.
>
> I'm now wondering if it's time to revert the debugging code added in
> 1db689715. Does anyone think differently?

Thanks for keeping an eye. Sadly, the older machine was decommissioned,
and thus parula hasn't been sending results to the buildfarm for the
past few days. I'll try to build a similar machine (but with a newer gcc
etc.) and reopen this thread in case I hit something similar.

- robins
Re: Why is parula failing?
David Rowley writes:
> I've not seen any recent failures from Parula that relate to this
> issue. The last one seems to have been about 4 weeks ago.
> I'm now wondering if it's time to revert the debugging code added in
> 1db689715. Does anyone think differently?

+1. It seems like we wrote off the other issue as a compiler bug,
so I don't have much trouble assuming that this one was too.

regards, tom lane
Re: Why is parula failing?
On Thu, 21 Mar 2024 at 13:53, David Rowley wrote:
>
> On Thu, 21 Mar 2024 at 12:36, Tom Lane wrote:
> > So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
> > that would yield a clue.
>
> FWIW, I agree with your earlier statement about it looking very much
> like auto-vacuum has run on that table, but equally, if something like
> the pg_index record was damaged we could get the same plan change.
>
> We could also do something like the attached just in case we're
> barking up the wrong tree.

I've not seen any recent failures from Parula that relate to this
issue. The last one seems to have been about 4 weeks ago.

I'm now wondering if it's time to revert the debugging code added in
1db689715. Does anyone think differently?

David
Re: Why is parula failing?
On Tue, 16 Apr 2024 at 18:58, Robins Tharakan wrote:
> The last 25 consecutive runs have passed [1] after switching
> REL_12_STABLE to -O0! So I am wondering whether that confirms that
> the compiler version is to blame, and while we're still here,
> is there anything else I could try?

I don't think it's conclusive proof that it's a compiler bug. -O0
could equally just have changed the timing sufficiently to not trigger
the issue.

It might be a long shot, but I wonder if it might be worth running a
workload such as:

psql -c "create table a (a int primary key, b text not null, c int not null); insert into a values(1,'',0);" postgres
echo "update a set b = repeat('a', random(1,10)), c=c+1 where a = 1;" > bench.sql
pgbench -n -t 12500 -c 8 -j 8 -f bench.sql postgres
psql -c "table a;" postgres

on a build with the original CFLAGS. I expect the value of "c" to be
100k after running that (12,500 transactions times 8 clients). If it's
not, then something bad has happened. That would exercise the locking
code heavily and would show us if any updates were missed due to locks
not being correctly respected or seen by other backends.

David
Re: Why is parula failing?
On Mon, 15 Apr 2024 at 16:02, Tom Lane wrote:
> David Rowley writes:
> > If GetNowFloat() somehow was returning a negative number then we could
> > end up with a large delay. But if gettimeofday() was so badly broken
> > then wouldn't there be some evidence of this in the log timestamps on
> > failing runs?
>
> And indeed that too. I'm finding the "compiler bug" theory
> palatable. Robins mentioned having built the compiler from
> source, which theoretically should work, but maybe something
> went wrong? Or it's missing some important bug fix?
>
> It might be interesting to back the animal's CFLAGS down
> to -O0 and see if things get more stable.

The last 25 consecutive runs have passed [1] after switching
REL_12_STABLE to -O0! So I am wondering whether that confirms that the
compiler version is to blame, and while we're still here, is there
anything else I could try?

If not, by Sunday, I am considering switching parula to gcc v12 (or
even v14 experimental - given that massasauga [2] has been pretty
stable since its upgrade a few days back).

Reference:
1. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=parula&br=REL_12_STABLE
2. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE

- robins
Re: Why is parula failing?
David Rowley writes:
> #4  0x0090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
>         delay = <optimized out>
>         delay_ms = <optimized out>
>         endtime = 0

> This endtime looks like a problem. It seems unlikely to be caused by
> gettimeofday's timeval fields being zeroed given that the number of
> seconds should have been added to that.

Yes, that is very odd.

> I can't quite make sense of how we end up sleeping at all with a zero
> endtime. Assuming the subsequent GetNowFloat() calls worked, "delay =
> endtime - GetNowFloat();" would result in a negative sleep duration
> and we'd break out of the sleep loop.

If GetCurrentTimestamp() were returning assorted random values, it
wouldn't be hard to imagine this loop sleeping for a long time. But
it's very hard to see how that theory leads to an "endtime" of exactly
zero rather than some other number, and even harder to credit two
different runs getting "endtime" of exactly zero.

> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay. But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

And indeed that too. I'm finding the "compiler bug" theory
palatable. Robins mentioned having built the compiler from
source, which theoretically should work, but maybe something
went wrong? Or it's missing some important bug fix?

It might be interesting to back the animal's CFLAGS down
to -O0 and see if things get more stable.

regards, tom lane
Re: Why is parula failing?
On Mon, 15 Apr 2024 at 14:55, David Rowley wrote:
> If GetNowFloat() somehow was returning a negative number then we could
> end up with a large delay. But if gettimeofday() was so badly broken
> then wouldn't there be some evidence of this in the log timestamps on
> failing runs?

Three things stand out for me here; unsure if they're related somehow:

1. Issue where reltuples=48 (in essence runs complete, but a few tests fail)
2. SIGABRT - most of which are DDLs (runs complete, but the engine crashes + many tests fail)
3. pg_sleep() stuck (runs never complete, and IIUC never get reported to the buildfarm)

For #3, one thing I had done earlier (and then reverted) was to set the
'wait_timeout' from the current undef to 2 hours. I'll set it again to
2 hours in hopes that #3 starts getting reported to the buildfarm too.

> I'm not that familiar with the buildfarm config, but I do see some
> Valgrind related settings in there. Is PostgreSQL running under
> Valgrind on these runs?

Not yet. I was tempted, but valgrind has not yet been enabled on this
member. IIUC it's disabled by default:

  'use_valgrind' => undef,

- robins
Re: Why is parula failing?
On Mon, 15 Apr 2024 at 16:10, Robins Tharakan wrote:
> - I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
> - I'll keep them (stuck) for this week, in case there's more we can get
>   from them (and to see how long they take)
> - Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt -
>   REL_16_STABLE)

Thanks for getting those.

#4  0x0090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
        delay = <optimized out>
        delay_ms = <optimized out>
        endtime = 0

This endtime looks like a problem. It seems unlikely to be caused by
gettimeofday's timeval fields being zeroed, given that the number of
seconds should have been added to that.

I can't quite make sense of how we end up sleeping at all with a zero
endtime. Assuming the subsequent GetNowFloat() calls worked, "delay =
endtime - GetNowFloat();" would result in a negative sleep duration
and we'd break out of the sleep loop.

If GetNowFloat() somehow was returning a negative number then we could
end up with a large delay. But if gettimeofday() was so badly broken
then wouldn't there be some evidence of this in the log timestamps on
failing runs?

I'm not that familiar with the buildfarm config, but I do see some
Valgrind related settings in there. Is PostgreSQL running under
Valgrind on these runs?
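For reference, the loop being discussed has roughly this shape
(paraphrasing pg_sleep() in src/backend/utils/adt/misc.c; exact line
numbers and comments vary a little between branches, so treat this as a
sketch rather than the verbatim source):

Datum
pg_sleep(PG_FUNCTION_ARGS)
{
	float8		secs = PG_GETARG_FLOAT8(0);
	float8		endtime;

	/* current time as float8 seconds since epoch */
#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)

	endtime = GetNowFloat() + secs;

	for (;;)
	{
		float8		delay;
		long		delay_ms;

		CHECK_FOR_INTERRUPTS();

		delay = endtime - GetNowFloat();
		if (delay >= 600.0)
			delay_ms = 600000;	/* sleep in chunks of at most 10 minutes */
		else if (delay > 0.0)
			delay_ms = (long) ceil(delay * 1000.0);
		else
			break;				/* non-positive delay ends the loop */

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 delay_ms,
						 WAIT_EVENT_PG_SLEEP);
		ResetLatch(MyLatch);
	}

	PG_RETURN_VOID();
}

The only branch that can produce the 600000 ms WaitLatch timeout we
keep seeing in these traces is the "delay >= 600.0" one, i.e. a
computed remaining delay of at least 10 minutes for a 0.1 second sleep.
With endtime stuck at zero, that would require GetNowFloat() to be
returning -600 or less, which again points at gettimeofday() or the
code around it being miscompiled.

David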
Re: Why is parula failing?
On Sun, 14 Apr 2024 at 00:12, Tom Lane wrote:
> If we were only supposed to sleep 0.1 seconds, how is it waiting
> for 600000 ms (and, presumably, repeating that)? The logic in
> pg_sleep is pretty simple, and it's hard to think of anything except
> the system clock jumping (far) backwards that would make this
> happen. Any chance of extracting the local variables from the
> pg_sleep stack frame?

- I now have 2 separate runs stuck on pg_sleep() - HEAD / REL_16_STABLE
- I'll keep them (stuck) for this week, in case there's more we can get
  from them (and to see how long they take)
- Attached are 'bt full' outputs for both (b.txt - HEAD / a.txt - REL_16_STABLE)

A few things to add:
- To reiterate, this instance has gcc v13.2 compiled without any flags
  (my first time ever, TBH). IIRC 'make -k check' came out okay, so at
  this point I don't think I did something obviously wrong when
  building gcc from git.
- I installed gcc v14.0.1 experimental on massasauga (also an aarch64,
  and built from git) and despite multiple runs, it seems to be doing
  okay [1].
- Next week (if I'm still scratching my head - and unless someone
  advises otherwise), I'll upgrade parula to gcc 14 experimental to see
  if this is about gcc maturity on Graviton (for some reason). I don't
  expect much to come out of it though (given Tomas's testing on rpi5,
  but it doesn't hurt).

Ref:
1. https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=massasauga&br=REL_12_STABLE

- robins

[postgres@ip-172-31-18-25 ~]$ pstack 26147
#0  0xadeda954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00842888 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=600000, set=0x3148fac0) at latch.c:1570
#2  WaitEventSetWait (set=0x3148fac0, timeout=timeout@entry=600000, occurred_events=occurred_events@entry=0xd1194748, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x00842c44 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=600000, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090b7b4 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x00698430 in ExecInterpExpr (state=0x316a6040, econtext=0x316a5e38, isnull=<optimized out>) at execExprInterp.c:764
#6  0x006d0898 in ExecEvalExprSwitchContext (isNull=0xd11948bf, econtext=0x316a5e38, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x006b92ec in ExecProcNode (node=0x316a5d28) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x316a5b38) at nodeGather.c:287
#11 ExecGather (pstate=0x316a5b38) at nodeGather.c:222
#12 0x0069c36c in ExecProcNode (node=0x316a5b38) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x31641e90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x316a5b38, estate=0x316a5910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x316459c0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x00871564 in PortalRunSelect (portal=portal@entry=0x31512fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x31641e90) at pquery.c:924
#16 0x00872d80 in PortalRun (portal=portal@entry=0x31512fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x31641e90, altdest=altdest@entry=0x31641e90, qc=qc@entry=0xd1194c70) at pquery.c:768
#17 0x0086ea54 in exec_simple_query (query_string=query_string@entry=0x31493c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x0086f590 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x0086ab20 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:105
#20 0x007c54d8 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xd1195138 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xd1195140) at launch_backend.c:265
#21 0x007c8ec0 in BackendStartup (client_sock=0xd1195140) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cab68 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x3148f320) at postmaster.c:1372
#24 0x00496cb8 in main (argc=8, argv=0x3148f320) at main.c:197
[postgres@ip-172-31-18-25 ~]$

 2072 root     20  0  117M  4376 3192 S 0.0 0.0 0:00.00 │  └─ /usr/sbin/CROND -n
 2087 postgres 20  0 20988  6496 5504 S 0.0 0.0 0:00.00 │     ├─ /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f postgres
 2092 postgres 20  0 20960  6328 5336 S 0.0 0.0 0:00.00 │     │  └─ /usr/sbin/postdrop -r
 2074 postgres 20  0  111M  2660 2488 S 0.0 0.0 0:00.00 │     └─ /bin/sh -c cd
Re: Why is parula failing?
On 4/13/24 15:02, Robins Tharakan wrote:
> On Wed, 10 Apr 2024 at 10:24, David Rowley wrote:
>>
>> Master failed today for the first time since the compiler upgrade.
>> Again reltuples == 48.
>
> Here's what I can add over the past few days:
> - Almost all failures are either reltuples=48 or SIGABRTs
> - Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS
> - A little too coincidental? Recent crashes have stack-traces if
>   interested.
>
> Barring the initial failures (during the move to gcc 13.2), in the past week:
> - v15 somehow hasn't had a failure yet
> - v14 / v16 have got only 1 failure each
> - but v12 / v13 are lit up - failed multiple times.

I happened to have an unused rpi5, so I installed Armbian aarch64 with
gcc 13.2.0, built with exactly the same configure options as parula,
and did ~300 loops of "make check" without a single failure.

So either parula is packaged in some different way, or maybe it's more
of a timing issue and the rpi5 is way slower than Graviton3.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Why is parula failing?
On 4/9/24 05:48, David Rowley wrote:
> On Mon, 8 Apr 2024 at 23:56, Robins Tharakan wrote:
>> #3  0x0083ed84 in WaitLatch (latch=<optimized out>,
>> wakeEvents=wakeEvents@entry=41, timeout=600000,
>> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
>> #4  0x00907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
>
>> #17 0x0086a944 in exec_simple_query
>> (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at
>> postgres.c:1274
>
> I have no idea why WaitLatch has timeout=600000. That should be no
> higher than timeout=100 for "SELECT pg_sleep(0.1);". I have no
> theories aside from a failing RAM module, a cosmic ray, or a well-timed
> clock change between the first call to gettimeofday() in pg_sleep()
> and the next one.
>
> I know this animal is running debug_parallel_query = regress, so that
> 0.1 Const did have to get serialized and copied to the worker, so
> there's another opportunity for the sleep duration to be stomped on,
> but that seems pretty unlikely.

AFAIK that GUC is set only for HEAD, so it would not explain the
failures on the other branches.

> I can't think of a reason why the erroneous reltuples=48 would be
> consistent over 2 failing runs if it were failing RAM or a cosmic ray.

Yeah, that seems very unlikely.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Why is parula failing?
Robins Tharakan writes:
> HEAD is stuck again on pg_sleep(), no CPU for the past hour or so.
> Stack trace seems to be similar to last time.

> #3  0x008437c4 in WaitLatch (latch=<optimized out>,
> wakeEvents=wakeEvents@entry=41, timeout=600000,
> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
> #4  0x0090c384 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
> ...
> #17 0x0086f5d4 in exec_simple_query
> (query_string=query_string@entry=0x3b391c90
> "SELECT pg_sleep(0.1);") at postgres.c:1274

If we were only supposed to sleep 0.1 seconds, how is it waiting
for 600000 ms (and, presumably, repeating that)? The logic in
pg_sleep is pretty simple, and it's hard to think of anything except
the system clock jumping (far) backwards that would make this
happen. Any chance of extracting the local variables from the
pg_sleep stack frame?

regards, tom lane
Re: Why is parula failing?
On Mon, 8 Apr 2024 at 21:25, Robins Tharakan wrote:
>
> I'll keep an eye on this instance more often for the next few days.
> (Let me know if I could capture more if a run gets stuck again)

HEAD is stuck again on pg_sleep(), no CPU for the past hour or so.
Stack trace seems to be similar to last time.

$ pstack 24930
#0  0xb8280954 in epoll_pwait () from /lib64/libc.so.6
#1  0x00843408 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=600000, set=0x3b38dac0) at latch.c:1570
#2  WaitEventSetWait (set=0x3b38dac0, timeout=timeout@entry=600000, occurred_events=occurred_events@entry=0xfd1d66c8, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x008437c4 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=600000, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x0090c384 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x00699350 in ExecInterpExpr (state=0x3b5a41a0, econtext=0x3b5a3f98, isnull=<optimized out>) at execExprInterp.c:764
#6  0x006d1668 in ExecEvalExprSwitchContext (isNull=0xfd1d683f, econtext=0x3b5a3f98, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x006ba26c in ExecProcNode (node=0x3b5a3e88) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x3b5a3c98) at nodeGather.c:287
#11 ExecGather (pstate=0x3b5a3c98) at nodeGather.c:222
#12 0x0069d28c in ExecProcNode (node=0x3b5a3c98) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x3b5ae8e0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x3b5a3c98, estate=0x3b5a3a70) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x3b59c250, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x008720e4 in PortalRunSelect (portal=portal@entry=0x3b410fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x3b5ae8e0) at pquery.c:924
#16 0x00873900 in PortalRun (portal=portal@entry=0x3b410fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x3b5ae8e0, altdest=altdest@entry=0x3b5ae8e0, qc=qc@entry=0xfd1d6bf0) at pquery.c:768
#17 0x0086f5d4 in exec_simple_query (query_string=query_string@entry=0x3b391c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x00870110 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x0086b6a0 in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:105
#20 0x007c6268 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xfd1d70b8 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xfd1d70c0) at launch_backend.c:265
#21 0x007c9c50 in BackendStartup (client_sock=0xfd1d70c0) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007cb8f8 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x3b38d320) at postmaster.c:1372
#24 0x00496e18 in main (argc=8, argv=0x3b38d320) at main.c:197

 CPU% MEM%   TIME+  Command
  0.0  0.0  0:00.00 │  └─ /bin/sh -c cd /opt/postgres/build-farm-14 && PATH=/opt/gcc/home/ec2-user/proj/gcc/target/bin/
  0.0  0.1  0:00.07 │     └─ /usr/bin/perl ./run_build.pl --config=build-farm.conf HEAD --verbose
  0.0  0.0  0:00.00 │        └─ sh -c { cd pgsql.build/src/test/regress && make NO_LOCALE=1 check; echo $? > /opt/postg
  0.0  0.0  0:00.00 │           └─ make NO_LOCALE=1 check
  0.0  0.0  0:00.00 │              └─ /bin/sh -c echo "# +++ regress check in src/test/regress +++" && PATH="/opt/postg
  0.0  0.0  0:00.10 │                 └─ ../../../src/test/regress/pg_regress --temp-instance=./tmp_check --inputdir=.
  0.0  0.0  0:00.01 │                    ├─ psql -X -a -q -d regression -v HIDE_TABLEAM=on -v HIDE_TOAST_COMPRESSION=on
  0.0  0.1  0:02.64 │                    └─ postgres -D /opt/postgres/build-farm-14/buildroot/HEAD/pgsql.build/src/test
  0.0  0.2  0:00.05 │                       ├─ postgres: postgres regression [local] SELECT
  0.0  0.0  0:00.06 │                       ├─ postgres: logical replication launcher
  0.0  0.1  0:00.36 │                       ├─ postgres: autovacuum launcher
  0.0  0.1  0:00.34 │                       ├─ postgres: walwriter
  0.0  0.0  0:00.32 │                       ├─ postgres: background writer
  0.0  0.3  0:00.05 │                       └─ postgres: checkpointer

- robins
Re: Why is parula failing?
On Wed, 10 Apr 2024 at 10:24, David Rowley wrote:
>
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

Here's what I can add over the past few days:
- Almost all failures are either reltuples=48 or SIGABRTs
- Almost all SIGABRTs are DDLs - CREATE INDEX / CREATE AGGREGATEs / CTAS
- A little too coincidental? Recent crashes have stack-traces if interested.

Barring the initial failures (during the move to gcc 13.2), in the past week:
- v15 somehow hasn't had a failure yet
- v14 / v16 have got only 1 failure each
- but v12 / v13 are lit up - failed multiple times.

- robins
Re: Why is parula failing?
On Wed, 10 Apr 2024 at 10:24, David Rowley wrote:
> Master failed today for the first time since the compiler upgrade.
> Again reltuples == 48.

From the buildfarm members page, parula seems to be the only aarch64 +
gcc 13.2 combination today, which makes me suspect gcc v13.2's maturity
on aarch64. I'll try to upgrade one of the other aarch64s I have
(massasauga or snakefly) and see if this is more about gcc 13.2's
maturity on this architecture.

- robins
Re: Why is parula failing?
On Tue, 9 Apr 2024 at 15:48, David Rowley wrote:
> Still no partition_prune failures on master since the compiler version
> change. There has been one [1] in REL_16_STABLE. I'm thinking it
> might be worth backpatching the partition_prune debug to REL_16_STABLE
> to see if we can learn anything new from it.

Master failed today for the first time since the compiler upgrade.
Again reltuples == 48.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-04-10%2000%3A27%3A02

David
Re: Why is parula failing?
On Mon, 8 Apr 2024 at 23:56, Robins Tharakan wrote:
> #3  0x0083ed84 in WaitLatch (latch=<optimized out>,
> wakeEvents=wakeEvents@entry=41, timeout=600000,
> wait_event_info=wait_event_info@entry=150994946) at latch.c:538
> #4  0x00907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406

> #17 0x0086a944 in exec_simple_query
> (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at
> postgres.c:1274

I have no idea why WaitLatch has timeout=600000. That should be no
higher than timeout=100 for "SELECT pg_sleep(0.1);". I have no
theories aside from a failing RAM module, a cosmic ray, or a well-timed
clock change between the first call to gettimeofday() in pg_sleep()
and the next one.

I know this animal is running debug_parallel_query = regress, so that
0.1 Const did have to get serialized and copied to the worker, so
there's another opportunity for the sleep duration to be stomped on,
but that seems pretty unlikely.

I can't think of a reason why the erroneous reltuples=48 would be
consistent over 2 failing runs if it were failing RAM or a cosmic ray.

Still no partition_prune failures on master since the compiler version
change. There has been one [1] in REL_16_STABLE. I'm thinking it
might be worth backpatching the partition_prune debug to REL_16_STABLE
to see if we can learn anything new from it.

David

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-04-08%2002%3A12%3A02
Re: Why is parula failing?
On Tue, 2 Apr 2024 at 15:01, Tom Lane wrote:
> "Tharakan, Robins" writes:
> > So although HEAD ran fine, I saw multiple failures (v12, v13, v16),
> > all of which passed on subsequent tries, of which some were even
> > "signal 6: Aborted".
>
> Ugh...

parula didn't send any reports to the buildfarm for the past 44 hours.
Logged in to see that postgres was stuck on pg_sleep(), which was quite
odd! I captured the backtrace and triggered another run on HEAD, which
came out okay. I'll keep an eye on this instance more often for the
next few days. (Let me know if I could capture more if a run gets stuck
again.)

(gdb) bt
#0  0x952ae954 in epoll_pwait () from /lib64/libc.so.6
#1  0x0083e9c8 in WaitEventSetWaitBlock (nevents=1, occurred_events=<optimized out>, cur_timeout=297992, set=0x2816dac0) at latch.c:1570
#2  WaitEventSetWait (set=0x2816dac0, timeout=timeout@entry=600000, occurred_events=occurred_events@entry=0xc395ed28, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=150994946) at latch.c:1516
#3  0x0083ed84 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=600000, wait_event_info=wait_event_info@entry=150994946) at latch.c:538
#4  0x00907404 in pg_sleep (fcinfo=<optimized out>) at misc.c:406
#5  0x00696b10 in ExecInterpExpr (state=0x28384040, econtext=0x28383e38, isnull=<optimized out>) at execExprInterp.c:764
#6  0x006ceef8 in ExecEvalExprSwitchContext (isNull=0xc395ee9f, econtext=0x28383e38, state=<optimized out>) at ../../../src/include/executor/executor.h:356
#7  ExecProject (projInfo=<optimized out>) at ../../../src/include/executor/executor.h:390
#8  ExecResult (pstate=<optimized out>) at nodeResult.c:135
#9  0x006b7aec in ExecProcNode (node=0x28383d28) at ../../../src/include/executor/executor.h:274
#10 gather_getnext (gatherstate=0x28383b38) at nodeGather.c:287
#11 ExecGather (pstate=0x28383b38) at nodeGather.c:222
#12 0x0069aa4c in ExecProcNode (node=0x28383b38) at ../../../src/include/executor/executor.h:274
#13 ExecutePlan (execute_once=<optimized out>, dest=0x2831ffb0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x28383b38, estate=0x28383910) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x283239c0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#15 0x0086d454 in PortalRunSelect (portal=portal@entry=0x281f0fb0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x2831ffb0) at pquery.c:924
#16 0x0086ec70 in PortalRun (portal=portal@entry=0x281f0fb0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x2831ffb0, altdest=altdest@entry=0x2831ffb0, qc=qc@entry=0xc395f250) at pquery.c:768
#17 0x0086a944 in exec_simple_query (query_string=query_string@entry=0x28171c90 "SELECT pg_sleep(0.1);") at postgres.c:1274
#18 0x0086b480 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4680
#19 0x00866a0c in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:101
#20 0x007c1738 in postmaster_child_launch (child_type=child_type@entry=B_BACKEND, startup_data=startup_data@entry=0xc395f718 "", startup_data_len=startup_data_len@entry=4, client_sock=client_sock@entry=0xc395f720) at launch_backend.c:265
#21 0x007c5120 in BackendStartup (client_sock=0xc395f720) at postmaster.c:3593
#22 ServerLoop () at postmaster.c:1674
#23 0x007c6dc8 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x2816d320) at postmaster.c:1372
#24 0x00496bb8 in main (argc=8, argv=0x2816d320) at main.c:197

> The update_personality.pl script in the buildfarm client distro
> is what to use to adjust OS version or compiler version data.

Thanks. Fixed that.

- robins
Re: Why is parula failing?
"Tharakan, Robins" writes: >> I've now switched to GCC v13.2 and triggered a run. Let's see if the tests >> stabilize now. > So although HEAD ran fine, but I saw multiple failures (v12, v13, v16) all of > which passed on subsequent-tries, > of which some were even"signal 6: Aborted". Ugh... > Also, it'd be great if someone could point me to a way to update the > "Compiler" section in "System Detail" on > the buildfarm page (it wrongly shows GCC as v7.3.1). The update_personality.pl script in the buildfarm client distro is what to use to adjust OS version or compiler version data. regards, tom lane
RE: Why is parula failing?
> I've now switched to GCC v13.2 and triggered a run. Let's see if the
> tests stabilize now.

So although HEAD ran fine, I saw multiple failures (v12, v13, v16),
all of which passed on subsequent tries, of which some were even
"signal 6: Aborted". FWIW, I compiled gcc v13.2 (default options) from
source, which IIUC shouldn't be to blame.

Two other possible reasons could be that the buildfarm doesn't have an
aarch64 + gcc 13.2 combination (on a quick check I couldn't see any),
or that this hardware is flaky. Either way, this instance is a
throw-away, so let me know if this isn't helping; I'll swap it out in
case there's not much benefit to be had.

Also, it'd be great if someone could point me to a way to update the
"Compiler" section in "System Detail" on the buildfarm page (it
wrongly shows GCC as v7.3.1).

- thanks
robins
RE: Why is parula failing?
> ... in connection with which, I can't help noticing that parula is
> using a very old compiler:
>
> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>
> From some quick checking around, that would have to be near the
> beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
> to a primary architecture until earlier that same year). It's not
> exactly hard to believe that there were some lingering compiler bugs.
> I wonder why parula is using that when its underlying system seems
> markedly newer (the kernel at least has a recent build date).

Parula used GCC v7.3.1 since that's what came by default. I've now
switched to GCC v13.2 and triggered a run. Let's see if the tests
stabilize now.

- Robins
Re: Why is parula failing?
David Rowley writes:
> On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote:
>> ... in connection with which, I can't help noticing that parula
>> is using a very old compiler:
>> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>> I wonder why parula is using that when its underlying system seems
>> markedly newer (the kernel at least has a recent build date).

> It could be, but wouldn't the bug have to relate to the locking code
> and be caused by some other backend stomping on the memory?
> Otherwise, shouldn't it be failing consistently every run rather than
> sporadically?

Your guess is as good as mine ... but we still haven't seen this class
of failure on any other animal, so the idea that it's strictly a
chance timing issue is getting weaker all the time.

regards, tom lane
Re: Why is parula failing?
On Sat, 30 Mar 2024 at 09:17, Tom Lane wrote:
>
> I wrote:
> > I'd not looked closely enough at the previous failure, because
> > now that I have, this is well out in WTFF territory: how can
> > reltuples be greater than zero when relpages is zero? This can't
> > be a state that autovacuum would have left behind, unless it's
> > really seriously broken. I think we need to be looking for
> > explanations like "memory stomp" or "compiler bug".
>
> ... in connection with which, I can't help noticing that parula
> is using a very old compiler:
>
> configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
>
> From some quick checking around, that would have to be near the
> beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
> to a primary architecture until earlier that same year). It's not
> exactly hard to believe that there were some lingering compiler bugs.
> I wonder why parula is using that when its underlying system seems
> markedly newer (the kernel at least has a recent build date).

It could be, but wouldn't the bug have to relate to the locking code
and be caused by some other backend stomping on the memory?
Otherwise, shouldn't it be failing consistently every run rather than
sporadically?

David
Re: Why is parula failing?
I wrote:
> I'd not looked closely enough at the previous failure, because
> now that I have, this is well out in WTFF territory: how can
> reltuples be greater than zero when relpages is zero? This can't
> be a state that autovacuum would have left behind, unless it's
> really seriously broken. I think we need to be looking for
> explanations like "memory stomp" or "compiler bug".

... in connection with which, I can't help noticing that parula
is using a very old compiler:

configure: using compiler=gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)

From some quick checking around, that would have to be near the
beginning of aarch64 support in RHEL (Fedora hadn't promoted aarch64
to a primary architecture until earlier that same year). It's not
exactly hard to believe that there were some lingering compiler bugs.
I wonder why parula is using that when its underlying system seems
markedly newer (the kernel at least has a recent build date).

regards, tom lane
Re: Why is parula failing?
David Rowley writes:
> On Wed, 27 Mar 2024 at 18:28, Tom Lane wrote:
>> Let's wait a bit to see if it fails in HEAD ... but if not, would
>> it be reasonable to back-patch the additional debugging output?

> I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
> I'm uncertain what a few more failures in master will tell us aside
> from whether reltuples == 48 is consistent or if that value is going
> to fluctuate.
> Let's give it a week and see if it fails a few more times.

We have another failure today [1] with the same symptom:

 ab_a2          |        0 |        -1 |            |                0 |                 0
-ab_a2_b1       |        0 |        -1 |            |                0 |                 0
+ab_a2_b1       |        0 |        48 |            |                0 |                 0
 ab_a2_b1_a_idx |        1 |         0 | t          |                  |

Different table, same "48" reltuples. But I have to confess that I'd
not looked closely enough at the previous failure, because now that I
have, this is well out in WTFF territory: how can reltuples be greater
than zero when relpages is zero? This can't be a state that autovacuum
would have left behind, unless it's really seriously broken. I think
we need to be looking for explanations like "memory stomp" or
"compiler bug".

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-29%2012%3A46%3A02
Re: Why is parula failing?
On Wed, 27 Mar 2024 at 18:28, Tom Lane wrote:
>
> David Rowley writes:
> > Unfortunately, REL_16_STABLE does not have the additional debugging,
> > so we don't get to know what reltuples was set to.
>
> Let's wait a bit to see if it fails in HEAD ... but if not, would
> it be reasonable to back-patch the additional debugging output?

I think REL_16_STABLE has told us that it's not an auto-vacuum issue.
I'm uncertain what a few more failures in master will tell us aside
from whether reltuples == 48 is consistent or if that value is going
to fluctuate.

Let's give it a week and see if it fails a few more times.

David
Re: Why is parula failing?
David Rowley writes:
> Unfortunately, REL_16_STABLE does not have the additional debugging,
> so we don't get to know what reltuples was set to.

Let's wait a bit to see if it fails in HEAD ... but if not, would
it be reasonable to back-patch the additional debugging output?

regards, tom lane
Re: Why is parula failing?
On Tue, 26 Mar 2024 at 21:03, Tharakan, Robins wrote:
>
> > David Rowley writes:
> > It would be good to have log_autovacuum_min_duration = 0 on this
> > machine for a while.
>
> - Have set log_autovacuum_min_duration=0 on parula and a test run came
>   out okay.
> - Also added REL_16_STABLE to the branches being tested (in case it
>   matters here).

Thanks for doing that. I see REL_16_STABLE has failed twice already
with the same issue. I don't see any autovacuum / autoanalyze in the
log, so I guess that rules out autovacuum activity causing this.

Unfortunately, REL_16_STABLE does not have the additional debugging,
so we don't get to know what reltuples was set to.

David
RE: Why is parula failing?
Hi David / Tom,

> David Rowley writes:
> It would be good to have log_autovacuum_min_duration = 0 on this
> machine for a while.

- Have set log_autovacuum_min_duration=0 on parula and a test run came out okay.
- Also added REL_16_STABLE to the branches being tested (in case it matters here).

Let me know if I can help with any other changes here.

- Robins | tharar@ | adelaide@australia
Re: Why is parula failing?
On Thu, 21 Mar 2024 at 14:19, Tom Lane wrote:
>
> David Rowley writes:
> > We could also do something like the attached just in case we're
> > barking up the wrong tree.
>
> Yeah, checking indisvalid isn't a bad idea. I'd put another
> one further down, just before the DROP of table ab, so we
> can see the state both before and after the unstable tests.

So it's taken quite a while to finally fail again. Effectively, we're getting:

  relname  | relpages | reltuples | indisvalid | autovacuum_count | autoanalyze_count
-----------+----------+-----------+------------+------------------+-------------------
- ab_a2_b2 |        0 |        -1 |            |                0 |                 0
+ ab_a2_b2 |        0 |        48 |            |                0 |                 0

I see AddNewRelationTuple() does set reltuples to -1, so I can't quite
figure out why 48 is in there. Even if auto-analyze had somehow
mistakenly run and the autoanalyze_count stats just were not up-to-date
yet, the table has zero blocks, and I don't see how
acquire_sample_rows() would set *totalrows to anything other than 0.0
in this case.

For the vacuum case, I see that reltuples is set from:

/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
                                                 vacrel->scanned_pages,
                                                 vacrel->live_tuples);

Again, it's hard to see how that could come to anything other than
zero given that rel_pages and scanned_pages should be 0.

Looking at the binary representation of a float of -1 vs 48, they're
not nearly the same: 0xBF800000 vs 0x42400000, so it's not looking
like a flipped bit.

It would be good to have log_autovacuum_min_duration = 0 on this
machine for a while.
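For anyone who wants to double-check those bit patterns, here's a tiny
standalone program (just a sketch, nothing PostgreSQL-specific) that
prints them:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int
main(void)
{
	/* the two pg_class.reltuples values at issue: expected vs observed */
	float		vals[] = {-1.0f, 48.0f};

	for (int i = 0; i < 2; i++)
	{
		uint32_t	bits;

		/* copy the float's bytes into an integer to inspect its bits */
		memcpy(&bits, &vals[i], sizeof(bits));
		printf("%5.1f -> 0x%08X\n", vals[i], (unsigned) bits);
	}
	return 0;
}

That prints 0xBF800000 for -1.0 and 0x42400000 for 48.0, confirming
the two representations differ in many bits, not just one.

David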
Re: Why is parula failing?
David Rowley writes:
> We could also do something like the attached just in case we're
> barking up the wrong tree.

Yeah, checking indisvalid isn't a bad idea. I'd put another
one further down, just before the DROP of table ab, so we
can see the state both before and after the unstable tests.

regards, tom lane
Re: Why is parula failing?
On Thu, 21 Mar 2024 at 12:36, Tom Lane wrote:
> So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
> that would yield a clue.

FWIW, I agree with your earlier statement about it looking very much
like auto-vacuum has run on that table, but equally, if something like
the pg_index record was damaged we could get the same plan change.

We could also do something like the attached just in case we're
barking up the wrong tree.

David

diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7ca98397ae..f6950e160d 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2449,6 +2449,37 @@ create index ab_a3_b3_a_idx on ab_a3_b3 (a);
 set enable_hashjoin = 0;
 set enable_mergejoin = 0;
 set enable_memoize = 0;
+-- Temporarily install some debugging to investigate plan instability.
+select c.relname,c.relpages,c.reltuples,i.indisvalid,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+left join pg_index i on c.oid = i.indexrelid
+where c.relname like 'ab\_%' order by c.relname;
+    relname     | relpages | reltuples | indisvalid | autovacuum_count | autoanalyze_count
+----------------+----------+-----------+------------+------------------+-------------------
+ ab_a1          |        0 |        -1 |            |                0 |                 0
+ ab_a1_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a1_b1_a_idx |        1 |         0 | t          |                  |
+ ab_a1_b2       |        0 |        -1 |            |                0 |                 0
+ ab_a1_b2_a_idx |        1 |         0 | t          |                  |
+ ab_a1_b3       |        0 |        -1 |            |                0 |                 0
+ ab_a1_b3_a_idx |        1 |         0 | t          |                  |
+ ab_a2          |        0 |        -1 |            |                0 |                 0
+ ab_a2_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b1_a_idx |        1 |         0 | t          |                  |
+ ab_a2_b2       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b2_a_idx |        1 |         0 | t          |                  |
+ ab_a2_b3       |        0 |        -1 |            |                0 |                 0
+ ab_a2_b3_a_idx |        1 |         0 | t          |                  |
+ ab_a3          |        0 |        -1 |            |                0 |                 0
+ ab_a3_b1       |        0 |        -1 |            |                0 |                 0
+ ab_a3_b1_a_idx |        1 |         0 | t          |                  |
+ ab_a3_b2       |        0 |        -1 |            |                0 |                 0
+ ab_a3_b2_a_idx |        1 |         0 | t          |                  |
+ ab_a3_b3       |        0 |        -1 |            |                0 |                 0
+ ab_a3_b3_a_idx |        1 |         0 | t          |                  |
+(21 rows)
+
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(0, 0, 1)');
 explain_parallel_append
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index a09b27d820..3ce90f2d89 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -607,6 +607,13 @@ set enable_hashjoin = 0;
 set enable_mergejoin = 0;
 set enable_memoize = 0;
 
+-- Temporarily install some debugging to investigate plan instability.
+select c.relname,c.relpages,c.reltuples,i.indisvalid,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+left join pg_index i on c.oid = i.indexrelid
+where c.relname like 'ab\_%' order by c.relname;
+
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(0, 0, 1)');
 
 -- Ensure the same partitions are pruned when we make the nested loop
Re: Why is parula failing?
David Rowley writes:
> Is it worth running that animal with log_autovacuum_min_duration = 0
> so we can see what's going on in terms of auto-vacuum / auto-analyze
> in the log?

Maybe, but I'm not sure. I thought that if parula were somehow hitting
an ill-timed autovacuum/autoanalyze, it should be possible to make that
reproducible by inserting "pg_sleep(60)" or so in the test script, to
give the autovacuum daemon plenty of time to come around and do the
dirty deed. No luck though --- the results didn't change for me. So
now I'm not sure what is going on. Perhaps though it is autovacuum,
and there's some environment-specific enabling condition that parula
has and my machine doesn't (which could also help explain why no other
animal is doing this).

So yeah, if we could have log_autovacuum_min_duration = 0 perhaps
that would yield a clue.
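If someone wants to hunt for that enabling condition directly, the
first thing I'd check is whether anything is accumulating against the
autoanalyze threshold --- that is, whether n_mod_since_analyze is
anywhere near autovacuum_analyze_threshold plus
autovacuum_analyze_scale_factor times reltuples. A sketch of a query
for eyeballing that on the animal (to be run while the regression
database still exists):

select s.relname, s.n_mod_since_analyze, s.n_dead_tup, c.reltuples,
       current_setting('autovacuum_analyze_threshold') as analyze_threshold,
       current_setting('autovacuum_analyze_scale_factor') as analyze_scale
from pg_stat_all_tables s
join pg_class c on c.oid = s.relid
where s.relname like 'ab\_%'
order by s.relname;

If those counters are all zero and the tables still get analyzed,
that would be a clue of a different sort.

regards, tom lane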
Re: Why is parula failing?
On Wed, 20 Mar 2024 at 08:58, Tom Lane wrote:
> I suppose we could attach "autovacuum=off" settings to these tables,
> but it doesn't seem to me that that should be necessary. These test
> cases are several years old and haven't given trouble before.
> Moreover, if that's necessary then there are a lot of other regression
> tests that would presumably need the same treatment.

Is it worth running that animal with log_autovacuum_min_duration = 0
so we can see what's going on in terms of auto-vacuum / auto-analyze
in the log?

David
Re: Why is parula failing?
On Wed, 20 Mar 2024 at 11:50, Matthias van de Meent wrote:
>
> On Tue, 19 Mar 2024 at 20:58, Tom Lane wrote:
> >
> > For the last few days, buildfarm member parula has been intermittently
> > failing the partition_prune regression test, due to unexpected plan
> > changes [1][2][3][4]. The symptoms can be reproduced exactly by
> > inserting a "vacuum" of one or another of the partitions of table
> > "ab", so we can presume that the underlying cause is an autovacuum run
> > against one of those tables. But why is that happening? None of
> > those tables receive any insertions during the test, so I don't
> > understand why autovacuum would trigger on them.
>
> This may be purely coincidental, but yesterday I also did have a
> seemingly random failure in the recovery test suite locally, in
> t/027_stream_regress.pl, where it changed the join order of exactly
> one of the queries (that uses the tenk table multiple times, IIRC 3x
> or so).
[...]
> Sadly, I did not save the output of that test run, so this is just
> about all the information I have. The commit I was testing on was
> based on ca108be7, and system config is available if needed.

It looks like Cirrus CI reproduced my issue with recent commit a0390f6c
[0], and previously also with 4665cebc [1], 875e46a0 [2], 449e798c [3],
and other older commits, on both the Windows Server 2019 build and
Debian for a39f1a36 (with slightly different plan changes on this
Debian run). That should rule out most of the environments.

After analyzing the logs produced by the various primaries, I can't
find a good explanation why they would have this issue. The table is
vacuum-analyzed before the regression tests start, and in this run
autovacuum/autoanalyze doesn't seem to kick in until (at least) seconds
after this query was run.

Kind regards,

Matthias van de Meent

[0] https://cirrus-ci.com/task/6295909005262848
[1] https://cirrus-ci.com/task/5229745516838912
[2] https://cirrus-ci.com/task/5098544567156736
[3] https://cirrus-ci.com/task/4783906419900416
Re: Why is parula failing?
On Tue, 19 Mar 2024 at 20:58, Tom Lane wrote:
>
> For the last few days, buildfarm member parula has been intermittently
> failing the partition_prune regression test, due to unexpected plan
> changes [1][2][3][4]. The symptoms can be reproduced exactly by
> inserting a "vacuum" of one or another of the partitions of table
> "ab", so we can presume that the underlying cause is an autovacuum run
> against one of those tables. But why is that happening? None of
> those tables receive any insertions during the test, so I don't
> understand why autovacuum would trigger on them.
>
> I suppose we could attach "autovacuum=off" settings to these tables,
> but it doesn't seem to me that that should be necessary. These test
> cases are several years old and haven't given trouble before.
> Moreover, if that's necessary then there are a lot of other regression
> tests that would presumably need the same treatment.
>
> I'm also baffled why this wasn't happening before. I scraped the
> buildfarm logs for 3 months back and confirmed my impression that
> this is a new failure mode. But one of these four runs is on
> REL_14_STABLE, eliminating the theory that the cause is a recent
> HEAD-only change.
>
> Any ideas?

This may be purely coincidental, but yesterday I also did have a
seemingly random failure in the recovery test suite locally, in
t/027_stream_regress.pl, where it changed the join order of exactly
one of the queries (that uses the tenk table multiple times, IIRC 3x
or so). As the work I was doing wasn't related to join ordering, this
surprised me. After checking for test file changes (none), I re-ran
the tests without recompilation and the failure went away, so I
attributed it to an untimely autoanalyze. However, as this was also an
unexpected plan change in the tests, this could be related.

Sadly, I did not save the output of that test run, so this is just
about all the information I have. The commit I was testing on was
based on ca108be7, and system config is available if needed.

-Matthias