To what extent should tests rely on VACUUM ANALYZE?

Alexander Lakhin Thu, 28 Mar 2024 08:00:19 -0700

Hello hackers,

When running multiple 027_stream_regress.pl test instances in parallel
(and with aggressive autovacuum) on a rather slow machine, I encountered
test failures caused by instability of the subselect test, just like the
following failures on the buildfarm:
1) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-03-27%2010%3A16%3A12

--- /home/bf/bf-build/grassquit/HEAD/pgsql/src/test/regress/expected/subselect.out 2024-03-19 22:20:34.435867114 +0000
+++ /home/bf/bf-build/grassquit/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/subselect.out 2024-03-27 10:28:38.185776605 +0000
@@ -2067,16 +2067,16 @@
                    QUERY PLAN
 -------------------------------------------------
  Hash Join
-   Hash Cond: (c.odd = b.odd)
+   Hash Cond: (c.hundred = a.hundred)
    ->  Hash Join
-         Hash Cond: (a.hundred = c.hundred)
-         ->  Seq Scan on tenk1 a
+         Hash Cond: (b.odd = c.odd)
+         ->  Seq Scan on tenk2 b
          ->  Hash
                ->  HashAggregate
                      Group Key: c.odd, c.hundred
                      ->  Seq Scan on tenk2 c
    ->  Hash
-         ->  Seq Scan on tenk2 b
+         ->  Seq Scan on tenk1 a
 (11 rows)

2) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-03-27%2009%3A49%3A38

(That query was added recently (by 9f1337639 from 2024-02-15) and the
failure evidently depends on timing, so the number of failures I could
find on the buildfarm is moderate for now.)

With the subselect test modified as in the attached patch, I could see
what makes the plan change:
-                     ->  Seq Scan on public.tenk2 c (cost=0.00..445.00 rows=10000 width=8)
+                     ->  Seq Scan on public.tenk2 c (cost=0.00..444.95 rows=9995 width=8)

  relname | relpages | reltuples | autovacuum_count | autoanalyze_count
 ---------+----------+-----------+------------------+-------------------
- tenk2   |      345 |     10000 |                0 |                 0
+ tenk2   |      345 |      9995 |                0 |                 0
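
As a rough cross-check of the numbers above, here is a minimal sketch of the
arithmetic. It assumes the default settings seq_page_cost = 1.0 and
cpu_tuple_cost = 0.01 and a scan with no quals. The function name
seqscan_total_cost is hypothetical; it only mirrors the shape of the
planner's sequential-scan estimate and is not the planner code itself.

```c
#include <assert.h>

/* Assumed planner defaults (the seq_page_cost and cpu_tuple_cost settings). */
static const double seq_page_cost = 1.0;
static const double cpu_tuple_cost = 0.01;

/*
 * Hypothetical helper: simplified total cost of a sequential scan without
 * quals, i.e. one seq_page_cost per page plus one cpu_tuple_cost per tuple.
 */
double
seqscan_total_cost(double relpages, double reltuples)
{
	assert(relpages >= 0.0);
	assert(reltuples >= 0.0);

	return relpages * seq_page_cost + reltuples * cpu_tuple_cost;
}
```

With relpages = 345, reltuples = 10000 gives 445.00 and reltuples = 9995
gives 444.95, matching the two plans, so the reltuples change alone appears
to be enough to account for the cost difference.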

Using the trick Thomas proposed in [1] (see my modification attached), I
could reproduce the failure easily on my workstation with no specific
conditions:

2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup LOG: !!!ConditionalLockBufferForCleanup() returning false
2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup CONTEXT: while scanning block 29 of relation "public.tenk2"
2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup STATEMENT:  VACUUM ANALYZE tenk2;
...
  relname | relpages | reltuples | autovacuum_count | autoanalyze_count
 ---------+----------+-----------+------------------+-------------------
- tenk2   |      345 |     10000 |                0 |                 0
+ tenk2   |      345 |      9996 |                0 |                 0
 (1 row)
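
To give a feeling for why the rand() % 100 == 0 injection reproduces the
problem so readily, here is a small sketch. It assumes (my assumption, not
something verified here) that VACUUM calls ConditionalLockBufferForCleanup()
about once per heap page of tenk2, i.e. 345 times, and that each call fails
independently with probability of roughly 1%. The helper name
prob_at_least_one is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: probability that at least one of n independent
 * trials fires, when each trial fires with probability p.
 * Computed as 1 - (1 - p)^n using a plain loop.
 */
double
prob_at_least_one(double p, int n)
{
	double		none = 1.0;	/* probability that no trial has fired so far */

	assert(p >= 0.0 && p <= 1.0);
	assert(n >= 0);

	for (int i = 0; i < n; i++)
		none *= 1.0 - p;

	return 1.0 - none;
}
```

Under those assumptions, prob_at_least_one(0.01, 345) comes out at roughly
0.97, so nearly every run of the modified build would skip at least one page
of tenk2.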

So this looks to me like a possible cause of the failure, and I wonder:
should query-plan checks be immune to such changes, or should the results
of VACUUM ANALYZE be 100% stable?

[1] https://www.postgresql.org/message-id/CA%2BhUKGKYNHmL_DhmVRiidHv6YLAL8jViifwwn2ABY__Y3BCphg%40mail.gmail.com

Best regards,
Alexander

diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 29b11f11aa..6a7bb6b7a9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -2079,6 +2079,32 @@ ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
          ->  Seq Scan on tenk2 b
 (11 rows)
 
+explain (verbose)
+SELECT * FROM tenk1 A INNER JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+                                                                                                                                                                       QUERY PLAN                                                                                                                                                                       
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ Hash Join  (cost=1087.50..13845.00 rows=1000000 width=488)
+   Output: a.unique1, a.unique2, a.two, a.four, a.ten, a.twenty, a.hundred, a.thousand, a.twothousand, a.fivethous, a.tenthous, a.odd, a.even, a.stringu1, a.stringu2, a.string4, b.unique1, b.unique2, b.two, b.four, b.ten, b.twenty, b.hundred, b.thousand, b.twothousand, b.fivethous, b.tenthous, b.odd, b.even, b.stringu1, b.stringu2, b.string4
+   Hash Cond: (c.odd = b.odd)
+   ->  Hash Join  (cost=517.50..2000.00 rows=10000 width=248)
+         Output: a.unique1, a.unique2, a.two, a.four, a.ten, a.twenty, a.hundred, a.thousand, a.twothousand, a.fivethous, a.tenthous, a.odd, a.even, a.stringu1, a.stringu2, a.string4, c.odd
+         Hash Cond: (a.hundred = c.hundred)
+         ->  Seq Scan on public.tenk1 a  (cost=0.00..445.00 rows=10000 width=244)
+               Output: a.unique1, a.unique2, a.two, a.four, a.ten, a.twenty, a.hundred, a.thousand, a.twothousand, a.fivethous, a.tenthous, a.odd, a.even, a.stringu1, a.stringu2, a.string4
+         ->  Hash  (cost=505.00..505.00 rows=1000 width=8)
+               Output: c.odd, c.hundred
+               ->  HashAggregate  (cost=495.00..505.00 rows=1000 width=8)
+                     Output: c.odd, c.hundred
+                     Group Key: c.odd, c.hundred
+                     ->  Seq Scan on public.tenk2 c  (cost=0.00..445.00 rows=10000 width=8)
+                           Output: c.odd, c.hundred
+   ->  Hash  (cost=445.00..445.00 rows=10000 width=244)
+         Output: b.unique1, b.unique2, b.two, b.four, b.ten, b.twenty, b.hundred, b.thousand, b.twothousand, b.fivethous, b.tenthous, b.odd, b.even, b.stringu1, b.stringu2, b.string4
+         ->  Seq Scan on public.tenk2 b  (cost=0.00..445.00 rows=10000 width=244)
+               Output: b.unique1, b.unique2, b.two, b.four, b.ten, b.twenty, b.hundred, b.thousand, b.twothousand, b.fivethous, b.tenthous, b.odd, b.even, b.stringu1, b.stringu2, b.string4
+(19 rows)
+
 -- we can pull up the aggregate sublink into RHS of a left join.
 explain (costs off)
 SELECT * FROM tenk1 A LEFT JOIN tenk2 B
@@ -2103,3 +2129,18 @@ ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                                                Filter: (odd = b.odd)
 (16 rows)
 
+select c.relname,c.relpages,c.reltuples,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+where c.relname = 'tenk2';
+ relname | relpages | reltuples | autovacuum_count | autoanalyze_count 
+---------+----------+-----------+------------------+-------------------
+ tenk2   |      345 |     10000 |                0 |                 0
+(1 row)
+
+select count(*) from tenk2;
+ count 
+-------
+ 10000
+(1 row)
+
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 7c42ebc36f..aef2e9f2c6 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -1015,7 +1015,18 @@ explain (costs off)
 SELECT * FROM tenk1 A INNER JOIN tenk2 B
 ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
 
+explain (verbose)
+SELECT * FROM tenk1 A INNER JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+
 -- we can pull up the aggregate sublink into RHS of a left join.
 explain (costs off)
 SELECT * FROM tenk1 A LEFT JOIN tenk2 B
-ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
\ No newline at end of file
+ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
+
+select c.relname,c.relpages,c.reltuples,s.autovacuum_count,s.autoanalyze_count
+from pg_class c
+left join pg_stat_all_tables s on c.oid = s.relid
+where c.relname = 'tenk2';
+
+select count(*) from tenk2;

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..25f7327a82 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -58,6 +58,7 @@
 #include "utils/resowner.h"
 #include "utils/timestamp.h"
 
+#include "tcop/tcopprot.h"
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -5041,6 +5042,8 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 
 	Assert(BufferIsValid(buffer));
 
+if (debug_query_string && (strcmp(debug_query_string, "VACUUM ANALYZE tenk2;") == 0) && (rand() % 100 == 0))
+{ elog(LOG, "!!!ConditionalLockBufferForCleanup() returning false"); return false; }
 	if (BufferIsLocal(buffer))
 	{
 		refcount = LocalRefCount[-buffer - 1];
