Re: stats test intermittent failure
Andres Freund writes:
> I integrated the suggested change of the comment and tweaked it a bit
> more. And finally pushed the fix.

This failure was originally seen on v16 (that is, pre-fork). Shouldn't
the fix be back-patched?

			regards, tom lane
Re: stats test intermittent failure
Hi,

On 2023-07-31 21:03:07 +0900, Masahiko Sawada wrote:
> Regarding the patch, I have a comment:
>
> -- Test that reuse of strategy buffers and reads of blocks into these reused
> --- buffers while VACUUMing are tracked in pg_stat_io.
> +-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
> +-- demand for shared buffers from concurrent queries, some blocks may be
> +-- evicted from the strategy ring before they can be reused. In such cases
> +-- this, the backend will evict a block from a shared buffer outside of the
> +-- ring and add it to the ring. This is considered an eviction and not a
> reuse.
>
> The new comment seems not to be accurate if my understanding is correct. How
> about the following?
>
>     Test that reuse of strategy buffers and reads of blocks into these
>     reused buffers while VACUUMing are tracked in pg_stat_io. If there is
>     sufficient demand for shared buffers from concurrent queries, some
>     buffers may be pinned by other backends before they can be reused. In
>     such cases, the backend will evict a buffer from a shared buffer
>     outside of the ring and add it to the ring. This is considered an
>     eviction and not a reuse.

I integrated the suggested change of the comment and tweaked it a bit
more. And finally pushed the fix. Sorry that it took so long.

Greetings,

Andres Freund
Re: stats test intermittent failure
Hi,

On Tue, Jul 11, 2023 at 3:35 AM Melanie Plageman wrote:
>
> Hi,
>
> Jeff pointed out that one of the pg_stat_io tests has failed a few times
> over the past months (here on morepork [1] and more recently here on
> francolin [2]).
>
> Failing test diff for those who prefer not to scroll:
>
> +++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out	2023-07-07 18:48:25.976313231 +
> @@ -1415,7 +1415,7 @@
>   :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
>   ?column? | ?column?
>  ----------+----------
> - t        | t
> + t        | f
>
> My theory about the test failure is that, when there is enough demand
> for shared buffers, the flapping test fails because it expects buffer
> access strategy *reuses* and concurrent queries already flushed those
> buffers before they could be reused. Attached is a patch which I think
> will fix the test while keeping some code coverage. If we count
> evictions and reuses together, those should have increased.

Yeah, I've not reproduced this issue but it's possible. IIUC if we get
the buffer from the ring, we count an I/O as "reuse" even if the buffer
has already been flushed/replaced. However, if the buffer in the ring is
pinned by other backends, we end up evicting a buffer from outside of
the ring and adding it to the ring, which is counted as "eviction".

Regarding the patch, I have a comment:

-- Test that reuse of strategy buffers and reads of blocks into these reused
--- buffers while VACUUMing are tracked in pg_stat_io.
+-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
+-- demand for shared buffers from concurrent queries, some blocks may be
+-- evicted from the strategy ring before they can be reused. In such cases
+-- this, the backend will evict a block from a shared buffer outside of the
+-- ring and add it to the ring. This is considered an eviction and not a reuse.

The new comment seems not to be accurate if my understanding is correct.
How about the following?

    Test that reuse of strategy buffers and reads of blocks into these
    reused buffers while VACUUMing are tracked in pg_stat_io. If there is
    sufficient demand for shared buffers from concurrent queries, some
    buffers may be pinned by other backends before they can be reused. In
    such cases, the backend will evict a buffer from a shared buffer
    outside of the ring and add it to the ring. This is considered an
    eviction and not a reuse.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
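[Editor's note] The reuse-versus-eviction distinction discussed above can be made concrete with a small toy model. This is plain Python with invented names, not PostgreSQL's actual buffer-manager code; it only assumes the behavior described in this thread: a strategy ring reuses its own unpinned buffers, and falls back to evicting a buffer from the shared pool when the ring buffer is pinned (or gone).

```python
# Toy model (not PostgreSQL code): each buffer access through a strategy
# ring is counted as exactly one of "reuse" or "eviction".

class Buffer:
    def __init__(self, buf_id):
        self.buf_id = buf_id
        self.pin_count = 0  # pins held by other (simulated) backends

class StrategyRing:
    def __init__(self, shared_pool, ring_size):
        self.shared_pool = shared_pool   # buffers outside the ring
        self.ring = [None] * ring_size   # slots owned by the strategy
        self.next_slot = 0
        self.reuses = 0
        self.evictions = 0

    def get_buffer(self):
        slot = self.next_slot
        self.next_slot = (self.next_slot + 1) % len(self.ring)
        buf = self.ring[slot]
        if buf is not None and buf.pin_count == 0:
            # Ring buffer present and unpinned: reuse it in place.
            self.reuses += 1
            return buf
        # Ring slot empty, or buffer pinned by another backend: evict a
        # buffer from the shared pool and add it to the ring instead.
        victim = self.shared_pool.pop()
        self.ring[slot] = victim
        self.evictions += 1
        return victim

# Quiet system: 4 evictions to fill the ring, then every access reuses.
pool = [Buffer(i) for i in range(32)]
ring = StrategyRing(pool, ring_size=4)
for _ in range(12):
    ring.get_buffer()
print(ring.evictions, ring.reuses)   # prints: 4 8

# A concurrent pin on a ring buffer turns the next reuse into an eviction.
ring.ring[0].pin_count = 1
ring.next_slot = 0
ring.get_buffer()
print(ring.evictions, ring.reuses)   # prints: 5 8
```

In this model "reuse" and "eviction" are mutually exclusive outcomes of a single access, which is the property the proposed test change leans on.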
Re: stats test intermittent failure
Hi Melanie,

10.07.2023 21:35, Melanie Plageman wrote:
> Hi,
>
> Jeff pointed out that one of the pg_stat_io tests has failed a few times
> over the past months (here on morepork [1] and more recently here on
> francolin [2]).
>
> Failing test diff for those who prefer not to scroll:
>
> +++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out	2023-07-07 18:48:25.976313231 +
> @@ -1415,7 +1415,7 @@
>   :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
>   ?column? | ?column?
>  ----------+----------
> - t        | t
> + t        | f
>
> My theory about the test failure is that, when there is enough demand
> for shared buffers, the flapping test fails because it expects buffer
> access strategy *reuses* and concurrent queries already flushed those
> buffers before they could be reused. Attached is a patch which I think
> will fix the test while keeping some code coverage. If we count
> evictions and reuses together, those should have increased.

I managed to reproduce that failure with the attached patch applied (on
master) and with the following script (that effectively multiplies
probability of the failure by 360):

CPPFLAGS="-O0" ./configure -q --enable-debug --enable-cassert --enable-tap-tests && make -s -j`nproc` && make -s check -C src/test/recovery
mkdir -p src/test/recovery00/t
cp src/test/recovery/t/027_stream_regress.pl src/test/recovery00/t/
cp src/test/recovery/Makefile src/test/recovery00/
for ((i=1;i<=9;i++)); do cp -r src/test/recovery00/ src/test/recovery$i; done
for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j9 --linebuffer --tag make -s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 recovery8 recovery9 || break; done

Without your patch, I get:

iteration 2
...
recovery5 #   Failed test 'regression tests pass'
recovery5 #   at t/027_stream_regress.pl line 92.
recovery5 #          got: '256'
recovery5 #     expected: '0'
...

src/test/recovery5/tmp_check/log/regress_log_027_stream_regress contains:

--- .../src/test/regress/expected/stats.out	2023-07-11 20:05:10.536059706 +0300
+++ .../src/test/recovery5/tmp_check/results/stats.out	2023-07-11 20:30:46.790551305 +0300
@@ -1418,7 +1418,7 @@
  :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
  ?column? | ?column?
 ----------+----------
- t        | t
+ t        | f
(1 row)

With your patch applied, 10 iterations performed successfully for me. So
it looks like your theory and your fix are correct.

Best regards,
Alexander

diff --git a/src/test/regress/expected/compression.out b/src/test/regress/expected/compression.out
index 834b7555cb..6a84afd819 100644
--- a/src/test/regress/expected/compression.out
+++ b/src/test/regress/expected/compression.out
@@ -1,3 +1,6 @@
+SET client_min_messages TO 'warning';
+DROP TABLE IF EXISTS cmdata, cmdata1, cmmove1, cmmove3, cmdata2, cmmove2, cmpart, cmpart2 CASCADE;
+RESET client_min_messages;
 \set HIDE_TOAST_COMPRESSION false
 -- ensure we get stable results regardless of installation's default
 SET default_toast_compression = 'pglz';
diff --git a/src/test/regress/expected/compression_1.out b/src/test/regress/expected/compression_1.out
index ddcd137c49..7ff83d2472 100644
--- a/src/test/regress/expected/compression_1.out
+++ b/src/test/regress/expected/compression_1.out
@@ -1,3 +1,6 @@
+SET client_min_messages TO 'warning';
+DROP TABLE IF EXISTS cmdata, cmdata1, cmmove1, cmmove3, cmdata2, cmmove2, cmpart, cmpart2 CASCADE;
+RESET client_min_messages;
 \set HIDE_TOAST_COMPRESSION false
 -- ensure we get stable results regardless of installation's default
 SET default_toast_compression = 'pglz';
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 1aca77491b..d8ad54ec5d 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -1,3 +1,6 @@
+SET client_min_messages TO 'warning';
+DROP FUNCTION IF EXISTS explain_filter, explain_filter_to_json CASCADE;
+RESET client_min_messages;
 --
 -- EXPLAIN
 --
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 3e5645c2ab..cbd5d3affb 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1,3 +1,7 @@
+SET client_min_messages TO 'warning';
+DROP TABLE IF EXISTS idxpart, idxpart_another, covidxpart, covidxpart3, covidxpart4 CASCADE;
+DROP SCHEMA IF EXISTS regress_indexing CASCADE;
+RESET client_min_messages;
 -- Creating an index on a partitioned table makes the partitions
 -- automatically get the index
 create table idxpart (a int, b int, c text) partition by range (a);
diff --git a/src/test/regress/expected/memoize.out b/src/test/regress/expected/memoize.out
index f5202430f8..8577958685 100644
--- a/src/test/regress/expected/memoize.out
+++ b/src/test/regress/expected/memoize.out
@@ -1,3 +1,6 @@
stats test intermittent failure
Hi,

Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1] and more recently here on
francolin [2]).

Failing test diff for those who prefer not to scroll:

+++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out	2023-07-07 18:48:25.976313231 +
@@ -1415,7 +1415,7 @@
  :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
  ?column? | ?column?
 ----------+----------
- t        | t
+ t        | f

My theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.

- Melanie

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork=2023-06-16%2018%3A30%3A32
[2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=francolin=2023-07-07%2018%3A43%3A57=recovery-check

From cf41551431751a2b03e0fc08b7296301fba81992 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Mon, 10 Jul 2023 13:57:36 -0400
Subject: [PATCH v1] Fix pg_stat_io buffer reuse test instability

The stats regression test attempts to ensure that Buffer Access
Strategy "reuses" are being counted in pg_stat_io by vacuuming a table
which is larger than the size of the strategy ring. However, when
shared buffers are in sufficiently high demand, another backend could
evict one of the blocks in the strategy ring before the first backend
has a chance to reuse the buffer. The backend using the strategy would
then evict a block from another shared buffer and add that buffer to
the strategy ring. This counts as an eviction and not a reuse in
pg_stat_io. Count both evictions and reuses in the test to ensure it
does not fail incorrectly.
Reported-by: Jeff Davis
---
 src/test/regress/expected/stats.out | 30 +++++++++++++++++++-----------
 src/test/regress/sql/stats.sql      | 19 +++++++++++++------
 2 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 8e63340782..2ea6bbca7a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1385,7 +1385,11 @@ SELECT :io_sum_local_new_tblspc_writes > :io_sum_local_after_writes;
 RESET temp_buffers;
 -- Test that reuse of strategy buffers and reads of blocks into these reused
---- buffers while VACUUMing are tracked in pg_stat_io.
+-- buffers while VACUUMing are tracked in pg_stat_io. If there is sufficient
+-- demand for shared buffers from concurrent queries, some blocks may be
+-- evicted from the strategy ring before they can be reused. In such cases
+-- this, the backend will evict a block from a shared buffer outside of the
+-- ring and add it to the ring. This is considered an eviction and not a reuse.
 -- Set wal_skip_threshold smaller than the expected size of
 -- test_io_vac_strategy so that, even if wal_level is minimal, VACUUM FULL will
 -- fsync the newly rewritten test_io_vac_strategy instead of writing it to WAL.
@@ -1393,15 +1397,15 @@ RESET temp_buffers;
 -- shared buffers -- preventing us from testing BAS_VACUUM BufferAccessStrategy
 -- reads.
 SET wal_skip_threshold = '1 kB';
-SELECT sum(reuses) AS reuses, sum(reads) AS reads
+SELECT sum(reuses) AS reuses, sum(reads) AS reads, sum(evictions) AS evictions
   FROM pg_stat_io
   WHERE context = 'vacuum' \gset io_sum_vac_strategy_before_
 CREATE TABLE test_io_vac_strategy(a int, b int) WITH (autovacuum_enabled = 'false');
 INSERT INTO test_io_vac_strategy SELECT i, i from generate_series(1, 4500)i;
 -- Ensure that the next VACUUM will need to perform IO by rewriting the table
 -- first with VACUUM (FULL).
 VACUUM (FULL) test_io_vac_strategy;
--- Use the minimum BUFFER_USAGE_LIMIT to cause reuses with the smallest table
--- possible.
+-- Use the minimum BUFFER_USAGE_LIMIT to cause reuses or evictions with the
+-- smallest table possible.
 VACUUM (PARALLEL 0, BUFFER_USAGE_LIMIT 128) test_io_vac_strategy;
 SELECT pg_stat_force_next_flush();
  pg_stat_force_next_flush
@@ -1409,13 +1413,19 @@ SELECT pg_stat_force_next_flush();
 (1 row)
 
-SELECT sum(reuses) AS reuses, sum(reads) AS reads
+SELECT sum(reuses) AS reuses, sum(reads) AS reads, sum(evictions) AS evictions
   FROM pg_stat_io
   WHERE context = 'vacuum' \gset io_sum_vac_strategy_after_
-SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads,
-       :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
- ?column? | ?column?
-----------+----------
- t        | t
+SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads;
+ ?column?
+----------
+ t
+(1 row)
+
+SELECT (:io_sum_vac_strategy_after_reuses +
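[Editor's note] The invariant the patch relies on can be sketched in a few lines. This is a toy model, not server code; it only assumes what the commit message states: every strategy buffer access is classified in pg_stat_io's vacuum context as exactly one of "reuse" or "eviction", so their sum grows with each access even when concurrent activity keeps reuses from happening.

```python
import random

# Toy model of the fixed test's invariant: each strategy buffer access
# increments exactly one of `reuses` or `evictions`, so the sum advances
# once per access regardless of how much concurrent interference occurs.
def vacuum_accesses(n_accesses, interference, rng):
    reuses = evictions = 0
    for _ in range(n_accesses):
        if rng.random() < interference:
            evictions += 1  # ring buffer pinned/stolen by another backend
        else:
            reuses += 1     # ring buffer reused in place
    return reuses, evictions

rng = random.Random(42)
for interference in (0.0, 0.5, 1.0):
    reuses, evictions = vacuum_accesses(100, interference, rng)
    # The old test asserted that `reuses` alone grew, which fails under
    # heavy interference; the sum is stable in every case.
    assert reuses + evictions == 100
print("ok")
```

Under full interference the model produces zero reuses, which is exactly the scenario that made the original single-counter check flap.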