Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Mon, Jun 10, 2024 at 03:39:34PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 06:29:17AM +, Bertrand Drouvot wrote:
> > Thanks for the report! I think it makes sense to add it to the list of known
> > failures.
> >
> > One way to deal with those corner cases could be to make use of injection points
> > around places where RUNNING_XACTS is emitted, thoughts?
>
> Ah.  You mean to force a wait in the code path generating the standby
> snapshots for the sake of this test?

Yeah.

> That's interesting, I like it.

Great, will look at it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
On Mon, Jun 10, 2024 at 06:29:17AM +, Bertrand Drouvot wrote:
> Thanks for the report! I think it makes sense to add it to the list of known
> failures.
>
> One way to deal with those corner cases could be to make use of injection points
> around places where RUNNING_XACTS is emitted, thoughts?

Ah.  You mean to force a wait in the code path generating the standby
snapshots for the sake of this test?  That's interesting, I like it.
--
Michael

signature.asc
Description: PGP signature
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi Alexander,

On Sat, Jun 08, 2024 at 07:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand and Michael,
>
> 23.01.2024 11:07, Bertrand Drouvot wrote:
> > On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> > > Anyway, that's not the end of it.  What should we do for snapshot
> > > records coming from the bgwriter?
> >
> > What about?
> >
> > 3) depending on how stabilized this test (and others that suffer from "random"
> > xl_running_xacts) is, then think about the bgwriter.
>
> A recent buildfarm failure [1] reminds me of that remaining question.
>
> It's hard to determine from this info why row_removal_activeslot was not
> invalidated, but running this test on a slowed down Windows VM, I (still)
> get the same looking failures caused by RUNNING_XACTS appearing just before
> `invalidating obsolete replication slot "row_removal_inactiveslot"`.
> So I would consider this failure as yet another result of bgwriter activity
> and add it to the list of known failures as such...

Thanks for the report! I think it makes sense to add it to the list of known
failures.

One way to deal with those corner cases could be to make use of injection points
around places where RUNNING_XACTS is emitted, thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hello Bertrand and Michael,

23.01.2024 11:07, Bertrand Drouvot wrote:
> On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> > Anyway, that's not the end of it.  What should we do for snapshot
> > records coming from the bgwriter?
>
> What about?
>
> 3) depending on how stabilized this test (and others that suffer from "random"
> xl_running_xacts) is, then think about the bgwriter.

A recent buildfarm failure [1] reminds me of that remaining question.

Here we have a slow machine (a successful run, for example [2], shows 541.13s
duration of the test) and the following information logged:

[13:55:13.725](34.411s) ok 25 - inactiveslot slot invalidation is logged with vacuum on pg_class
[13:55:13.727](0.002s) not ok 26 - activeslot slot invalidation is logged with vacuum on pg_class
[13:55:13.728](0.001s) #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
#   at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
[14:27:42.995](1949.267s) # poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[14:27:42.999](0.004s) not ok 27 - confl_active_logicalslot updated
[14:27:43.000](0.001s) #   Failed test 'confl_active_logicalslot updated'
#   at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
Timed out waiting confl_active_logicalslot to be updated at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.

--- 035_standby_logical_decoding_standby.log:
2024-06-06 13:55:07.715 UTC [9172:7] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
2024-06-06 13:55:07.715 UTC [9172:8] DETAIL:  The slot conflicted with xid horizon 754.
2024-06-06 13:55:07.715 UTC [9172:9] CONTEXT:  WAL redo at 0/4020A80 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 754, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 1, nunused: 0, dead: [48]; blkref #0: rel 1663/16384/2610, blk 0
2024-06-06 13:55:14.372 UTC [7532:1] [unknown] LOG:  connection received: host=127.0.0.1 port=55328
2024-06-06 13:55:14.381 UTC [7532:2] [unknown] LOG:  connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_hba.conf:2)
2024-06-06 13:55:14.381 UTC [7532:3] [unknown] LOG:  connection authorized: user=pgrunner database=postgres application_name=035_standby_logical_decoding.pl
2024-06-06 13:55:14.443 UTC [7532:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 13:55:14.452 UTC [7532:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.090 user=pgrunner database=postgres host=127.0.0.1 port=55328
# (there is no `invalidating obsolete replication slot "row_removal_activeslot"` message)
...
2024-06-06 14:27:42.675 UTC [4032:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 14:27:42.681 UTC [4032:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.080 user=pgrunner database=postgres host=127.0.0.1 port=58713
2024-06-06 14:27:43.095 UTC [7892:2] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

It's hard to determine from this info why row_removal_activeslot was not
invalidated, but running this test on a slowed down Windows VM, I (still)
get the same looking failures caused by RUNNING_XACTS appearing just before
`invalidating obsolete replication slot "row_removal_inactiveslot"`.
So I would consider this failure as yet another result of bgwriter activity
and add it to the list of known failures as such...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2024-06-06%2012%3A36%3A11
[2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo=2024-06-05%2017%3A03%3A13=misc-check

Best regards,
Alexander
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> On Mon, Jan 22, 2024 at 09:07:45AM +, Bertrand Drouvot wrote:
>
> I've rewritten some of that, and applied the patch down to v16.

Thanks!

> > Yeah, I can clearly see how this patch helps from a theoretical perspective but
> > rely on Alexander's testing to see how it actually stabilizes the test.
>
> Anyway, that's not the end of it.  What should we do for snapshot
> records coming from the bgwriter?

I've mixed feelings about it. On one hand we'll decrease even more the risk of
seeing a xl_running_xacts in the middle of a test, but on the other hand I agree
with Tom's concern [1]: I think that we could miss "corner cases/bug" detection
(like the one reported in [2]).

What about?

1) Apply "wait_until_vacuum_can_remove() + autovacuum disabled" where it makes
sense and for tests that suffer from random xl_running_xacts. I can look at
031_recovery_conflict.pl, do you have others in mind?

2) fix [2]

3) depending on how stabilized this test (and others that suffer from "random"
xl_running_xacts) is, then think about the bgwriter.

[1]: https://www.postgresql.org/message-id/1375923.1705291719%40sss.pgh.pa.us
[2]: https://www.postgresql.org/message-id/flat/zatjw2xh+tquc...@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
On Mon, Jan 22, 2024 at 09:07:45AM +, Bertrand Drouvot wrote:
> On Mon, Jan 22, 2024 at 03:54:44PM +0900, Michael Paquier wrote:
>> Also, wouldn't it be better to document in the test why
>> txid_current_snapshot() is chosen?  Contrary to txid_current(), it
>> does not generate a Transaction/COMMIT to make the test more
>> predictable, something you have mentioned upthread, and implied in the
>> test.
>
> Good point, added more comments in v8 (but not mentioning txid_current() as
> after giving more thought about it then I think it was not the right approach
> in any case).

Fine by me.

>> - INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
>>
>> This removes two INSERTs on flush_wal and refactors the code to do the
>> INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
>> document a reference about the reason why an INSERT is used.  Couldn't
>> you just use a normal comment here?
>
> Agree, done in v8.

I've rewritten some of that, and applied the patch down to v16.  Let's
see how this stabilizes things, but that's likely going to take some
time as it depends on skink's mood.

>> I need to review what you have more thoroughly, but is it OK to assume
>> that both of you are happy with the latest version of the patch in
>> terms of stability gained?  That's still not the full picture, still a
>> good step towards that.
>
> Yeah, I can clearly see how this patch helps from a theoretical perspective
> but rely on Alexander's testing to see how it actually stabilizes the test.

Anyway, that's not the end of it.  What should we do for snapshot
records coming from the bgwriter?  The slower the system, the higher
the odds of hitting a conflict with such records, even if the horizon
check should help.
--
Michael

signature.asc
Description: PGP signature
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Mon, Jan 22, 2024 at 03:54:44PM +0900, Michael Paquier wrote:
> On Fri, Jan 19, 2024 at 09:03:01AM +, Bertrand Drouvot wrote:
> > On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
>
> +# Launch $sql and wait for a new snapshot that has a newer horizon before
> +# doing the vacuum with $vac_option on $to_vac.
> +sub wait_until_vacuum_can_remove
>
> This had better document what the arguments of this routine are,
> because that's really unclear.  $to_vac is the relation that will be
> vacuumed, for one.

Agree, done that way in v8 attached.

> Also, wouldn't it be better to document in the test why
> txid_current_snapshot() is chosen?  Contrary to txid_current(), it
> does not generate a Transaction/COMMIT to make the test more
> predictable, something you have mentioned upthread, and implied in the
> test.

Good point, added more comments in v8 (but not mentioning txid_current() as
after giving more thought about it then I think it was not the right approach in
any case).

> - INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
>
> This removes two INSERTs on flush_wal and refactors the code to do the
> INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
> document a reference about the reason why an INSERT is used.  Couldn't
> you just use a normal comment here?

Agree, done in v8.

> >> I've re-tested the v6 patch today and confirmed that it makes the test
> >> more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
> >> parallel and got failures ('inactiveslot slot invalidation is logged with
> >> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> >> (With unlimited CPU, when average test duration is around 70 seconds.)
> >>
> >> But with v6 applied, 60 iterations succeeded.
> >
> > Nice! Thanks for the testing!
>
> I need to review what you have more thoroughly, but is it OK to assume
> that both of you are happy with the latest version of the patch in
> terms of stability gained?  That's still not the full picture, still a
> good step towards that.

Yeah, I can clearly see how this patch helps from a theoretical perspective but
rely on Alexander's testing to see how it actually stabilizes the test.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From 1380408965d8c7cb1a2e63043af7d1b8033a7e89 Mon Sep 17 00:00:00 2001
From: bdrouvot
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v8] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

While at it, also fixing some typos/bad test description in it.
---
 .../t/035_standby_logical_decoding.pl | 78 +++
 1 file changed, 45 insertions(+), 33 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..6664405772 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -238,6 +238,35 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum. Arguments are: $vac_option (the option to be passed to the
+# vacuum command), $sql (the sql to launch before triggering the vacuum) and
+# $to_vac (the relation to vacuum).
+# Note that to get the horizon we're using pg_current_snapshot() (it does not
+# generate a Transaction/COMMIT wal record, so we don't increase the risk to see
+# a xl_running_xacts that would advance active replication slot's catalog_xmin).
+# Indeed advancing the active replication slot's catalog_xmin would break some
+# tests that expect the active slot to conflict with the catalog xmin horizon.
+sub wait_until_vacuum_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	# Get the current xid horizon
+	my $xid_horizon = $node_primary->safe_psql('testdb',
+		qq[select pg_snapshot_xmin(pg_current_snapshot());]);
+
+	# Launch our sql
+	$node_primary->safe_psql('testdb', qq[$sql]);
+
+	# Wait until we get a newer horizon
+	$node_primary->poll_query_until('testdb',
+		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0")
+	  or die "new snapshot does not have a newer horizon";
+
+	# Launch the vacuum command and insert into flush_wal (see create table
+	# flush_wal for the reason why).
+	$node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose $to_vac;
+		INSERT INTO flush_wal DEFAULT VALUES;]);
+}
 
 # Initialize primary node
@@ -248,6 +277,7 @@ $node_primary->append_conf(
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
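[Editorial note: the heart of wait_until_vacuum_can_remove() above is a poll loop: capture the current xid horizon, run the workload, then wait until pg_snapshot_xmin(pg_current_snapshot()) has advanced past the captured value before vacuuming. The Python model below is purely illustrative and not part of any patch; `read_horizon` is a stand-in for the safe_psql() query against the primary, and the simulated counter replaces a real cluster.]

```python
import itertools
import time

def wait_for_newer_horizon(read_horizon, baseline, timeout=5.0, interval=0.01):
    """Poll read_horizon() until it reports a horizon newer than baseline,
    mirroring the poll_query_until() call in wait_until_vacuum_can_remove()."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_horizon() > baseline:
            return True
        time.sleep(interval)
    return False  # the Perl routine dies here instead

# Simulated xid horizon that advances on every observation (a stand-in for
# pg_snapshot_xmin(pg_current_snapshot()) once no transaction holds it back).
xids = itertools.count(start=754)
read_horizon = lambda: next(xids)

baseline = read_horizon()                            # 1) capture current horizon
# 2) ... the $sql workload would run here ...
ok = wait_for_newer_horizon(read_horizon, baseline)  # 3) wait for a newer one
```

Only once this wait succeeds is it safe to launch the VACUUM, since a newer horizon guarantees the dead rows created by the workload are actually removable.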
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
On Fri, Jan 19, 2024 at 09:03:01AM +, Bertrand Drouvot wrote:
> On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
>> 15.01.2024 12:00, Alexander Lakhin wrote:
>>> If this approach looks promising to you, maybe we could add a submodule to
>>> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
>>> 019_replslot_limit) as well.
>>>
>>> Personally I think that having such a functionality for using in tests
>>> might be useful not only to avoid some "problematic" behaviour but also to
>>> test the opposite cases.
>>
>> After spending a few days on it, I've discovered two more issues:
>> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
>> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
>>
>> (The latter is not related to bgwriter directly, but it was discovered
>> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
>>
>> So it becomes clear that the 035 test is not the only one, which might
>> suffer from bgwriter's activity,
>
> Yeah... thanks for sharing!
>
>> and inventing a way to stop bgwriter/
>> control it is a different subject, getting out of scope of the current
>> issue.
>
> Agree.
>
>> Maybe it would be a right move to commit the fix, and then think about
>> more rare failures.
>
> +1

Yeah, agreed to make things more stable before making them fancier.

>> 2) Shall we align the 035_standby_logical_decoding with
>> 031_recovery_conflict in regard to improving stability of vacuum?
>
> Yeah, I think that could make sense.

Probably.  That can always be done as a separate change, after a few
runs of the slow buildfarm members able to reproduce the failure.

>> I see the following options for this:
>> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
>> b) use FREEZE and autovacuum = off in both tests;
>> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>> autovacuum = off in both.
>
> I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
> sure using FREEZE would give full "stabilization predictability", at least for
> 035_standby_logical_decoding.pl). That said I did not test what the outcome
> would be for 031_recovery_conflict.pl by making use of a).

Yeah, I think I agree here with a), as v7 does for 035.

+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove

This had better document what the arguments of this routine are,
because that's really unclear.  $to_vac is the relation that will be
vacuumed, for one.

Also, wouldn't it be better to document in the test why
txid_current_snapshot() is chosen?  Contrary to txid_current(), it
does not generate a Transaction/COMMIT to make the test more
predictable, something you have mentioned upthread, and implied in the
test.

- INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal

This removes two INSERTs on flush_wal and refactors the code to do the
INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
document a reference about the reason why an INSERT is used.  Couldn't
you just use a normal comment here?

>> I've re-tested the v6 patch today and confirmed that it makes the test
>> more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
>> parallel and got failures ('inactiveslot slot invalidation is logged with
>> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
>> (With unlimited CPU, when average test duration is around 70 seconds.)
>>
>> But with v6 applied, 60 iterations succeeded.
>
> Nice! Thanks for the testing!

I need to review what you have more thoroughly, but is it OK to assume
that both of you are happy with the latest version of the patch in
terms of stability gained?  That's still not the full picture, still a
good step towards that.
--
Michael

signature.asc
Description: PGP signature
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> Hello,
>
> 15.01.2024 12:00, Alexander Lakhin wrote:
> > If this approach looks promising to you, maybe we could add a submodule to
> > perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> > 019_replslot_limit) as well.
> >
> > Personally I think that having such a functionality for using in tests
> > might be useful not only to avoid some "problematic" behaviour but also to
> > test the opposite cases.
>
> After spending a few days on it, I've discovered two more issues:
> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
>
> (The latter is not related to bgwriter directly, but it was discovered
> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
>
> So it becomes clear that the 035 test is not the only one, which might
> suffer from bgwriter's activity,

Yeah... thanks for sharing!

> and inventing a way to stop bgwriter/
> control it is a different subject, getting out of scope of the current
> issue.

Agree.

> 15.01.2024 11:49, Bertrand Drouvot wrote:
> > We did a few things in this thread, so to sum up what we've discovered:
> >
> > - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> > - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
> >   (proposal in v6 attached)
> > - we need a way to control how frequent xl_running_xacts are emitted (to ensure
> >   they are not triggered in the middle of an active slot invalidation test).
> >
> > I'm not sure it's possible to address Tom's concern and keep the test
> > "predictable".
> >
> > So, I think I'd vote for Michael's proposal to implement a superuser-settable
> > developer GUC (as sending a SIGSTOP on the bgwriter (and bypassing $windows_os)
> > would still not address Tom's concern anyway).
> >
> > Another option would be to "sacrifice" the full predictability of the test (in
> > favor of real-world behavior testing)?
> >
> > [1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal
>
> So, now we have the test 035 failing due to nondeterministic vacuum
> activity in the first place, and due to bgwriter's activity in the second.

Yeah, that's also my understanding.

> Maybe it would be a right move to commit the fix, and then think about
> more rare failures.

+1

> Though I have a couple of questions regarding the fix left, if you don't
> mind:
> 1) The test has minor defects in the comments, that I noted before [1];
> would you like to fix them in passing?
>
> > BTW, it looks like the comment:
> > # One way to produce recovery conflict is to create/drop a relation and
> > # launch a vacuum on pg_class with hot_standby_feedback turned off on the
> > standby.
> > in scenario 3 is a copy-paste error.

Nice catch, thanks! Fixed in v7 attached.

> > Also, there are two "Scenario 4" in this test.

D'oh! Fixed in v7.

> 2) Shall we align the 035_standby_logical_decoding with
> 031_recovery_conflict in regard to improving stability of vacuum?

Yeah, I think that could make sense.

> I see the following options for this:
> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
> b) use FREEZE and autovacuum = off in both tests;
> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
> autovacuum = off in both.

I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
sure using FREEZE would give full "stabilization predictability", at least for
035_standby_logical_decoding.pl). That said I did not test what the outcome
would be for 031_recovery_conflict.pl by making use of a).

> I've re-tested the v6 patch today and confirmed that it makes the test
> more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
> parallel and got failures ('inactiveslot slot invalidation is logged with
> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> (With unlimited CPU, when average test duration is around 70 seconds.)
>
> But with v6 applied, 60 iterations succeeded.

Nice! Thanks for the testing!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From b45f0c55d17cddfac7c99d11000b819c8b09fa56 Mon Sep 17 00:00:00 2001
From: bdrouvot
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v7] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

While at it, also fixing some typos/bad test description in it.
---
 .../t/035_standby_logical_decoding.pl | 68 ++-
 1 file changed, 35 insertions(+), 33 deletions(-)
 100.0% src/test/recovery/t/

diff --git
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hello,

15.01.2024 12:00, Alexander Lakhin wrote:
> If this approach looks promising to you, maybe we could add a submodule to
> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> 019_replslot_limit) as well.
>
> Personally I think that having such a functionality for using in tests
> might be useful not only to avoid some "problematic" behaviour but also to
> test the opposite cases.

After spending a few days on it, I've discovered two more issues:
https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com

(The latter is not related to bgwriter directly, but it was discovered
thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)

So it becomes clear that the 035 test is not the only one, which might
suffer from bgwriter's activity, and inventing a way to stop bgwriter/
control it is a different subject, getting out of scope of the current
issue.

15.01.2024 11:49, Bertrand Drouvot wrote:
> We did a few things in this thread, so to sum up what we've discovered:
>
> - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
>   (proposal in v6 attached)
> - we need a way to control how frequent xl_running_xacts are emitted (to ensure
>   they are not triggered in the middle of an active slot invalidation test).
>
> I'm not sure it's possible to address Tom's concern and keep the test
> "predictable".
>
> So, I think I'd vote for Michael's proposal to implement a superuser-settable
> developer GUC (as sending a SIGSTOP on the bgwriter (and bypassing $windows_os)
> would still not address Tom's concern anyway).
>
> Another option would be to "sacrifice" the full predictability of the test (in
> favor of real-world behavior testing)?
>
> [1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal

So, now we have the test 035 failing due to nondeterministic vacuum
activity in the first place, and due to bgwriter's activity in the second.
Maybe it would be a right move to commit the fix, and then think about
more rare failures.

Though I have a couple of questions regarding the fix left, if you don't
mind:
1) The test has minor defects in the comments, that I noted before [1];
would you like to fix them in passing?

> BTW, it looks like the comment:
> # One way to produce recovery conflict is to create/drop a relation and
> # launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
> in scenario 3 is a copy-paste error.
>
> Also, there are two "Scenario 4" in this test.

2) Shall we align the 035_standby_logical_decoding with
031_recovery_conflict in regard to improving stability of vacuum?
I see the following options for this:
a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
b) use FREEZE and autovacuum = off in both tests;
c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
autovacuum = off in both.

I've re-tested the v6 patch today and confirmed that it makes the test
more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
parallel and got failures ('inactiveslot slot invalidation is logged with
vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
(With unlimited CPU, when average test duration is around 70 seconds.)

But with v6 applied, 60 iterations succeeded.

[1] https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b...@gmail.com

Best regards,
Alexander
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hello Michael and Bertrand,

15.01.2024 06:59, Michael Paquier wrote:
> The WAL records related to standby snapshots are playing a lot with
> the randomness of the failures we are seeing.  Alexander has mentioned
> offlist something else: using SIGSTOP on the bgwriter to avoid these
> records and make the test more stable.  That would not be workable for
> Windows, but I could live with that knowing that logical decoding for
> standbys has no platform-specific tweak for the code paths we're
> testing here, and that would put as limitation to skip the test for
> $windows_os.

I've found a way to implement pause/resume for Windows processes and it
looks acceptable to me if we can afford "use Win32::API;" on Windows
(maybe the test could be skipped only if this perl module is absent).
Please look at the PoC patch for the test 035_standby_logical_decoding.
(The patched test passes for me.)

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally I think that having such a functionality for using in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.

> While thinking about that, a second idea came into my mind: a
> superuser-settable developer GUC to disable such WAL records to be
> generated within certain areas of the test.  This requires a small
> implementation, but nothing really huge, while being portable
> everywhere.  And it is not the first time I've been annoyed with these
> records when wanting a predictable set of WAL records for some test
> case.

I see that the test in question exists in REL_16_STABLE, it means that a
new GUC would not help there?

Best regards,
Alexander

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..8b08c7b5c7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,8 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use Win32::API;
+
 my ($stdin, $stdout, $stderr,
 	$cascading_stdout, $cascading_stderr, $subscriber_stdin,
 	$subscriber_stdout, $subscriber_stderr, $ret,
@@ -28,6 +30,28 @@ my $res;
 my $primary_slotname = 'primary_physical';
 my $standby_physical_slotname = 'standby_physical';
 
+my $OpenProcess = new Win32::API("kernel32", "HANDLE OpenProcess(DWORD dwDesiredAccess, BOOL bInheritHandle, DWORD dwProcessId)");
+my $NtSuspendProcess = new Win32::API("ntdll", 'LONG NtSuspendProcess(HANDLE hProcess)');
+my $NtResumeProcess = new Win32::API("ntdll", 'LONG NtResumeProcess(HANDLE hProcess)');
+my $CloseHandle = new Win32::API("kernel32", "BOOL CloseHandle(HANDLE hObject)");
+my $PROCESS_ALL_ACCESS = 0x001F0FFF;
+
+sub suspend_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtSuspendProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
+sub resume_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtResumeProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
 # Fetch xmin columns from slot's pg_replication_slots row, after waiting for
 # given boolean condition to be true to ensure we've reached a quiescent state.
 sub wait_for_xmins
@@ -456,6 +480,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+my $bgwriterpid = $node_primary->safe_psql('postgres', "SELECT pid FROM pg_stat_activity WHERE backend_type = 'background writer'");
+suspend_process($bgwriterpid);
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -690,6 +717,7 @@ $node_primary->safe_psql('testdb', qq[INSERT INTO prun VALUES (1, 'A');]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'B';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'C';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(25);]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 
 $node_primary->wait_for_replay_catchup($node_standby);
@@ -709,6 +737,8 @@ check_pg_recvlogical_stderr($handle,
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
+resume_process($bgwriterpid);
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 5: incorrect wal_level on primary.
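[Editorial note: the NtSuspendProcess/NtResumeProcess pair in the PoC patch above is the Windows counterpart of the SIGSTOP idea discussed earlier in the thread. On POSIX systems the same pause/resume needs no kernel32/ntdll calls, just signals. The Python sketch below is purely illustrative (not part of any patch) and Linux-specific: it reads the process state letter from /proc, and suspends a throwaway `sleep` child rather than a real bgwriter.]

```python
import os
import signal
import subprocess
import time

def proc_state(pid):
    # Read the state letter from /proc/<pid>/stat (Linux-only):
    # 'T' means stopped, 'S' sleeping, 'R' running.  The comm field can
    # contain spaces, so split on the closing parenthesis first.
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

proc = subprocess.Popen(["sleep", "60"])   # stand-in for the bgwriter process
try:
    os.kill(proc.pid, signal.SIGSTOP)      # POSIX analogue of suspend_process()
    time.sleep(0.2)
    stopped = proc_state(proc.pid)         # expected to read 'T' here

    os.kill(proc.pid, signal.SIGCONT)      # POSIX analogue of resume_process()
    time.sleep(0.2)
    resumed = proc_state(proc.pid)         # back to sleeping/running
finally:
    proc.kill()
    proc.wait()
```

A stopped process emits nothing, which is exactly why pausing the bgwriter keeps stray RUNNING_XACTS records out of the WAL during the sensitive window of the test.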
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Mon, Jan 15, 2024 at 01:11:26PM +0900, Michael Paquier wrote:
> On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:
> > Michael Paquier writes:
> >> While thinking about that, a second idea came into my mind: a
> >> superuser-settable developer GUC to disable such WAL records to be
> >> generated within certain areas of the test. This requires a small
> >> implementation, but nothing really huge, while being portable
> >> everywhere. And it is not the first time I've been annoyed with these
> >> records when wanting a predictable set of WAL records for some test
> >> case.
> >
> > Hmm ... I see what you are after, but to what extent would this mean
> > that what we are testing is not our real-world behavior?
>
> Don't think so. We don't care much about these records when it comes
> to checking slot invalidation scenarios with a predictable XID
> horizon, AFAIK.

Yeah, we want to test the slot invalidation behavior, so we need to ensure that
such an invalidation occurs (which is not the case if we get an
xl_running_xacts in the middle) in the first place.

OTOH I also see Tom's point: for example, I think we'd not have discovered [1]
(outside of the field) with such a developer GUC in place.

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID
horizon (proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts records are emitted
(to ensure they are not triggered in the middle of an active slot invalidation
test).

I'm not sure it's possible to address Tom's concern and keep the test
"predictable". So, I think I'd vote for Michael's proposal to implement a
superuser-settable developer GUC (as sending a SIGSTOP to the bgwriter (and
bypassing $windows_os) would still not address Tom's concern anyway).
Another option would be to "sacrifice" the full predictability of the test
(in favor of real-world behavior testing)?

[1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

>From a35a308626b6b61c3994531cbf89fe835f4842c2 Mon Sep 17 00:00:00 2001
From: bdrouvot
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v6] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation
on the standby.
---
 .../t/035_standby_logical_decoding.pl | 59 ++-
 1 file changed, 30 insertions(+), 29 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..9bfa8833b5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -238,6 +238,25 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	# get the current xid horizon
+	my $xid_horizon = $node_primary->safe_psql('testdb', qq[select pg_snapshot_xmin(pg_current_snapshot());]);
+
+	# launch our sql
+	$node_primary->safe_psql('testdb', qq[$sql]);
+
+	$node_primary->poll_query_until('testdb',
+		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0")
+	  or die "new snapshot does not have a newer horizon";
+
+	$node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose $to_vac;
+		INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal;]);
+}
 
 # Initialize primary node
@@ -248,6 +267,7 @@
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
+autovacuum = off
 });
 $node_primary->dump_info;
 $node_primary->start;
@@ -468,13 +488,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE TABLE conflict_test(x integer, y text);
-  DROP TABLE conflict_test;
-  VACUUM full pg_class;
-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
-]);
+wait_until_vacuum_can_remove('full', 'CREATE TABLE conflict_test(x integer, y text);
+	DROP TABLE conflict_test;', 'pg_class');
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
@@ -550,13 +565,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE
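The wait_until_vacuum_can_remove() helper above captures the xid horizon, runs the workload, and then relies on poll_query_until() to wait until the horizon has advanced before issuing the VACUUM. The retry loop behind that is the standard poll-with-deadline pattern; a minimal sketch (Python purely for illustration — the test itself is Perl, and `probe` below is a stand-in for the real pg_snapshot_xmin() SQL query):

```python
import itertools
import time

def poll_until(cond, timeout=30.0, interval=0.1):
    """Retry cond() until it returns truthy or the deadline expires."""
    deadline = time.monotonic() + timeout
    while True:
        if cond():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Stand-in for the SQL probe: "has pg_snapshot_xmin(pg_current_snapshot())
# moved past the horizon captured before the workload ran?". Here the
# simulated horizon advances by one on every poll.
probe = itertools.count(start=1)
captured_horizon = 2

assert poll_until(lambda: next(probe) > captured_horizon, timeout=5)
print("xid horizon advanced; safe to run VACUUM")
```

The `or die` in the Perl helper corresponds to the False (timeout) branch here: the test fails loudly rather than vacuuming while another transaction may still hold back the global xmin.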
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:
> Michael Paquier writes:
>> While thinking about that, a second idea came into my mind: a
>> superuser-settable developer GUC to disable such WAL records to be
>> generated within certain areas of the test. This requires a small
>> implementation, but nothing really huge, while being portable
>> everywhere. And it is not the first time I've been annoyed with these
>> records when wanting a predictable set of WAL records for some test
>> case.
>
> Hmm ... I see what you are after, but to what extent would this mean
> that what we are testing is not our real-world behavior?

Don't think so. We don't care much about these records when it comes
to checking slot invalidation scenarios with a predictable XID
horizon, AFAIK.
--
Michael

signature.asc
Description: PGP signature
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Michael Paquier writes:
> While thinking about that, a second idea came into my mind: a
> superuser-settable developer GUC to disable such WAL records to be
> generated within certain areas of the test. This requires a small
> implementation, but nothing really huge, while being portable
> everywhere. And it is not the first time I've been annoyed with these
> records when wanting a predictable set of WAL records for some test
> case.

Hmm ... I see what you are after, but to what extent would this mean
that what we are testing is not our real-world behavior?

regards, tom lane
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
On Fri, Jan 12, 2024 at 01:46:08PM +, Bertrand Drouvot wrote:
> 1) Michael's proposal up-thread (means tweak the test with a retry logic,
> retrying things if such a standby snapshot is found).
>
> 2) Don't report a test error for active slots in case its catalog_xmin
> advanced.
>
> I'd vote for 2) as:
>
> - this is a corner case and the vast majority of the animals don't report any
> issues (means the active slot conflict detection is already well covered).
>
> - even on the same animal it should be pretty rare for the active slot
> conflict detection to be left uncovered (and the "failing" one would probably
> be moving over time).
>
> - It may be possible that 1) ends up failing (as we'd need to put a limit on
> the retry logic anyhow).
>
> What do you think?
>
> And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
> fully correct, so I'll give it another look.

The WAL records related to standby snapshots are playing a lot with the
randomness of the failures we are seeing. Alexander has mentioned offlist
something else: using SIGSTOP on the bgwriter to avoid these records and make
the test more stable. That would not be workable for Windows, but I could live
with that knowing that logical decoding for standbys has no platform-specific
tweak for the code paths we're testing here, even if it would impose, as a
limitation, skipping the test for $windows_os.

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to disable such WAL records to be
generated within certain areas of the test. This requires a small
implementation, but nothing really huge, while being portable
everywhere. And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

Another possibility would be to move these records elsewhere, outside of the
bgwriter, but we need such records at a good frequency for the availability of
read-only standbys. And surely we'd want an on/off switch anyway to get full
control over test sequences.
--
Michael

signature.asc
Description: PGP signature
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Fri, Jan 12, 2024 at 02:00:01PM +0300, Alexander Lakhin wrote:
> Hi,
>
> 12.01.2024 10:15, Bertrand Drouvot wrote:
> >
> > For this one, the "good" news is that it looks like we no longer see the
> > "terminating" message without a following "obsolete" message (so the engine
> > behaves correctly).
> >
> > There is simply nothing related to the row_removal_activeslot at all (the
> > catalog_xmin advanced and there is no conflict).
>
> Yes, judging from all the failures that we see now, it looks like the
> 0001-Fix-race-condition...patch works as expected.

Yeah, thanks for confirming! I'll create a dedicated hackers thread for that
one.

> > Let's give v5 a try? (please apply
> > v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
> > too).
>
> Unfortunately, I've got the failure again (please see logs attached).
> (_primary.log can confirm that I have used exactly v5 — I see no
> txid_current() calls there...)

Okay ;-( Thanks for the testing. Then I can think of:

1) Michael's proposal up-thread (means tweak the test with a retry logic,
retrying things if such a standby snapshot is found).

2) Don't report a test error for active slots in case its catalog_xmin
advanced.

I'd vote for 2) as:

- this is a corner case and the vast majority of the animals don't report any
issues (means the active slot conflict detection is already well covered).

- even on the same animal it should be pretty rare for the active slot
conflict detection to be left uncovered (and the "failing" one would probably
be moving over time).

- It may be possible that 1) ends up failing (as we'd need to put a limit on
the retry logic anyhow).

What do you think?

And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
fully correct, so I'll give it another look.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Fri, Jan 12, 2024 at 07:01:55AM +0900, Michael Paquier wrote:
> On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> > Bertrand, I've relaunched tests in the same slowed down VM with both
> > patches applied (but with no other modifications) and got a failure
> > with pg_class, similar to what we had seen before:
> > 9 # Failed test 'activeslot slot invalidation is logged with vacuum
> > on pg_class'
> > 9 # at t/035_standby_logical_decoding.pl line 230.
> >
> > Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> > 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Thanks!

For this one, the "good" news is that it looks like we no longer see the
"terminating" message without a following "obsolete" message (so the engine
behaves correctly).

There is simply nothing related to the row_removal_activeslot at all (the
catalog_xmin advanced and there is no conflict). And I agree that this is due
to the Standby/RUNNING_XACTS record that is "advancing" the catalog_xmin of
the active slot.

> Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
> = minimal, because these lead to unpredictable records being inserted,
> impacting the reliability of the tests. We cannot do that here,
> obviously. That may be a long shot, but could it be possible to tweak
> the test with a retry logic, retrying things if such a standby
> snapshot is found because we know that the invalidation is not going
> to work anyway?

I think it all depends on what the xl_running_xacts record contains (that is,
whether or not it "advances" the catalog_xmin in our case). In our case it
does advance it (should it occur) due to the "select txid_current()" that is
done in wait_until_vacuum_can_remove() in 035_standby_logical_decoding.pl.

I suggest making use of txid_current_snapshot() instead (which does not
produce a Transaction/COMMIT WAL record, as opposed to txid_current()).
I think that it could be "enough" for our case here, and it's what v5 attached is now doing. Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch too). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From b1c316a85a2dd43a5ab33adbda1ee82020d84bb1 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Thu, 11 Jan 2024 13:36:57 + Subject: [PATCH v5] Fix 035_standby_logical_decoding.pl race conditions We want to ensure that vacuum was able to remove dead rows (aka no other transactions holding back global xmin) before testing for slots invalidation on the standby. Also, get rid of the active slot invalidation check during the on-access pruning check. This test is racy for active slots and active slots invalidations are well covered in other tests. --- .../t/035_standby_logical_decoding.pl | 114 +- 1 file changed, 59 insertions(+), 55 deletions(-) 100.0% src/test/recovery/t/ diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index 8bc39a5f03..dd4149c8bc 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -191,9 +191,9 @@ sub check_slots_conflict_reason # Drop the slots, re-create them, change hot_standby_feedback, # check xmin and catalog_xmin values, make slot active and reset stat. 
-sub reactive_slots_change_hfs_and_wait_for_xmins +sub recreate_slots_change_hfs_and_wait_for_xmins { - my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_; + my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated, $activate) = @_; # drop the logical slots drop_logical_slots($previous_slot_prefix); @@ -203,8 +203,11 @@ sub reactive_slots_change_hfs_and_wait_for_xmins change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated); - $handle = - make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr); + if ($activate) + { + $handle = + make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr); + } # reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts $node_standby->psql('testdb', q[select pg_stat_reset();]); @@ -213,7 +216,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins # Check invalidation in the logfile and in pg_stat_database_conflicts sub check_for_invalidation { - my ($slot_prefix, $log_start, $test_name) = @_; + my ($slot_prefix, $log_start, $test_name, $activated) = @_; my $active_slot = $slot_prefix . 'activeslot'; my $inactive_slot = $slot_prefix . 'inactiveslot'; @@ -230,14 +233,33 @@ sub check_for_invalidation "activeslot slot invalidation is logged $test_name"); # Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated - ok( $node_standby->poll_query_until( - 'postgres', - "select (confl_active_logicalslot = 1) from
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> Bertrand, I've relaunched tests in the same slowed down VM with both
> patches applied (but with no other modifications) and got a failure
> with pg_class, similar to what we had seen before:
> 9 # Failed test 'activeslot slot invalidation is logged with vacuum
> on pg_class'
> 9 # at t/035_standby_logical_decoding.pl line 230.
>
> Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
= minimal, because these lead to unpredictable records being inserted,
impacting the reliability of the tests. We cannot do that here,
obviously. That may be a long shot, but could it be possible to tweak
the test with a retry logic, retrying things if such a standby
snapshot is found, because we know that the invalidation is not going
to work anyway?
--
Michael

signature.asc
Description: PGP signature
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi, On Thu, Jan 11, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote: > Hi Bertrand, > > 10.01.2024 19:32, Bertrand Drouvot wrote: > > > > > is an absent message 'obsolete replication slot "row_removal_activeslot"' > > Looking at > > recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here: > > > > Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot(). > > > > In case of an active slot we first call ReportSlotInvalidation with the > > second > > parameter set to true (to emit the "terminating" message), then SIGTERM the > > active > > process and then (later) we should call the other ReportSlotInvalidation() > > call with the second parameter set to false (to emit the message that we > > don't > > see here). > > > > So it means InvalidatePossiblyObsoleteSlot() did not trigger the second > > ReportSlotInvalidation() > > call. > > I've found a way to reproduce the issue without slowing down a machine or > running multiple tests in parallel. It's enough for this to add a delay to > allow BackgroundWriterMain() to execute LogStandbySnapshot(): > @@ -692,2 +690,3 @@ > $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]); > +$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]); > $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]); > > With this delay, I get the failure immediately: > $ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C > src/test/recovery > # +++ tap check in src/test/recovery +++ > t/035_standby_logical_decoding.pl .. 47/? > # Failed test 'activeslot slot invalidation is logged with on-access > pruning' > # at t/035_standby_logical_decoding.pl line 227. Thanks a lot for the testing! So I think we have 2 issues here: 1) The one you're mentioning above related to the on-access pruning test: I think the engine behavior is right here and that the test is racy. 
I'm proposing to bypass the active slot invalidation check for this particular test ( as I don't see any "easy" way to take care of this race condition). The active slot invalidation is already well covered in the other tests anyway. I'm proposing the attached v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch for it. 2) The fact that sometime we're getting a termination message which is not followed by an obsolete one (like as discussed in [1]). For this one, I think that InvalidatePossiblyObsoleteSlot() is racy: In case of an active slot we proceed in 2 steps: - terminate the backend holding the slot - report the slot as obsolete This is racy because between the two we release the mutex on the slot, which means the effective_xmin and effective_catalog_xmin could advance during that time. I'm proposing the attached v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch for it. Would it be possible to re-launch your repro (the slow one, not the pg_sleep() one) with bot patch applied and see how it goes? (Please note that v4 replaces v3 that you're already using in your tests). If it helps, I'll propose v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch into a dedicated hackers thread. [1]: https://www.postgresql.org/message-id/ZZ7GpII4bAYN%2BjT5%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 8347851b3cc42655cfd81914f0c2f5cc1d22e2b8 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Thu, 11 Jan 2024 13:36:57 + Subject: [PATCH v4] Fix 035_standby_logical_decoding.pl race conditions We want to ensure that vacuum was able to remove dead rows (aka no other transactions holding back global xmin) before testing for slots invalidation on the standby. Also, get rid of the active slot invalidation check during the on-access pruning check. 
This test is racy for active slots and active slots invalidations are well covered in other tests. --- .../t/035_standby_logical_decoding.pl | 114 +- 1 file changed, 59 insertions(+), 55 deletions(-) 100.0% src/test/recovery/t/ diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index 8bc39a5f03..7a50187326 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -191,9 +191,9 @@ sub check_slots_conflict_reason # Drop the slots, re-create them, change hot_standby_feedback, # check xmin and catalog_xmin values, make slot active and reset stat. -sub reactive_slots_change_hfs_and_wait_for_xmins +sub recreate_slots_change_hfs_and_wait_for_xmins { - my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_; + my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated, $activate) = @_; # drop the logical slots drop_logical_slots($previous_slot_prefix); @@ -203,8 +203,11 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
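The race described in point 2) above is a classic check-then-act problem: the slot's mutex is released between terminating the backend and reporting the slot obsolete, so effective_xmin/effective_catalog_xmin can advance inside that window. A schematic illustration of the defensive re-check (Python, deterministic simulation; the names are illustrative and this is not the actual PostgreSQL code):

```python
import threading

lock = threading.Lock()
slot = {"effective_xmin": 50, "invalidated": False}

def invalidate_if_obsolete(horizon, racing_update=None):
    """Invalidate the slot if its horizon lags behind `horizon`.

    Mirrors the two-step shape of the real code: decide under the lock,
    release it to act (terminate the backend), then re-take the lock.
    """
    with lock:
        obsolete = slot["effective_xmin"] < horizon
    if not obsolete:
        return False
    # Lock released: in the real code this is where the active backend is
    # terminated, and where effective_xmin may concurrently advance.
    if racing_update is not None:
        racing_update()
    with lock:
        # The fix: re-check the horizon instead of trusting the stale
        # decision taken before the lock was released.
        if slot["effective_xmin"] >= horizon:
            return False
        slot["invalidated"] = True
        return True

# A concurrent advance of effective_xmin lands in the unlocked window,
# so the stale "obsolete" decision is discarded on the re-check:
raced = invalidate_if_obsolete(
    100, racing_update=lambda: slot.update(effective_xmin=150))
print("raced invalidation applied:", raced)       # False
print("slot invalidated:", slot["invalidated"])   # False

# Without interference (and a horizon beyond 150) invalidation proceeds:
print("clean invalidation applied:", invalidate_if_obsolete(200))  # True
```

Without the re-check, the first call would report the slot obsolete even though its horizon had already moved past the threshold — which matches the "terminating" message not being followed by an "obsolete" message in the failing runs.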
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Wed, Jan 10, 2024 at 05:00:01PM +0300, Alexander Lakhin wrote:
> 10.01.2024 12:46, Bertrand Drouvot wrote:
> >
> > Would it be possible to also send the standby logs?
>
> Yes, please look at the attached logs. This time I've built postgres with
> -DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
> wal_keep_size=1GB
> wal_debug = on
> log_autovacuum_min_duration = 0
> log_statement = 'all'
> log_min_messages = INFO
>
> The archive attached contains logs from four runs:
> recovery-1-ok -- an example of a successful run, for reference
> recovery-7-pruning and recovery-19-pruning -- failures with a failed
> subtest 'activeslot slot invalidation is logged with on-access pruning'
> recovery-15-vacuum_pg_class -- a failure with a failed
> subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

Thanks a lot for the testing!

> is an absent message 'obsolete replication slot "row_removal_activeslot"'

Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log
here:

Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().

In case of an active slot we first call ReportSlotInvalidation() with the
second parameter set to true (to emit the "terminating" message), then SIGTERM
the active process, and then (later) we should issue the other
ReportSlotInvalidation() call with the second parameter set to false (to emit
the message that we don't see here).

So it means InvalidatePossiblyObsoleteSlot() did not trigger the second
ReportSlotInvalidation() call.

The thing is that it looks like we exited the loop in
InvalidatePossiblyObsoleteSlot(), because there are more messages from the
startup process (789618) after the:

"
2024-01-10 11:00:29.109 UTC [789618] LOG: terminating process 790377 to release replication slot "row_removal_activeslot"
"

one.
Do you think you could try to add more debug messages in InvalidatePossiblyObsoleteSlot() to understand why the second call to ReportSlotInvalidation() is not done and IIUC why/how we exited the loop? FWIW, I did try to reproduce by launching pg_recvlogical and then kill -SIGSTOP it. Then producing a conflict, I'm able to get the first message and not the second one (which is expected). But the startup process does not exit the loop, which is expected here. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Michael, it definitely increases stability of the test (tens of iterations
> with 20 tests in parallel performed successfully),

Thanks for testing!

> although I've managed to
> see another interesting failure (twice):
> 13 # Failed test 'activeslot slot invalidation is logged with vacuum
> on pg_class'
> 13 # at t/035_standby_logical_decoding.pl line 227.

Looking at the attached log files and particularly
1/regress_log_035_standby_logical_decoding:

"
[11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class
[11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class
"

That seems weird: the inactive slot has been invalidated while the active one
has not. While it takes a bit longer to invalidate an active slot, I don't
think the test can move on until both are invalidated (then leading to tests
24 and 25). I can see the tests are very slow to run (13.993s for 24), but I
still don't get how 24 could succeed while 25 does not.

Looking at 2/regress_log_035_standby_logical_decoding:

"
[13:41:02.076](20.279s) ok 23 - inactiveslot slot invalidation is logged with vacuum on pg_class
[13:41:02.076](0.000s) not ok 24 - activeslot slot invalidation is logged with vacuum on pg_class
"

Same "weird" behavior, but this time the test numbering is not the same (23
and 24). That is even more weird, as those should be tests 24 and 25.

Would it be possible to also send the standby logs?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Thus just adding FREEZE is not enough, seemingly. It makes me wonder if
> 0174c2d21 should be superseded by a patch like discussed (or just have
> autovacuum = off added)...

Adding an extra FREEZE offers extra insurance, so I don't see why it would be
a problem to add it to stabilize the horizon conflicts on the standbys.

> 09.01.2024 07:59, Michael Paquier wrote:
> Bertrand, thank you for updating the patch!
>
> Michael, it definitely increases stability of the test (tens of iterations
> with 20 tests in parallel performed successfully), although I've managed to
> see another interesting failure (twice):
> 13 # Failed test 'activeslot slot invalidation is logged with vacuum
> on pg_class'
> 13 # at t/035_standby_logical_decoding.pl line 227.

Something I'd like to confirm here: you still see this failure with the patch,
but without an extra FREEZE, right? If we do both, the test would get more
stable, wouldn't it?
--
Michael

signature.asc
Description: PGP signature
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hello Michael and Bertrand, I'd also like to note that even with FREEZE added [1], I happened to see the test failure: 5 # Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class' 5 # at t/035_standby_logical_decoding.pl line 222. 5 5 # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class' 5 # at t/035_standby_logical_decoding.pl line 227. where 035_standby_logical_decoding_primary.log contains: ... 2024-01-09 07:44:26.480 UTC [820142] 035_standby_logical_decoding.pl LOG: statement: DROP TABLE conflict_test; 2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl LOG: statement: VACUUM (VERBOSE, FREEZE) pg_class; 2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl INFO: aggressively vacuuming "testdb.pg_catalog.pg_class" 2024-01-09 07:44:27.099 UTC [820143] DEBUG: autovacuum: processing database "testdb" 2024-01-09 07:44:27.102 UTC [820142] 035_standby_logical_decoding.pl INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1 pages: 0 removed, 11 remain, 11 scanned (100.00% of total) tuples: 0 removed, 423 remain, 4 are dead but not yet removable removable cutoff: 762, which was 2 XIDs old when operation ended new relfrozenxid: 762, which is 2 XIDs ahead of previous value frozen: 1 pages from table (9.09% of total) had 1 tuples frozen Thus just adding FREEZE is not enough, seemingly. It makes me wonder if 0174c2d21 should be superseded by a patch like discussed (or just have autovacuum = off added)... 09.01.2024 07:59, Michael Paquier wrote: Alexander, does the test gain in stability once you begin using the patch posted on [2], mentioned by Bertrand? (Also, perhaps we'd better move the discussion to the other thread where the patch has been sent.) [2]: https://www.postgresql.org/message-id/d40d015f-03a4-1cf2-6c1f-2b9aca860...@gmail.com 09.01.2024 08:29, Bertrand Drouvot wrote: Alexander, pleae find attached v3 which is more or less a rebased version of it. 
Bertrand, thank you for updating the patch! Michael, it definitely increases stability of the test (tens of iterations with 20 tests in parallel performed successfully), although I've managed to see another interesting failure (twice): 13 # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class' 13 # at t/035_standby_logical_decoding.pl line 227. psql::1: INFO: vacuuming "testdb.pg_catalog.pg_class" psql::1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1 pages: 0 removed, 11 remain, 11 scanned (100.00% of total) tuples: 4 removed, 419 remain, 0 are dead but not yet removable removable cutoff: 754, which was 0 XIDs old when operation ended ... Waiting for replication conn standby's replay_lsn to pass 0/403E6F8 on primary But I see no VACUUM records in WAL: rmgr: Transaction len (rec/tot): 222/ 222, tx: 0, lsn: 0/0403E468, prev 0/0403E370, desc: INVALIDATION ; inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 relcache 2662 relcache 2663 relcache 3455 relcache 1259 rmgr: Standby len (rec/tot): 234/ 234, tx: 0, lsn: 0/0403E548, prev 0/0403E468, desc: INVALIDATIONS ; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 relcache 2662 relcache 2663 relcache 3455 relcache 1259 rmgr: Heap len (rec/tot): 60/ 140, tx: 754, lsn: 0/0403E638, prev 0/0403E548, desc: INSERT off: 2, flags: 0x08, blkref #0: rel 1663/16384/16385 blk 0 FPW rmgr: Transaction len (rec/tot): 46/ 46, tx: 754, lsn: 0/0403E6C8, prev 0/0403E638, desc: COMMIT 2024-01-09 13:40:59.873385 UTC rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0403E6F8, prev 0/0403E6C8, desc: RUNNING_XACTS nextXid 755 latestCompletedXid 754 oldestRunningXid 755 rmgr: XLOG len (rec/tot): 30/ 30, tx: 0, lsn: 0/0403E730, prev 0/0403E6F8, desc: CHECKPOINT_REDO rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0403E750, prev 
0/0403E730, desc: RUNNING_XACTS nextXid 755 latestCompletedXid 754 oldestRunningXid 755 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/0403E788, prev 0/0403E750, desc: CHECKPOINT_ONLINE redo 0/403E730; tli 1; prev tli 1; fpw true; xid 0:755; oid 24576; multi 1; offset 0; oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 755; online rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0403E800, prev 0/0403E788, desc: RUNNING_XACTS nextXid 755 latestCompletedXid 754 oldestRunningXid 755 (Full logs are attached.) [1] https://www.postgresql.org/message-id/4fd52508-54d7-0202-5bd3-546c2295967f%40gmail.com Best regards, Alexander
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi,

On 5/30/23 12:34 PM, Drouvot, Bertrand wrote:
> Hi hackers,
>
> Please find attached a patch proposal to $SUBJECT.
>
> Indeed, we have seen occurrences in [1] where some slots were not invalidated
> (while we expected vacuum to remove dead rows, leading to slots invalidation
> on the standby). Though we don't have strong evidence that this was due to
> transactions holding back the global xmin (as vacuum did not run in verbose
> mode), suspicion is high enough (as Tom pointed out that the test is broken
> on its face, see [1]).
>
> The proposed patch:
>
> - sets autovacuum = off on the primary (as autovacuum is the usual suspect
>   for holding back the global xmin).
> - Ensures that vacuum is able to remove dead rows before launching the slots
>   invalidation tests.
> - If after 10 attempts the vacuum is still not able to remove the dead rows,
>   the slots invalidation tests are skipped: that should be pretty rare, as
>   currently the majority of the tests are green (with only one attempt).
>
> While at it, the patch also addresses the nitpicks mentioned by Robert in [2].
>
> [1]: https://www.postgresql.org/message-id/flat/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9%40OSZPR01MB6310.jpnprd01.prod.outlook.com#71898e088d2a57564a1bd9c41f3e6f36
> [2]: https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com

Please find attached V2 that, instead of the above proposal, waits for a new snapshot that has a newer horizon before doing the vacuum (as proposed by Andres in [1]).

So, V2:

- sets autovacuum = off on the primary (as autovacuum is the usual suspect for holding back the global xmin).
- waits for a new snapshot that has a newer horizon before doing the vacuum(s).
- addresses the nitpicks mentioned by Robert in [2].

V2 also keeps the verbose mode for the vacuum(s) (as done in V1), as it may help further analysis if needed.
[1]: https://www.postgresql.org/message-id/20230530152426.ensapay7pozh7ozn%40alap3.anarazel.de
[2]: https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 383bfcf39257d4542b35ffe2ab56b43182ca2dea Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Tue, 30 May 2023 07:54:02 +
Subject: [PATCH v2] Ensure vacuum is able to remove dead rows in
 035_standby_logical_decoding

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation
on the standby.
---
 .../t/035_standby_logical_decoding.pl | 76 +--
 1 file changed, 37 insertions(+), 39 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 64beec4bd3..bd426f3068 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -185,7 +185,7 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check conflicting status in pg_replication_slots.
 sub check_slots_conflicting_status
 {
-	my ($conflicting) = @_;
+	my ($conflicting, $details) = @_;
 
 	if ($conflicting)
 	{
@@ -193,7 +193,7 @@ sub check_slots_conflicting_status
 			'postgres', qq(
 				select bool_and(conflicting) from pg_replication_slots;));
 
-		is($res, 't', "Logical slots are reported as conflicting");
+		is($res, 't', "logical slots are reported as conflicting $details");
 	}
 	else
 	{
@@ -201,7 +201,7 @@ sub check_slots_conflicting_status
 			'postgres', qq(
 				select bool_or(conflicting) from pg_replication_slots;));
 
-		is($res, 'f', "Logical slots are reported as non conflicting");
+		is($res, 'f', "logical slots are reported as non conflicting $details");
 	}
 }
 
@@ -256,6 +256,22 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	my $xid = $node_primary->safe_psql('testdb', qq[$sql
+		select txid_current();]);
+
+	$node_primary->poll_query_until('testdb',
+		"SELECT (select txid_snapshot_xmin(txid_current_snapshot()) - $xid) > 0")
+	  or die "new snapshot does not have a newer horizon";
+
+	$node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose $to_vac;
+		INSERT INTO flush_wal DEFAULT VALUES; --
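The core of the v2 approach is: record a marker xid, poll until the snapshot horizon has moved past it, then run the VACUUM. A minimal, self-contained sketch of that polling logic (in Python rather than the test's Perl, with a simulated xmin feed standing in for the real `txid_snapshot_xmin(txid_current_snapshot())` query):

```python
import itertools
import time

def poll_until(predicate, timeout=5.0, interval=0.01):
    """Poll predicate() until it returns true or the timeout expires,
    mimicking PostgreSQL::Test::Cluster::poll_query_until."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Simulated horizon: the snapshot xmin advances past the marker xid after
# a few polls (standing in for concurrent transactions finishing).
marker_xid = 754
xmins = itertools.chain([752, 753, 754], itertools.repeat(755))

def horizon_advanced():
    # Stand-in for: SELECT txid_snapshot_xmin(txid_current_snapshot()) - $xid > 0
    return next(xmins) - marker_xid > 0

assert poll_until(horizon_advanced)
print("horizon advanced; vacuum can now remove the dead rows")
```

Once the poll succeeds, no still-running transaction can hold the dead rows' xids below the horizon, so the subsequent VACUUM is guaranteed to be able to remove them.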
Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi hackers,

Please find attached a patch proposal to $SUBJECT.

Indeed, we have seen occurrences in [1] where some slots were not invalidated (while we expected vacuum to remove dead rows, leading to slots invalidation on the standby). Though we don't have strong evidence that this was due to transactions holding back the global xmin (as vacuum did not run in verbose mode), suspicion is high enough (as Tom pointed out that the test is broken on its face, see [1]).

The proposed patch:

- sets autovacuum = off on the primary (as autovacuum is the usual suspect for holding back the global xmin).
- ensures that vacuum is able to remove dead rows before launching the slots invalidation tests.
- skips the slots invalidation tests if, after 10 attempts, vacuum is still not able to remove the dead rows: that should be pretty rare, as currently the majority of the tests are green (with only one attempt).

While at it, the patch also addresses the nitpicks mentioned by Robert in [2].

[1]: https://www.postgresql.org/message-id/flat/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9%40OSZPR01MB6310.jpnprd01.prod.outlook.com#71898e088d2a57564a1bd9c41f3e6f36
[2]: https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From a5c6ef80cae1fb7606ff46679422238fdceb5cc8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Tue, 30 May 2023 07:54:02 +
Subject: [PATCH v1] Ensure vacuum did remove dead rows in
 035_standby_logical_decoding

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation
on the standby.
---
 .../t/035_standby_logical_decoding.pl | 263 ++
 1 file changed, 150 insertions(+), 113 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 64beec4bd3..d26d9bcace 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -9,6 +9,7 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use Time::HiRes qw(usleep);
 
 my ($stdin, $stdout, $stderr,
	$cascading_stdout, $cascading_stderr, $subscriber_stdin,
@@ -23,6 +24,9 @@ my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
 my $psql_timeout = IPC::Run::timer($default_timeout);
 my $res;
+my ($vacret, $vacstdout, $vacstderr) = ('', '', '');
+my $nb_vac = 0;
+my $vac_attempts = 10;
 
 # Name for the physical slot on primary
 my $primary_slotname = 'primary_physical';
@@ -185,7 +189,7 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check conflicting status in pg_replication_slots.
 sub check_slots_conflicting_status
 {
-	my ($conflicting) = @_;
+	my ($conflicting, $details) = @_;
 
 	if ($conflicting)
 	{
@@ -193,7 +197,7 @@ sub check_slots_conflicting_status
 			'postgres', qq(
 				select bool_and(conflicting) from pg_replication_slots;));
 
-		is($res, 't', "Logical slots are reported as conflicting");
+		is($res, 't', "logical slots are reported as conflicting $details");
 	}
 	else
 	{
@@ -201,7 +205,7 @@ sub check_slots_conflicting_status
 			'postgres', qq(
 				select bool_or(conflicting) from pg_replication_slots;));
 
-		is($res, 'f', "Logical slots are reported as non conflicting");
+		is($res, 'f', "logical slots are reported as non conflicting $details");
 	}
 }
 
@@ -256,6 +260,29 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Try ($vac_attempts times max) to vacuum until it is able to remove dead rows.
+sub vacuum_until_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	($vacret, $vacstdout, $vacstderr) = ('', '', '');
+	$nb_vac = 0;
+
+	while ($nb_vac <= $vac_attempts)
+	{
+		last if ($vacstderr =~ /0 dead row versions cannot be removed yet/ ||
+			$vacstderr =~ /0 are dead but not yet removable/);
+		# This should trigger the conflict
+		($vacret, $vacstdout, $vacstderr) = $node_primary->psql(
+			'testdb', qq[
+			$sql
+			VACUUM $vac_option verbose $to_vac;
+			INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
+			]);
+		$nb_vac++;
+		usleep(100_000);
+	}
+}

 # Initialize primary node
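The v1 retry loop above boils down to: re-run VACUUM until its verbose output reports zero dead-but-unremovable rows, giving up after a fixed number of attempts. For illustration only, a sketch of that control flow (in Python rather than the test's Perl, with a fake vacuum runner in place of $node_primary->psql):

```python
import re

# The two verbose-VACUUM phrasings the patch checks for ("0 dead ..."
# means nothing is being held back by other transactions anymore).
DONE = re.compile(
    r"0 dead row versions cannot be removed yet|0 are dead but not yet removable")

def vacuum_until_can_remove(run_vacuum, max_attempts=10):
    """Re-run run_vacuum() until its stderr says no dead rows are held back.
    Returns the number of attempts used, or None if we gave up."""
    for attempt in range(1, max_attempts + 1):
        stderr = run_vacuum()
        if DONE.search(stderr):
            return attempt
    return None

# Fake vacuum runner: dead rows become removable on the third run.
outputs = iter([
    "2 dead row versions cannot be removed yet",
    "2 dead row versions cannot be removed yet",
    "0 dead row versions cannot be removed yet",
])
print(vacuum_until_can_remove(lambda: next(outputs)))  # 3
```

A None result maps to the "skip the slots invalidation tests" branch of the v1 proposal; the v2 patch later replaced this retry loop with the snapshot-horizon wait, which needs only a single VACUUM.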