Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 03:39:34PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 06:29:17AM +, Bertrand Drouvot wrote:
> > Thanks for the report! I think it makes sense to add it to the list of known
> > failures.
> > 
> > One way to deal with those corner cases could be to make use of injection points
> > around places where RUNNING_XACTS is emitted, thoughts?
> 
> Ah.  You mean to force a wait in the code path generating the standby
> snapshots for the sake of this test?

Yeah.

>  That's interesting, I like it.

Great, will look at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Michael Paquier
On Mon, Jun 10, 2024 at 06:29:17AM +, Bertrand Drouvot wrote:
> Thanks for the report! I think it makes sense to add it to the list of known
> failures.
> 
> One way to deal with those corner cases could be to make use of injection points
> around places where RUNNING_XACTS is emitted, thoughts?

Ah.  You mean to force a wait in the code path generating the standby
snapshots for the sake of this test?  That's interesting, I like it.
--
Michael




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Bertrand Drouvot
Hi Alexander,

On Sat, Jun 08, 2024 at 07:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand and Michael,
> 
> 23.01.2024 11:07, Bertrand Drouvot wrote:
> > On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> > 
> > > Anyway, that's not the end of it.  What should we do for snapshot
> > > snapshot records coming from the bgwriter?
> > What about?
> > 
> > 3) depending on how stabilized this test (and others that suffer from "random"
> > xl_running_xacts) is, then think about the bgwriter.
> 
> A recent buildfarm failure [1] reminds me of that remaining question.
> 
> It's hard to determine from this info, why row_removal_activeslot was not
> invalidated, but running this test on a slowed down Windows VM, I (still)
> get the same looking failures caused by RUNNING_XACTS appeared just before
> `invalidating obsolete replication slot "row_removal_inactiveslot"`.
> So I would consider this failure as yet another result of bgwriter activity
> and add it to the list of known failures as such...

Thanks for the report! I think it makes sense to add it to the list of known
failures.

One way to deal with those corner cases could be to make use of injection points
around places where RUNNING_XACTS is emitted, thoughts?
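For illustration, a minimal sketch of what that could look like in the TAP test. The injection_points test module and its SQL functions (injection_points_attach/wakeup/detach) exist in PostgreSQL 17's source tree, but the point name used below is hypothetical: core would first need an INJECTION_POINT() call next to where LogStandbySnapshot() emits RUNNING_XACTS.

```perl
# Hypothetical sketch (TAP-test style): freeze RUNNING_XACTS emission while
# the slot-invalidation scenario runs.  The point name "log-standby-snapshot"
# is an assumption, not an existing core injection point.
$node_primary->safe_psql('postgres',
	qq[CREATE EXTENSION injection_points;
	   SELECT injection_points_attach('log-standby-snapshot', 'wait');]);

# ... run the vacuum / slot-invalidation checks here, with no risk of a
# concurrent xl_running_xacts advancing the slot's catalog_xmin ...

# Let any process waiting on the point resume, then remove the point.
$node_primary->safe_psql('postgres',
	qq[SELECT injection_points_wakeup('log-standby-snapshot');
	   SELECT injection_points_detach('log-standby-snapshot');]);
```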

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-07 Thread Alexander Lakhin

Hello Bertrand and Michael,

23.01.2024 11:07, Bertrand Drouvot wrote:

On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:


Anyway, that's not the end of it.  What should we do for snapshot
snapshot records coming from the bgwriter?

What about?

3) depending on how stabilized this test (and others that suffer from "random"
xl_running_xacts) is, then think about the bgwriter.


A recent buildfarm failure [1] reminds me of that remaining question.
Here we have a slow machine (a successful run, for example [2], shows
541.13s duration of the test) and the following information logged:

[13:55:13.725](34.411s) ok 25 - inactiveslot slot invalidation is logged with vacuum on pg_class
[13:55:13.727](0.002s) not ok 26 - activeslot slot invalidation is logged with vacuum on pg_class
[13:55:13.728](0.001s) #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
#   at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
[14:27:42.995](1949.267s) # poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[14:27:42.999](0.004s) not ok 27 - confl_active_logicalslot updated
[14:27:43.000](0.001s) #   Failed test 'confl_active_logicalslot updated'
#   at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
Timed out waiting confl_active_logicalslot to be updated at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.


---
035_standby_logical_decoding_standby.log:
2024-06-06 13:55:07.715 UTC [9172:7] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
2024-06-06 13:55:07.715 UTC [9172:8] DETAIL:  The slot conflicted with xid horizon 754.
2024-06-06 13:55:07.715 UTC [9172:9] CONTEXT:  WAL redo at 0/4020A80 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 754, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 1, nunused: 0, dead: [48]; blkref #0: rel 1663/16384/2610, blk 0

2024-06-06 13:55:14.372 UTC [7532:1] [unknown] LOG:  connection received: host=127.0.0.1 port=55328
2024-06-06 13:55:14.381 UTC [7532:2] [unknown] LOG:  connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_hba.conf:2)
2024-06-06 13:55:14.381 UTC [7532:3] [unknown] LOG:  connection authorized: user=pgrunner database=postgres application_name=035_standby_logical_decoding.pl
2024-06-06 13:55:14.443 UTC [7532:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 13:55:14.452 UTC [7532:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.090 user=pgrunner database=postgres host=127.0.0.1 port=55328

# (there is no `invalidating obsolete replication slot "row_removal_activeslot"` message)
...
2024-06-06 14:27:42.675 UTC [4032:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 14:27:42.681 UTC [4032:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.080 user=pgrunner database=postgres host=127.0.0.1 port=58713
2024-06-06 14:27:43.095 UTC [7892:2] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly

    This probably means the server terminated abnormally
    before or while processing the request.

It's hard to determine from this info, why row_removal_activeslot was not
invalidated, but running this test on a slowed down Windows VM, I (still)
get the same looking failures caused by RUNNING_XACTS appeared just before
`invalidating obsolete replication slot "row_removal_inactiveslot"`.
So I would consider this failure as yet another result of bgwriter activity
and add it to the list of known failures as such...

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2024-06-06%2012%3A36%3A11
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo=2024-06-05%2017%3A03%3A13=misc-check

Best regards,
Alexander




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-23 Thread Bertrand Drouvot
Hi,

On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> On Mon, Jan 22, 2024 at 09:07:45AM +, Bertrand Drouvot wrote:
> 
> I've rewritten some of that, and applied the patch down to v16.

Thanks!

> > Yeah, I can clearly see how this patch helps from a theoretical perspective but
> > rely on Alexander's testing to see how it actually stabilizes the test.
> 
> Anyway, that's not the end of it.  What should we do for snapshot
> snapshot records coming from the bgwriter?

I've mixed feeling about it. On one hand we'll decrease even more the risk of
seeing a xl_running_xacts in the middle of a test, but on the other hand I agree
with Tom's concern [1]: I think that we could miss "corner cases/bug" detection
(like the one reported in [2]).

What about?

1) Apply "wait_until_vacuum_can_remove() + autovacuum disabled" where it makes
sense and for tests that suffers from random xl_running_xacts. I can look at
031_recovery_conflict.pl, do you have others in mind?
2) fix [2]
3) depending on how stabilized this test (and others that suffer from "random"
xl_running_xacts) is, then think about the bgwriter.
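For context, the wait in item 1) boils down to comparing the snapshot xmin before and after the workload. pg_current_snapshot() and pg_snapshot_xmin() are real built-ins (v13+); the surrounding Perl is only a sketch in the style of the TAP tests:

```perl
# Sketch of wait_until_vacuum_can_remove()'s core idea ($node_primary and
# the 'testdb' database are assumed from the test's setup code).
my $xid_horizon = $node_primary->safe_psql('testdb',
	qq[SELECT pg_snapshot_xmin(pg_current_snapshot());]);

# Run the workload creating the dead rows, then wait until the xid horizon
# has moved past the captured value, so the upcoming VACUUM is guaranteed
# to be able to remove those dead rows.
$node_primary->poll_query_until('testdb',
	qq[SELECT pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon > 0;])
  or die "new snapshot does not have a newer horizon";
```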

[1]: https://www.postgresql.org/message-id/1375923.1705291719%40sss.pgh.pa.us
[2]: 
https://www.postgresql.org/message-id/flat/zatjw2xh+tquc...@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-22 Thread Michael Paquier
On Mon, Jan 22, 2024 at 09:07:45AM +, Bertrand Drouvot wrote:
> On Mon, Jan 22, 2024 at 03:54:44PM +0900, Michael Paquier wrote:
>> Also, wouldn't it be better to document in the test why
>> txid_current_snapshot() is chosen?  Contrary to txid_current(), it
>> does not generate a Transaction/COMMIT to make the test more
> > predictable, something you have mentioned upthread, and implied in the
>> test.
> 
> Good point, added more comments in v8 (but not mentioning txid_current() as,
> after giving it more thought, I think it was not the right approach in any
> case).

Fine by me.

>> -  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
>> 
>> This removes two INSERTs on flush_wal and refactors the code to do the
>> INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
>> document a reference about the reason why an INSERT is used.  Couldn't
>> you just use a normal comment here?
> 
> Agree, done in v8.

I've rewritten some of that, and applied the patch down to v16.  Let's
see how this stabilizes things, but that's likely going to take some
time as it depends on skink's mood.

>> I need to review what you have more thoroughly, but is it OK to assume
>> that both of you are happy with the latest version of the patch in
>> terms of stability gained?  That's still not the full picture, still a
>> good step towards that.
> 
> Yeah, I can clearly see how this patch helps from a theoretical perspective but
> rely on Alexander's testing to see how it actually stabilizes the test.

Anyway, that's not the end of it.  What should we do for snapshot
snapshot records coming from the bgwriter?  The slower the system, the
higher the odds of hitting a conflict with such records, even if the
horizon check should help.
--
Michael




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-22 Thread Bertrand Drouvot
Hi,

On Mon, Jan 22, 2024 at 03:54:44PM +0900, Michael Paquier wrote:
> On Fri, Jan 19, 2024 at 09:03:01AM +, Bertrand Drouvot wrote:
> > On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> +# Launch $sql and wait for a new snapshot that has a newer horizon before
> +# doing the vacuum with $vac_option on $to_vac.
> +sub wait_until_vacuum_can_remove
> 
> This had better document what the arguments of this routine are,
> because that's really unclear.  $to_vac is the relation that will be
> vacuumed, for one.

Agree, done that way in v8 attached.

> Also, wouldn't it be better to document in the test why
> txid_current_snapshot() is chosen?  Contrary to txid_current(), it
> does not generate a Transaction/COMMIT to make the test more
> predictable, something you have mentioned upthread, and implied in the
> test.

Good point, added more comments in v8 (but not mentioning txid_current() as,
after giving it more thought, I think it was not the right approach in any
case).

> 
> -  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
> 
> This removes two INSERTs on flush_wal and refactors the code to do the
> INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
> document a reference about the reason why an INSERT is used.  Couldn't
> you just use a normal comment here?

Agree, done in v8.

> >> I've re-tested the v6 patch today and confirmed that it makes the test
> >> more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
> >> parallel and got failures ('inactiveslot slot invalidation is logged with
> >> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> >> (With unlimited CPU, when average test duration is around 70 seconds.)
> >> 
> >> But with v6 applied, 60 iterations succeeded.
> > 
> > Nice! Thanks for the testing!
> 
> I need to review what you have more thoroughly, but is it OK to assume
> that both of you are happy with the latest version of the patch in
> terms of stability gained?  That's still not the full picture, still a
> good step towards that.

Yeah, I can clearly see how this patch helps from a theoretical perspective but
rely on Alexander's testing to see how it actually stabilizes the test.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 1380408965d8c7cb1a2e63043af7d1b8033a7e89 Mon Sep 17 00:00:00 2001
From: bdrouvot 
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v8] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

While at it, also fixing some typos/bad test description in it.
---
 .../t/035_standby_logical_decoding.pl | 78 +++
 1 file changed, 45 insertions(+), 33 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..6664405772 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -238,6 +238,35 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum. Arguments are: $vac_option (the option to be passed to the
+# vacuum command), $sql (the sql to launch before triggering the vacuum) and
+# $to_vac (the relation to vacuum).
+# Note that to get the horizon we're using pg_current_snapshot() (it does not
+# generate a Transaction/COMMIT wal record, so we don't increase the risk to see
+# a xl_running_xacts that would advance active replication slot's catalog_xmin).
+# Indeed advancing the active replication slot's catalog_xmin would break some
+# tests that expect the active slot to conflict with the catalog xmin horizon.
+sub wait_until_vacuum_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	# Get the current xid horizon
+	my $xid_horizon = $node_primary->safe_psql('testdb', qq[select pg_snapshot_xmin(pg_current_snapshot());]);
+
+	# Launch our sql
+	$node_primary->safe_psql('testdb', qq[$sql]);
+
+	# Wait until we get a newer horizon
+	$node_primary->poll_query_until('testdb',
+		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0")
+	  or die "new snapshot does not have a newer horizon";
+
+	# Launch the vacuum command and insert into flush_wal (see create table
+	# flush_wal for the reason why).
+	$node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose $to_vac;
+		  INSERT INTO flush_wal DEFAULT VALUES;]);
+}
 
 # Initialize primary node
 
@@ -248,6 +277,7 @@ $node_primary->append_conf(
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4

Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-21 Thread Michael Paquier
On Fri, Jan 19, 2024 at 09:03:01AM +, Bertrand Drouvot wrote:
> On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
>> 15.01.2024 12:00, Alexander Lakhin wrote:
>>> If this approach looks promising to you, maybe we could add a submodule to
>>> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
>>> 019_replslot_limit) as well.
>>> 
>>> Personally I think that having such a functionality for using in tests
>>> might be useful not only to avoid some "problematic" behaviour but also to
>>> test the opposite cases.
>> 
>> After spending a few days on it, I've discovered two more issues:
>> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
>> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
>> 
>> (The latter is not related to bgwriter directly, but it was discovered
>> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
>> 
>> So it becomes clear that the 035 test is not the only one which might
>> suffer from bgwriter's activity,
> 
> Yeah... thanks for sharing!
> 
>> and inventing a way to stop bgwriter/
>> control it is a different subject, getting out of scope of the current
>> issue.
> 
> Agree.
> 
>> 15.01.2024 11:49, Bertrand Drouvot wrote:
>> Maybe it would be a right move to commit the fix, and then think about
>> more rare failures.
> 
> +1

Yeah, agreed to make things more stable before making them fancier.

>> 2) Shall we align the 035_standby_logical_decoding with
>> 031_recovery_conflict in regard to improving stability of vacuum?
> 
> Yeah, I think that could make sense.

Probably.  That can always be done as a separate change, after a few
runs of the slow buildfarm members able to reproduce the failure.

>> I see the following options for this:
>> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
>> b) use FREEZE and autovacuum = off in both tests;
>> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>>  autovacuum = off in both.
> 
> I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
> sure using FREEZE would give full "stabilization predictability", at least for
> 035_standby_logical_decoding.pl). That said I did not test what the outcome would
> be for 031_recovery_conflict.pl by making use of a).

Yeah, I think I agree here with a), as v7 does for 035.

+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove

This had better document what the arguments of this routine are,
because that's really unclear.  $to_vac is the relation that will be
vacuumed, for one.

Also, wouldn't it be better to document in the test why
txid_current_snapshot() is chosen?  Contrary to txid_current(), it
does not generate a Transaction/COMMIT to make the test more
predictable, something you have mentioned upthread, and implied in the
test.

-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal

This removes two INSERTs on flush_wal and refactors the code to do the
INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
document a reference about the reason why an INSERT is used.  Couldn't
you just use a normal comment here?

>> I've re-tested the v6 patch today and confirmed that it makes the test
>> more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
>> parallel and got failures ('inactiveslot slot invalidation is logged with
>> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
>> (With unlimited CPU, when average test duration is around 70 seconds.)
>> 
>> But with v6 applied, 60 iterations succeeded.
> 
> Nice! Thanks for the testing!

I need to review what you have more thoroughly, but is it OK to assume
that both of you are happy with the latest version of the patch in
terms of stability gained?  That's still not the full picture, still a
good step towards that.
--
Michael




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-19 Thread Bertrand Drouvot
Hi,

On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> Hello,
> 
> 15.01.2024 12:00, Alexander Lakhin wrote:
> > If this approach looks promising to you, maybe we could add a submodule to
> > perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> > 019_replslot_limit) as well.
> > 
> > Personally I think that having such a functionality for using in tests
> > might be useful not only to avoid some "problematic" behaviour but also to
> > test the opposite cases.
> 
> After spending a few days on it, I've discovered two more issues:
> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
> 
> (The latter is not related to bgwriter directly, but it was discovered
> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
> 
> So it becomes clear that the 035 test is not the only one which might
> suffer from bgwriter's activity,

Yeah... thanks for sharing!

> and inventing a way to stop bgwriter/
> control it is a different subject, getting out of scope of the current
> issue.

Agree.

> 15.01.2024 11:49, Bertrand Drouvot wrote:
> > We did a few things in this thread, so to sum up what we've discovered:
> > 
> > - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> > - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
> > (proposal in v6 attached)
> > - we need a way to control how frequently xl_running_xacts are emitted (to ensure
> > they are not triggered in the middle of an active slot invalidation test).
> > 
> > I'm not sure it's possible to address Tom's concern and keep the test "predictable".
> > 
> > So, I think I'd vote for Michael's proposal to implement a superuser-settable
> > developer GUC (as sending a SIGSTOP on the bgwriter (and bypass $windows_os) would
> > still not address Tom's concern anyway).
> > 
> > Another option would be to "sacrifice" the full predictability of the test (in
> > favor of real-world behavior testing)?
> > 
> > [1]: 
> > https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal
> 
> So, now we have the test 035 failing due to nondeterministic vacuum
> activity in the first place, and due to bgwriter's activity in the second.

Yeah, that's also my understanding.

> Maybe it would be a right move to commit the fix, and then think about
> more rare failures.

+1

> Though I have a couple of question regarding the fix left, if you don't
> mind:
> 1) The test has minor defects in the comments, that I noted before [1];
> would you like to fix them in passing?
> 
> > BTW, it looks like the comment:
> > # One way to produce recovery conflict is to create/drop a relation and
> > # launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
> > in scenario 3 is a copy-paste error.

Nice catch, thanks! Fixed in v7 attached.

> > Also, there are two "Scenario 4" in this test.

D'oh! Fixed in v7.

> > 
> 
> 2) Shall we align the 035_standby_logical_decoding with
> 031_recovery_conflict in regard to improving stability of vacuum?

Yeah, I think that could make sense.

> I see the following options for this:
> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
> b) use FREEZE and autovacuum = off in both tests;
> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>  autovacuum = off in both.
>

I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
sure using FREEZE would give full "stabilization predictability", at least for
035_standby_logical_decoding.pl). That said I did not test what the outcome would
be for 031_recovery_conflict.pl by making use of a).

> I've re-tested the v6 patch today and confirmed that it makes the test
> more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
> parallel and got failures ('inactiveslot slot invalidation is logged with
> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> (With unlimited CPU, when average test duration is around 70 seconds.)
> 
> But with v6 applied, 60 iterations succeeded.

Nice! Thanks for the testing!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b45f0c55d17cddfac7c99d11000b819c8b09fa56 Mon Sep 17 00:00:00 2001
From: bdrouvot 
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v7] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

While at it, also fixing some typos/bad test description in it.
---
 .../t/035_standby_logical_decoding.pl | 68 ++-
 1 file changed, 35 insertions(+), 33 deletions(-)
 100.0% src/test/recovery/t/

diff --git 

Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-18 Thread Alexander Lakhin

Hello,

15.01.2024 12:00, Alexander Lakhin wrote:

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally I think that having such a functionality for using in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.


After spending a few days on it, I've discovered two more issues:
https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com

(The latter is not related to bgwriter directly, but it was discovered
thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)

So it becomes clear that the 035 test is not the only one which might
suffer from bgwriter's activity, and inventing a way to stop bgwriter/
control it is a different subject, getting out of scope of the current
issue.


15.01.2024 11:49, Bertrand Drouvot wrote:

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
(proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts are emitted (to ensure
they are not triggered in the middle of an active slot invalidation test).

I'm not sure it's possible to address Tom's concern and keep the test "predictable".

So, I think I'd vote for Michael's proposal to implement a superuser-settable
developer GUC (as sending a SIGSTOP on the bgwriter (and bypass $windows_os) would
still not address Tom's concern anyway).

Another option would be to "sacrifice" the full predictability of the test (in
favor of real-world behavior testing)?

[1]: 
https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal


So, now we have the test 035 failing due to nondeterministic vacuum
activity in the first place, and due to bgwriter's activity in the second.
Maybe it would be a right move to commit the fix, and then think about
more rare failures.

Though I have a couple of question regarding the fix left, if you don't
mind:
1) The test has minor defects in the comments, that I noted before [1];
would you like to fix them in passing?


BTW, it looks like the comment:
# One way to produce recovery conflict is to create/drop a relation and
# launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
in scenario 3 is a copy-paste error.
Also, there are two "Scenario 4" in this test.



2) Shall we align the 035_standby_logical_decoding with
031_recovery_conflict in regard to improving stability of vacuum?
I see the following options for this:
a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
b) use FREEZE and autovacuum = off in both tests;
c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
 autovacuum = off in both.

I've re-tested the v6 patch today and confirmed that it makes the test
more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
parallel and got failures ('inactiveslot slot invalidation is logged with
vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
(With unlimited CPU, when average test duration is around 70 seconds.)

But with v6 applied, 60 iterations succeeded.

[1] 
https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b...@gmail.com

Best regards,
Alexander




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-15 Thread Alexander Lakhin

Hello Michael and Bertrand,

15.01.2024 06:59, Michael Paquier wrote:

The WAL records related to standby snapshots are playing a lot with
the randomness of the failures we are seeing.  Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable.  That would not be workable for
Windows, but I could live with that knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here, and that would put as limitation to skip the test for
$windows_os.


I've found a way to implement pause/resume for Windows processes and it
looks acceptable to me if we can afford "use Win32::API;" on Windows
(maybe the test could be skipped only if this perl module is absent).
Please look at the PoC patch for the test 035_standby_logical_decoding.
(The patched test passes for me.)

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally I think that having such a functionality for using in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.


While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to disable such WAL records to be
generated within certain areas of the test.  This requires a small
implementation, but nothing really huge, while being portable
everywhere.  And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.


I see that the test in question exists in REL_16_STABLE, it means that a
new GUC would not help there?

Best regards,
Alexander

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..8b08c7b5c7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,8 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use Win32::API;
+
 my ($stdin, $stdout, $stderr,
 	$cascading_stdout, $cascading_stderr, $subscriber_stdin,
 	$subscriber_stdout, $subscriber_stderr, $ret,
@@ -28,6 +30,28 @@ my $res;
 my $primary_slotname = 'primary_physical';
 my $standby_physical_slotname = 'standby_physical';
 
+my $OpenProcess = new Win32::API("kernel32", "HANDLE OpenProcess(DWORD dwDesiredAccess, BOOL bInheritHandle, DWORD dwProcessId )");
+my $NtSuspendProcess = new Win32::API("ntdll", 'LONG NtSuspendProcess(HANDLE hProcess)');
+my $NtResumeProcess = new Win32::API("ntdll", 'LONG NtResumeProcess(HANDLE hProcess)');
+my $CloseHandle = new Win32::API("kernel32", "BOOL CloseHandle(HANDLE hObject)");
+my $PROCESS_ALL_ACCESS = 0x001F0FFF;
+
+sub suspend_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtSuspendProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
+sub resume_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtResumeProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
 # Fetch xmin columns from slot's pg_replication_slots row, after waiting for
 # given boolean condition to be true to ensure we've reached a quiescent state.
 sub wait_for_xmins
@@ -456,6 +480,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+my $bgwriterpid = $node_primary->safe_psql('postgres', "SELECT pid FROM pg_stat_activity WHERE backend_type = 'background writer'");
+suspend_process($bgwriterpid);
+
 ##
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -690,6 +717,7 @@ $node_primary->safe_psql('testdb', qq[INSERT INTO prun VALUES (1, 'A');]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'B';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'C';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(25);]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 
 $node_primary->wait_for_replay_catchup($node_standby);
@@ -709,6 +737,8 @@ check_pg_recvlogical_stderr($handle,
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
+resume_process($bgwriterpid);
+
 ##
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 5: incorrect wal_level on primary.


Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-15 Thread Bertrand Drouvot
Hi,

On Mon, Jan 15, 2024 at 01:11:26PM +0900, Michael Paquier wrote:
> On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:
> > Michael Paquier  writes:
> >> While thinking about that, a second idea came into my mind: a
> >> superuser-settable developer GUC to disable such WAL records to be
> >> generated within certain areas of the test.  This requires a small
> >> implementation, but nothing really huge, while being portable
> >> everywhere.  And it is not the first time I've been annoyed with these
> >> records when wanting a predictable set of WAL records for some test
> >> case.
> > 
> > Hmm ... I see what you are after, but to what extent would this mean
> > that what we are testing is not our real-world behavior?
> 
> Don't think so.  We don't care much about these records when it comes
> to checking slot invalidation scenarios with a predictable XID
> horizon, AFAIK.

Yeah, we want to test slot invalidation behavior, so we need to ensure that such
an invalidation occurs in the first place (which is not the case if we get an
xl_running_xacts record in the middle).

OTOH I also see Tom's point: for example I think we'd not have discovered [1]
(outside from the field) with such a developer GUC in place.

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
(proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts records are emitted (to ensure
they are not triggered in the middle of an active slot invalidation test).

I'm not sure it's possible to address Tom's concern and keep the test 
"predictable".

So, I think I'd vote for Michael's proposal to implement a superuser-settable
developer GUC (as sending a SIGSTOP to the bgwriter (and bypassing the test on
$windows_os) would still not address Tom's concern anyway).

Another option would be to "sacrifice" the full predictability of the test (in
favor of real-world behavior testing)?

[1]: 
https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From a35a308626b6b61c3994531cbf89fe835f4842c2 Mon Sep 17 00:00:00 2001
From: bdrouvot 
Date: Tue, 9 Jan 2024 05:08:35 +
Subject: [PATCH v6] Fix 035_standby_logical_decoding.pl race condition

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.
---
 .../t/035_standby_logical_decoding.pl | 59 ++-
 1 file changed, 30 insertions(+), 29 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..9bfa8833b5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -238,6 +238,25 @@ sub check_for_invalidation
 	) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove
+{
+	my ($vac_option, $sql, $to_vac) = @_;
+
+	# get the current xid horizon
+	my $xid_horizon = $node_primary->safe_psql('testdb', qq[select pg_snapshot_xmin(pg_current_snapshot());]);
+
+	# launch our sql
+	$node_primary->safe_psql('testdb', qq[$sql]);
+
+	$node_primary->poll_query_until('testdb',
+		"SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0")
+	  or die "new snapshot does not have a newer horizon";
+
+	$node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose $to_vac;
+		  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal;]);
+}
 
 # Initialize primary node
 
@@ -248,6 +267,7 @@ $node_primary->append_conf(
 wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
+autovacuum = off
 });
 $node_primary->dump_info;
 $node_primary->start;
@@ -468,13 +488,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE TABLE conflict_test(x integer, y text);
-  DROP TABLE conflict_test;
-  VACUUM full pg_class;
-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
-]);
+wait_until_vacuum_can_remove('full', 'CREATE TABLE conflict_test(x integer, y text);
+ DROP TABLE conflict_test;', 'pg_class');
 
 $node_primary->wait_for_replay_catchup($node_standby);
 
@@ -550,13 +565,8 @@ reactive_slots_change_hfs_and_wait_for_xmins('vacuum_full_', 'row_removal_',
 	0, 1);
 
 # This should trigger the conflict
-$node_primary->safe_psql(
-	'testdb', qq[
-  CREATE 

Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-14 Thread Michael Paquier
On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:
> Michael Paquier  writes:
>> While thinking about that, a second idea came into my mind: a
>> superuser-settable developer GUC to disable such WAL records to be
>> generated within certain areas of the test.  This requires a small
>> implementation, but nothing really huge, while being portable
>> everywhere.  And it is not the first time I've been annoyed with these
>> records when wanting a predictable set of WAL records for some test
>> case.
> 
> Hmm ... I see what you are after, but to what extent would this mean
> that what we are testing is not our real-world behavior?

Don't think so.  We don't care much about these records when it comes
to checking slot invalidation scenarios with a predictable XID
horizon, AFAIK.
--
Michael




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-14 Thread Tom Lane
Michael Paquier  writes:
> While thinking about that, a second idea came into my mind: a
> superuser-settable developer GUC to disable such WAL records to be
> generated within certain areas of the test.  This requires a small
> implementation, but nothing really huge, while being portable
> everywhere.  And it is not the first time I've been annoyed with these
> records when wanting a predictable set of WAL records for some test
> case.

Hmm ... I see what you are after, but to what extent would this mean
that what we are testing is not our real-world behavior?

regards, tom lane




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-14 Thread Michael Paquier
On Fri, Jan 12, 2024 at 01:46:08PM +, Bertrand Drouvot wrote:
> 1) Michael's proposal up-thread (means tweak the test with a retry logic, 
> retrying
> things if such a standby snapshot is found).
> 
> 2) Don't report a test error for active slots in case its catalog_xmin 
> advanced.
> 
> I'd vote for 2) as:
> 
> - this is a corner case and the vast majority of the animals don't report any
> issues (means the active slot conflict detection is already well covered).
> 
> - even on the same animal it should be pretty rare to not have an active slot 
> conflict detection not covered at all (and the "failing" one would be probably
> moving over time).
> 
> - It may be possible that 1) ends up failing (as we'd need to put a limit on 
> the
> retry logic anyhow).
> 
> What do you think?
> 
> And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
> fully correct, so I'll give it another look.

The WAL records related to standby snapshots are playing a lot with
the randomness of the failures we are seeing.  Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable.  That would not be workable for
Windows, but I could live with that knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here, and that would impose the limitation of skipping the test for
$windows_os.

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to disable such WAL records to be
generated within certain areas of the test.  This requires a small
implementation, but nothing really huge, while being portable
everywhere.  And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

Another possibility would be to move these records elsewhere, outside
of the bgwriter, but we need such records at a good frequency for the
availability of read-only standbys.  And surely we'd want an on/off
switch anyway to get a full control for test sequences.
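For illustration only, such a developer GUC could look roughly like the sketch below. Every name here (log_standby_snapshot, the variable, the call-site shape) is hypothetical — this is a sketch of the idea, not an actual or proposed implementation:

```
/* hypothetical boolean GUC, registered under DEVELOPER_OPTIONS
 * (PGC_SUSET, so only superusers can flip it) */
bool		log_standby_snapshot = true;

/* the call sites that emit RUNNING_XACTS (bgwriter, checkpointer)
 * would then be guarded along these lines: */
if (log_standby_snapshot &&
	now >= timeout &&
	last_snapshot_lsn <= GetLastImportantRecPtr())
{
	last_snapshot_lsn = LogStandbySnapshot();
	last_snapshot_ts = now;
}

-- and a test could wrap its sensitive section with:
-- ALTER SYSTEM SET log_standby_snapshot = off; SELECT pg_reload_conf();
-- ... slot invalidation scenario ...
-- ALTER SYSTEM RESET log_standby_snapshot; SELECT pg_reload_conf();
```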
--
Michael




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-12 Thread Bertrand Drouvot
Hi,

On Fri, Jan 12, 2024 at 02:00:01PM +0300, Alexander Lakhin wrote:
> Hi,
> 
> 12.01.2024 10:15, Bertrand Drouvot wrote:
> > 
> > For this one, the "good" news is that it looks like we don’t see the
> > "terminating" message not followed by an "obsolete" message (so the engine
> > behaves correctly) anymore.
> > 
> > There is simply nothing related to the row_removal_activeslot at all (the 
> > catalog_xmin
> > advanced and there is no conflict).
> 
> Yes, judging from all the failures that we see now, it looks like the
> 0001-Fix-race-condition...patch works as expected.

Yeah, thanks for confirming, I'll create a dedicated hackers thread for that 
one.

> > Let's give v5 a try? (please apply 
> > v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
> > too).
> 
> Unfortunately, I've got the failure again (please see logs attached).
> (_primary.log can confirm that I have used exactly v5 — I see no
> txid_current() calls there...)

Okay ;-( Thanks for the testing. Then I can think of:

1) Michael's proposal up-thread (means tweak the test with a retry logic, 
retrying
things if such a standby snapshot is found).

2) Don't report a test error for active slots in case its catalog_xmin advanced.

I'd vote for 2) as:

- this is a corner case and the vast majority of the animals don't report any
issues (means the active slot conflict detection is already well covered).

- even on the same animal it should be pretty rare to not have an active slot 
conflict detection not covered at all (and the "failing" one would be probably
moving over time).

- It may be possible that 1) ends up failing (as we'd need to put a limit on the
retry logic anyhow).

What do you think?

And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
fully correct, so I'll give it another look.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-11 Thread Bertrand Drouvot
Hi,

On Fri, Jan 12, 2024 at 07:01:55AM +0900, Michael Paquier wrote:
> On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> > Bertrand, I've relaunched tests in the same slowed down VM with both
> > patches applied (but with no other modifications) and got a failure
> > with pg_class, similar to what we had seen before:
> > 9   #   Failed test 'activeslot slot invalidation is logged with vacuum 
> > on pg_class'
> > 9   #   at t/035_standby_logical_decoding.pl line 230.
> > 
> > Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> > 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Thanks! 

For this one, the "good" news is that it looks like we don’t see the
"terminating" message not followed by an "obsolete" message (so the engine
behaves correctly) anymore.

There is simply nothing related to the row_removal_activeslot at all (the 
catalog_xmin
advanced and there is no conflict).

And I agree that this is due to the Standby/RUNNING_XACTS that is "advancing" 
the
catalog_xmin of the active slot.

> Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
> = minimal, because these lead to unpredictible records inserted,
> impacting the reliability of the tests.  We cannot do that here,
> obviously.  That may be a long shot, but could it be possible to tweak
> the test with a retry logic, retrying things if such a standby
> snapshot is found because we know that the invalidation is not going
> to work anyway?

I think it all depends on what the xl_running_xacts record contains (i.e., does it
"advance" the catalog_xmin or not in our case).

In our case it does advance it (should it occurs) due to the "select 
txid_current()"
that is done in wait_until_vacuum_can_remove() in 
035_standby_logical_decoding.pl.

I suggest making use of txid_current_snapshot() instead (which does not produce
a Transaction/COMMIT WAL record, as opposed to txid_current()).

I think that it could be "enough" for our case here, and it's what v5 attached 
is
now doing.

Let's give v5 a try? (please apply 
v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
too).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b1c316a85a2dd43a5ab33adbda1ee82020d84bb1 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 11 Jan 2024 13:36:57 +
Subject: [PATCH v5] Fix 035_standby_logical_decoding.pl race conditions

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

Also, get rid of the active slot invalidation check during the on-access pruning
check. This test is racy for active slots and active slots invalidations are well
covered in other tests.
---
 .../t/035_standby_logical_decoding.pl | 114 +-
 1 file changed, 59 insertions(+), 55 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..dd4149c8bc 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -191,9 +191,9 @@ sub check_slots_conflict_reason
 
 # Drop the slots, re-create them, change hot_standby_feedback,
 # check xmin and catalog_xmin values, make slot active and reset stat.
-sub reactive_slots_change_hfs_and_wait_for_xmins
+sub recreate_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
+	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated, $activate) = @_;
 
 	# drop the logical slots
 	drop_logical_slots($previous_slot_prefix);
@@ -203,8 +203,11 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	change_hot_standby_feedback_and_wait_for_xmins($hsf, $invalidated);
 
-	$handle =
-	  make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	if ($activate)
+	{
+		$handle =
+			make_slot_active($node_standby, $slot_prefix, 1, \$stdout, \$stderr);
+	}
 
 	# reset stat: easier to check for confl_active_logicalslot in pg_stat_database_conflicts
 	$node_standby->psql('testdb', q[select pg_stat_reset();]);
@@ -213,7 +216,7 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 # Check invalidation in the logfile and in pg_stat_database_conflicts
 sub check_for_invalidation
 {
-	my ($slot_prefix, $log_start, $test_name) = @_;
+	my ($slot_prefix, $log_start, $test_name, $activated) = @_;
 
 	my $active_slot = $slot_prefix . 'activeslot';
 	my $inactive_slot = $slot_prefix . 'inactiveslot';
@@ -230,14 +233,33 @@ sub check_for_invalidation
 		"activeslot slot invalidation is logged $test_name");
 
 	# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
-	ok( $node_standby->poll_query_until(
-			'postgres',
-			"select (confl_active_logicalslot = 1) from 

Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-11 Thread Michael Paquier
On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> Bertrand, I've relaunched tests in the same slowed down VM with both
> patches applied (but with no other modifications) and got a failure
> with pg_class, similar to what we had seen before:
> 9   #   Failed test 'activeslot slot invalidation is logged with vacuum 
> on pg_class'
> 9   #   at t/035_standby_logical_decoding.pl line 230.
> 
> Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
= minimal, because these lead to unpredictible records inserted,
impacting the reliability of the tests.  We cannot do that here,
obviously.  That may be a long shot, but could it be possible to tweak
the test with a retry logic, retrying things if such a standby
snapshot is found because we know that the invalidation is not going
to work anyway?
--
Michael




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-11 Thread Bertrand Drouvot
Hi,

On Thu, Jan 11, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 10.01.2024 19:32, Bertrand Drouvot wrote:
> > 
> > > is an absent message 'obsolete replication slot "row_removal_activeslot"'
> > Looking at 
> > recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:
> > 
> > Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().
> > 
> > In case of an active slot we first call ReportSlotInvalidation with the 
> > second
> > parameter set to true (to emit the "terminating" message), then SIGTERM the 
> > active
> > process and then (later) we should call the other ReportSlotInvalidation()
> > call with the second parameter set to false (to emit the message that we 
> > don't
> > see here).
> > 
> > So it means InvalidatePossiblyObsoleteSlot() did not trigger the second 
> > ReportSlotInvalidation()
> > call.
> 
> I've found a way to reproduce the issue without slowing down a machine or
> running multiple tests in parallel. It's enough for this to add a delay to
> allow BackgroundWriterMain() to execute LogStandbySnapshot():
> @@ -692,2 +690,3 @@
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
> +$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
> 
> With this delay, I get the failure immediately:
> $ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C 
> src/test/recovery
> # +++ tap check in src/test/recovery +++
> t/035_standby_logical_decoding.pl .. 47/?
> #   Failed test 'activeslot slot invalidation is logged with on-access 
> pruning'
> #   at t/035_standby_logical_decoding.pl line 227.

Thanks a lot for the testing!

So I think we have 2 issues here:

1) The one you're mentioning above related to the on-access pruning test:

I think the engine behavior is right here and that the test is racy. I'm
proposing to bypass the active slot invalidation check for this particular test
(as I don't see any "easy" way to take care of this race condition). The active
slot invalidation is already well covered in the other tests anyway.

I'm proposing the attached 
v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch
for it.

2) The fact that sometimes we're getting a termination message which is not
followed by an obsolete one (as discussed in [1]).

For this one, I think that InvalidatePossiblyObsoleteSlot() is racy:

In case of an active slot we proceed in 2 steps:
 - terminate the backend holding the slot
 - report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that 
time.
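Schematically, the race can be pictured like this (simplified pseudocode, not the actual InvalidatePossiblyObsoleteSlot() source):

```
for (;;)
{
	SpinLockAcquire(&slot->mutex);
	conflict = ...;			/* derived from effective_xmin /
							 * effective_catalog_xmin */
	SpinLockRelease(&slot->mutex);

	if (!conflict)
		break;				/* <-- loop can exit here without the second
								   ReportSlotInvalidation() call */

	/* step 1: emit the "terminating process ..." message */
	ReportSlotInvalidation(..., true);
	kill(active_pid, SIGTERM);

	/*
	 * The mutex is not held at this point, so the xmins can advance
	 * before the next iteration re-evaluates "conflict"; the loop may
	 * then exit without ever reporting the slot as obsolete (step 2).
	 */
}
```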

I'm proposing the attached 
v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
for it.

Would it be possible to re-launch your repro (the slow one, not the pg_sleep()
one) with both patches applied and see how it goes? (Please note that v4
replaces v3 that you're already using in your tests).

If it helps, I'll propose 
v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
into a dedicated hackers thread.

[1]: 
https://www.postgresql.org/message-id/ZZ7GpII4bAYN%2BjT5%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 8347851b3cc42655cfd81914f0c2f5cc1d22e2b8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 11 Jan 2024 13:36:57 +
Subject: [PATCH v4] Fix 035_standby_logical_decoding.pl race conditions

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation on
the standby.

Also, get rid of the active slot invalidation check during the on-access pruning
check. This test is racy for active slots and active slots invalidations are well
covered in other tests.
---
 .../t/035_standby_logical_decoding.pl | 114 +-
 1 file changed, 59 insertions(+), 55 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..7a50187326 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -191,9 +191,9 @@ sub check_slots_conflict_reason
 
 # Drop the slots, re-create them, change hot_standby_feedback,
 # check xmin and catalog_xmin values, make slot active and reset stat.
-sub reactive_slots_change_hfs_and_wait_for_xmins
+sub recreate_slots_change_hfs_and_wait_for_xmins
 {
-	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated) = @_;
+	my ($previous_slot_prefix, $slot_prefix, $hsf, $invalidated, $activate) = @_;
 
 	# drop the logical slots
 	drop_logical_slots($previous_slot_prefix);
@@ -203,8 +203,11 @@ sub reactive_slots_change_hfs_and_wait_for_xmins
 
 	

Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-10 Thread Bertrand Drouvot
Hi,

On Wed, Jan 10, 2024 at 05:00:01PM +0300, Alexander Lakhin wrote:
> 10.01.2024 12:46, Bertrand Drouvot wrote:
> 
> > Would it be possible to also send the standby logs?
> 
> Yes, please look at the attached logs. This time I've build postgres with
> -DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
> wal_keep_size=1GB
> wal_debug = on
> log_autovacuum_min_duration = 0
> log_statement = 'all'
> log_min_messages = INFO
> 
> The archive attached contains logs from four runs:
> recovery-1-ok -- an example of successful run for reference
> recovery-7-pruning and recovery-19-pruning -- failures with a failed
>  subtest 'activeslot slot invalidation is logged with on-access pruning'
> recovery-15-vacuum_pg_class -- a failure with a failed
>  subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

Thanks a lot for the testing!

> is an absent message 'obsolete replication slot "row_removal_activeslot"'

Looking at 
recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:

Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().

In case of an active slot we first call ReportSlotInvalidation with the second
parameter set to true (to emit the "terminating" message), then SIGTERM the 
active
process and then (later) we should call the other ReportSlotInvalidation()
call with the second parameter set to false (to emit the message that we don't
see here).

So it means InvalidatePossiblyObsoleteSlot() did not trigger the second 
ReportSlotInvalidation()
call. 

The thing is that it looks like we exited the loop in
InvalidatePossiblyObsoleteSlot()
because there are more messages from the startup process (789618) after the:


"
2024-01-10 11:00:29.109 UTC [789618] LOG:  terminating process 790377 to 
release replication slot "row_removal_activeslot"
"

one.

Do you think you could try to add more debug messages in 
InvalidatePossiblyObsoleteSlot()
to understand why the second call to ReportSlotInvalidation() is not done and 
IIUC
why/how we exited the loop?

FWIW, I did try to reproduce by launching pg_recvlogical and then kill -SIGSTOP
it. Then producing a conflict, I'm able to get the first message and not the 
second
one (which is expected). But the startup process does not exit the loop, which 
is
expected here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-10 Thread Bertrand Drouvot
Hi,

On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Michael, it definitely increases stability of the test (tens of iterations
> with 20 tests in parallel performed successfully),

Thanks for testing!

> although I've managed to
> see another interesting failure (twice):
> 13  #   Failed test 'activeslot slot invalidation is logged with vacuum 
> on pg_class'
> 13  #   at t/035_standby_logical_decoding.pl line 227.
> 

Looking at the attached log files and particularly 
1/regress_log_035_standby_logical_decoding:

"
[11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with 
vacuum on pg_class
[11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with 
vacuum on pg_class
"

That seems weird, the inactive slot has been invalidated while the active one 
is not.
While it takes a bit longer to invalidate an active slot, I don't think the 
test can
move on until both are invalidated (then leading to tests 24 and 25). I can
see the tests are very slow to run (13.993s for 24) but still don't get how 24 
could
succeed while 25 does not.

Looking at 2/regress_log_035_standby_logical_decoding:

"
[13:41:02.076](20.279s) ok 23 - inactiveslot slot invalidation is logged with 
vacuum on pg_class
[13:41:02.076](0.000s) not ok 24 - activeslot slot invalidation is logged with 
vacuum on pg_class
"

Same "weird" behavior, but this time the test numbering is not the same (23
and 24).
That is even more weird, as those tests should be numbers 24 and 25.

Would it be possible to also send the standby logs?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-09 Thread Michael Paquier
On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Thus just adding FREEZE is not enough, seemingly. It makes me wonder whether
> 0174c2d21 should be superseded by a patch like the one discussed (or just have
> autovacuum = off added)...

Adding an extra FREEZE offers extra insurance, so I don't see why
it would be a problem to add it to stabilize the horizon conflicts on
the standbys.

> 09.01.2024 07:59, Michael Paquier wrote:
> Bertrand, thank you for updating the patch!
> 
> Michael, it definitely increases stability of the test (tens of iterations
> with 20 tests in parallel performed successfully), although I've managed to
> see another interesting failure (twice):
> 13  #   Failed test 'activeslot slot invalidation is logged with vacuum 
> on pg_class'
> 13  #   at t/035_standby_logical_decoding.pl line 227.

Something I'd like to confirm here: you still see this failure with
the patch, but without an extra FREEZE, right?  If we do both, the
test would get more stable, wouldn't it?
--
Michael




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-09 Thread Alexander Lakhin

Hello Michael and Bertrand,

I'd also like to note that even with FREEZE added [1], I happened to see
the test failure:
5   #   Failed test 'inactiveslot slot invalidation is logged with vacuum 
on pg_class'
5   #   at t/035_standby_logical_decoding.pl line 222.
5
5   #   Failed test 'activeslot slot invalidation is logged with vacuum on 
pg_class'
5   #   at t/035_standby_logical_decoding.pl line 227.

where 035_standby_logical_decoding_primary.log contains:
...
2024-01-09 07:44:26.480 UTC [820142] 035_standby_logical_decoding.pl LOG:  
statement: DROP TABLE conflict_test;
2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl LOG:  
statement: VACUUM (VERBOSE, FREEZE) pg_class;
2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl INFO:  aggressively vacuuming 
"testdb.pg_catalog.pg_class"

2024-01-09 07:44:27.099 UTC [820143] DEBUG:  autovacuum: processing database 
"testdb"
2024-01-09 07:44:27.102 UTC [820142] 035_standby_logical_decoding.pl INFO:  finished vacuuming 
"testdb.pg_catalog.pg_class": index scans: 1

    pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
    tuples: 0 removed, 423 remain, 4 are dead but not yet removable
    removable cutoff: 762, which was 2 XIDs old when operation ended
    new relfrozenxid: 762, which is 2 XIDs ahead of previous value
    frozen: 1 pages from table (9.09% of total) had 1 tuples frozen


Thus just adding FREEZE is not enough, seemingly. It makes me wonder whether
0174c2d21 should be superseded by a patch like the one discussed (or just have
autovacuum = off added)...

09.01.2024 07:59, Michael Paquier wrote:

Alexander, does the test gain in stability once you begin using the
patch posted on [2], mentioned by Bertrand?

(Also, perhaps we'd better move the discussion to the other thread
where the patch has been sent.)

[2]: 
https://www.postgresql.org/message-id/d40d015f-03a4-1cf2-6c1f-2b9aca860...@gmail.com


09.01.2024 08:29, Bertrand Drouvot wrote:

Alexander, please find attached v3, which is more or less a rebased version of it.


Bertrand, thank you for updating the patch!

Michael, it definitely increases the stability of the test (tens of iterations
with 20 tests in parallel completed successfully), although I've managed to
see another interesting failure (twice):
13  #   Failed test 'activeslot slot invalidation is logged with vacuum on 
pg_class'
13  #   at t/035_standby_logical_decoding.pl line 227.

psql::1: INFO:  vacuuming "testdb.pg_catalog.pg_class"
psql::1: INFO:  finished vacuuming "testdb.pg_catalog.pg_class": index 
scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 754, which was 0 XIDs old when operation ended
...
Waiting for replication conn standby's replay_lsn to pass 0/403E6F8 on primary

But I see no VACUUM records in WAL:
rmgr: Transaction len (rec/tot):    222/   222, tx:  0, lsn: 0/0403E468, prev 0/0403E370, desc: INVALIDATION ; 
inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 relcache 
2662 relcache 2663 relcache 3455 relcache 1259
rmgr: Standby len (rec/tot):    234/   234, tx:  0, lsn: 0/0403E548, prev 0/0403E468, desc: INVALIDATIONS ; 
relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 
catcache 54 catcache 55 catcache 54 relcache 2662 relcache 2663 relcache 3455 relcache 1259
rmgr: Heap    len (rec/tot): 60/   140, tx:    754, lsn: 0/0403E638, prev 0/0403E548, desc: INSERT off: 2, 
flags: 0x08, blkref #0: rel 1663/16384/16385 blk 0 FPW
rmgr: Transaction len (rec/tot): 46/    46, tx:    754, lsn: 0/0403E6C8, prev 0/0403E638, desc: COMMIT 
2024-01-09 13:40:59.873385 UTC
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/0403E6F8, prev 0/0403E6C8, desc: RUNNING_XACTS 
nextXid 755 latestCompletedXid 754 oldestRunningXid 755

rmgr: XLOG    len (rec/tot): 30/    30, tx:  0, lsn: 0/0403E730, prev 0/0403E6F8, desc: CHECKPOINT_REDO
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/0403E750, prev 0/0403E730, desc: RUNNING_XACTS 
nextXid 755 latestCompletedXid 754 oldestRunningXid 755
rmgr: XLOG    len (rec/tot):    114/   114, tx:  0, lsn: 0/0403E788, prev 0/0403E750, desc: 
CHECKPOINT_ONLINE redo 0/403E730; tli 1; prev tli 1; fpw true; xid 0:755; oid 24576; multi 1; offset 0; oldest xid 728 
in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 755; online
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/0403E800, prev 0/0403E788, desc: RUNNING_XACTS 
nextXid 755 latestCompletedXid 754 oldestRunningXid 755


(Full logs are attached.)
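(The record listing above looks like pg_waldump output; assuming the default
data directory layout and the segment name implied by the LSNs — both
assumptions on my part — an invocation along these lines would produce it:

```shell
# Hypothetical invocation; the data directory and segment name are assumed.
pg_waldump --start 0/0403E468 --end 0/0403E800 \
    "$PGDATA/pg_wal/000000010000000000000004"
```

The --start/--end LSN bounds limit the dump to the records of interest.)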

[1] 
https://www.postgresql.org/message-id/4fd52508-54d7-0202-5bd3-546c2295967f%40gmail.com

Best regards,
Alexander


Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2023-05-31 Thread Drouvot, Bertrand

Hi,

On 5/30/23 12:34 PM, Drouvot, Bertrand wrote:

Hi hackers,

Please find attached a patch proposal to $SUBJECT.

Indeed, we have seen occurrences in [1] where some slots were
not invalidated (while we expected vacuum to remove dead rows,
leading to slot invalidation on the standby).

Though we don't have strong evidence that this
was due to transactions holding back the global xmin (as vacuum did
not run in verbose mode), suspicion is high enough, as Tom pointed
out that the test is broken on its face (see [1]).

The proposed patch:

- sets autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- ensures that vacuum is able to remove dead rows before launching
the slot invalidation tests.
- skips the slot invalidation tests if, after 10 attempts, vacuum is still
not able to remove the dead rows: that should be pretty rare, as currently
the majority of the test runs are green (with only one attempt).

While at it, the patch also addresses the nitpicks mentioned by Robert in [2].

[1]: 
https://www.postgresql.org/message-id/flat/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9%40OSZPR01MB6310.jpnprd01.prod.outlook.com#71898e088d2a57564a1bd9c41f3e6f36
[2]: 
https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com



Please find attached V2, which, instead of the above proposal, waits for a new
snapshot that has a newer horizon before doing the vacuum (as proposed by
Andres in [1]).

So, V2:

- sets autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- waits for a new snapshot that has a newer horizon before doing the vacuum(s).
- addresses the nitpicks mentioned by Robert in [2].

V2 also keeps the verbose mode for the vacuum(s) (as done in V1), as it may help
for further analysis if needed.
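(The "wait for a newer horizon" step boils down to polling a query like the
following on the primary — a sketch of the idea; the actual helper is in the
attached patch:

```sql
-- Remember the xid taken right after creating the dead rows...
SELECT txid_current();   -- say this returns X

-- ...then poll until a fresh snapshot's xmin has moved past it, which
-- guarantees a subsequent VACUUM can treat those rows as removable:
SELECT txid_snapshot_xmin(txid_current_snapshot()) - X > 0;
```

Once that query returns true, no running transaction can still see the dead
rows, so the vacuum is deterministic rather than racy.)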

[1]: 
https://www.postgresql.org/message-id/20230530152426.ensapay7pozh7ozn%40alap3.anarazel.de
[2]: 
https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 383bfcf39257d4542b35ffe2ab56b43182ca2dea Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 30 May 2023 07:54:02 +
Subject: [PATCH v2] Ensure vacuum is able to remove dead rows in
 035_standby_logical_decoding

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation
on the standby.
---
 .../t/035_standby_logical_decoding.pl | 76 +--
 1 file changed, 37 insertions(+), 39 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index 64beec4bd3..bd426f3068 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -185,7 +185,7 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check conflicting status in pg_replication_slots.
 sub check_slots_conflicting_status
 {
-   my ($conflicting) = @_;
+   my ($conflicting, $details) = @_;
 
if ($conflicting)
{
@@ -193,7 +193,7 @@ sub check_slots_conflicting_status
'postgres', qq(
 select bool_and(conflicting) from 
pg_replication_slots;));
 
-   is($res, 't', "Logical slots are reported as conflicting");
+   is($res, 't', "logical slots are reported as conflicting 
$details");
}
else
{
@@ -201,7 +201,7 @@ sub check_slots_conflicting_status
'postgres', qq(
select bool_or(conflicting) from 
pg_replication_slots;));
 
-   is($res, 'f', "Logical slots are reported as non conflicting");
+   is($res, 'f', "logical slots are reported as non conflicting 
$details");
}
 }
 
@@ -256,6 +256,22 @@ sub check_for_invalidation
) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove
+{
+   my ($vac_option, $sql, $to_vac) = @_;
+
+   my $xid = $node_primary->safe_psql('testdb', qq[$sql
+   
select txid_current();]);
+
+   $node_primary->poll_query_until('testdb',
+   "SELECT (select txid_snapshot_xmin(txid_current_snapshot()) - 
$xid) > 0")
+ or die "new snapshot does not have a newer horizon";
+
+   $node_primary->safe_psql('testdb', qq[VACUUM $vac_option verbose 
$to_vac;
+   
  INSERT INTO flush_wal DEFAULT VALUES; -- 

Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2023-05-30 Thread Drouvot, Bertrand

Hi hackers,

Please find attached a patch proposal to $SUBJECT.

Indeed, we have seen occurrences in [1] where some slots were
not invalidated (while we expected vacuum to remove dead rows,
leading to slot invalidation on the standby).

Though we don't have strong evidence that this
was due to transactions holding back the global xmin (as vacuum did
not run in verbose mode), suspicion is high enough, as Tom pointed
out that the test is broken on its face (see [1]).

The proposed patch:

- sets autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- ensures that vacuum is able to remove dead rows before launching
the slot invalidation tests.
- skips the slot invalidation tests if, after 10 attempts, vacuum is still
not able to remove the dead rows: that should be pretty rare, as currently
the majority of the test runs are green (with only one attempt).

While at it, the patch also addresses the nitpicks mentioned by Robert in [2].

[1]: 
https://www.postgresql.org/message-id/flat/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9%40OSZPR01MB6310.jpnprd01.prod.outlook.com#71898e088d2a57564a1bd9c41f3e6f36
[2]: 
https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From a5c6ef80cae1fb7606ff46679422238fdceb5cc8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 30 May 2023 07:54:02 +
Subject: [PATCH v1] Ensure vacuum did remove dead rows in
 035_standby_logical_decoding

We want to ensure that vacuum was able to remove dead rows (aka no other
transactions holding back global xmin) before testing for slots invalidation
on the standby.
---
 .../t/035_standby_logical_decoding.pl | 263 ++
 1 file changed, 150 insertions(+), 113 deletions(-)
 100.0% src/test/recovery/t/

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index 64beec4bd3..d26d9bcace 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -9,6 +9,7 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use Time::HiRes qw(usleep);
 
 my ($stdin, $stdout, $stderr,
$cascading_stdout, $cascading_stderr, $subscriber_stdin,
@@ -23,6 +24,9 @@ my $node_subscriber = 
PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
 my $psql_timeout = IPC::Run::timer($default_timeout);
 my $res;
+my ($vacret, $vacstdout, $vacstderr) = ('', '', '');
+my $nb_vac = 0;
+my $vac_attempts = 10;
 
 # Name for the physical slot on primary
 my $primary_slotname = 'primary_physical';
@@ -185,7 +189,7 @@ sub change_hot_standby_feedback_and_wait_for_xmins
 # Check conflicting status in pg_replication_slots.
 sub check_slots_conflicting_status
 {
-   my ($conflicting) = @_;
+   my ($conflicting, $details) = @_;
 
if ($conflicting)
{
@@ -193,7 +197,7 @@ sub check_slots_conflicting_status
'postgres', qq(
 select bool_and(conflicting) from 
pg_replication_slots;));
 
-   is($res, 't', "Logical slots are reported as conflicting");
+   is($res, 't', "logical slots are reported as conflicting 
$details");
}
else
{
@@ -201,7 +205,7 @@ sub check_slots_conflicting_status
'postgres', qq(
select bool_or(conflicting) from 
pg_replication_slots;));
 
-   is($res, 'f', "Logical slots are reported as non conflicting");
+   is($res, 'f', "logical slots are reported as non conflicting 
$details");
}
 }
 
@@ -256,6 +260,29 @@ sub check_for_invalidation
) or die "Timed out waiting confl_active_logicalslot to be updated";
 }
 
+# Try ($vac_attempts times max) to vacuum until it is able to remove dead rows.
+sub vacuum_until_can_remove
+{
+   my ($vac_option, $sql, $to_vac) = @_;
+
+   ($vacret, $vacstdout, $vacstderr) = ('', '', '');
+   $nb_vac = 0;
+
+   while ($nb_vac <= $vac_attempts)
+   {
+   last if ($vacstderr =~ /0 dead row versions cannot be removed 
yet/ ||
+$vacstderr =~ /0 are dead but not yet 
removable/);
+   # This should trigger the conflict
+   ($vacret, $vacstdout, $vacstderr) = $node_primary->psql(
+   'testdb', qq[
+ $sql
+ VACUUM $vac_option verbose $to_vac;
+ INSERT INTO flush_wal DEFAULT VALUES; -- see create 
table flush_wal
+   ]);
+   $nb_vac++;
+   usleep(100_000);
+   }
+}
 
 # Initialize primary node