Re: Spurious pgstat_drop_replslot() call

2024-03-12 Thread Michael Paquier
On Tue, Mar 12, 2024 at 12:00:00PM +0300, Alexander Lakhin wrote:
> Sorry for a bit off-topic, but I've remembered an anomaly with a similar
> assert:
> https://www.postgresql.org/message-id/17947-b9554521ad963c9c%40postgresql.org

Thanks for the reminder.  The invalidation path with the stats drop is
only in 16~.

> Maybe you would find it worth considering while working in this area...
> (I've just run that reproducer on b36fbd9f8 and confirmed that the
> assertion failure is still here.)

Indeed, something needs to happen.  I am not surprised that it still
reproduces; nothing has changed with the locking of the stats entries.
:/
--
Michael


signature.asc
Description: PGP signature


Re: Spurious pgstat_drop_replslot() call

2024-03-12 Thread Alexander Lakhin

Hello Bertrand and Michael,

12.03.2024 09:17, Bertrand Drouvot wrote:



May I
suggest the attached additions with LWLockHeldByMeInMode to make sure
that the stats are dropped and created while we hold
ReplicationSlotAllocationLock?

Yeah, makes fully sense and looks good to me.


Sorry for a bit off-topic, but I've remembered an anomaly with a similar
assert:
https://www.postgresql.org/message-id/17947-b9554521ad963c9c%40postgresql.org

Maybe you would find it worth considering while working in this area...
(I've just run that reproducer on b36fbd9f8 and confirmed that the
assertion failure is still here.)

Best regards,
Alexander




Re: Spurious pgstat_drop_replslot() call

2024-03-12 Thread Bertrand Drouvot
Hi,

On Tue, Mar 12, 2024 at 02:36:58PM +0900, Michael Paquier wrote:
> On Mon, Mar 11, 2024 at 04:15:40PM +0900, Michael Paquier wrote:
> > That's a slight change in behavior, unfortunately, and it cannot be
> > called a bug as this improves the visibility of the stats after an
> > invalidation, so this is not something that can be backpatched.
> 
> I've looked again at that and that was OK on a second look.

Thanks!

> May I
> suggest the attached additions with LWLockHeldByMeInMode to make sure
> that the stats are dropped and created while we hold
> ReplicationSlotAllocationLock?

Yeah, makes fully sense and looks good to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Spurious pgstat_drop_replslot() call

2024-03-11 Thread Michael Paquier
On Mon, Mar 11, 2024 at 04:15:40PM +0900, Michael Paquier wrote:
> That's a slight change in behavior, unfortunately, and it cannot be
> called a bug as this improves the visibility of the stats after an
> invalidation, so this is not something that can be backpatched.

I've looked again at that and that was OK on a second look.  May I
suggest the attached additions with LWLockHeldByMeInMode to make sure
that the stats are dropped and created while we hold
ReplicationSlotAllocationLock?
--
Michael
diff --git a/src/backend/utils/activity/pgstat_replslot.c b/src/backend/utils/activity/pgstat_replslot.c
index c61bad1627..889e86ac5a 100644
--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -113,6 +113,8 @@ pgstat_create_replslot(ReplicationSlot *slot)
 	PgStat_EntryRef *entry_ref;
 	PgStatShared_ReplSlot *shstatent;
 
+	Assert(LWLockHeldByMeInMode(ReplicationSlotAllocationLock, LW_EXCLUSIVE));
+
 	entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid,
 			ReplicationSlotIndex(slot), false);
 	shstatent = (PgStatShared_ReplSlot *) entry_ref->shared_stats;
@@ -153,6 +155,8 @@ pgstat_acquire_replslot(ReplicationSlot *slot)
 void
 pgstat_drop_replslot(ReplicationSlot *slot)
 {
+	Assert(LWLockHeldByMeInMode(ReplicationSlotAllocationLock, LW_EXCLUSIVE));
+
 	pgstat_drop_entry(PGSTAT_KIND_REPLSLOT, InvalidOid,
 	  ReplicationSlotIndex(slot));
 }


signature.asc
Description: PGP signature


Re: Spurious pgstat_drop_replslot() call

2024-03-11 Thread Bertrand Drouvot
Hi,

On Mon, Mar 11, 2024 at 04:15:40PM +0900, Michael Paquier wrote:
> That's a slight change in behavior, unfortunately, and it cannot be
> called a bug as this improves the visibility of the stats after an
> invalidation, so this is not something that can be backpatched.

Yeah, makes sense to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Spurious pgstat_drop_replslot() call

2024-03-11 Thread Michael Paquier
On Fri, Mar 08, 2024 at 03:04:10PM +, Bertrand Drouvot wrote:
> The switch in the patch from "drop" to "invalidation" is in [1], see:
> 
> "
> Given the precedent of max_slot_wal_keep_size, I think it's wrong to
just drop
> the logical slots. Instead we should just mark them as
invalid, 
> like InvalidateObsoleteReplicationSlots().
> 
> Makes fully sense and done that way in the attached patch.
> “
> 
> But yeah, hard to be sure why this call is there, at least I don't remember...

Yep, noticed that on Friday.

> We can not be 100% sure that the stats are up to date when the process holding
> the active slot is killed. 
> 
> So v2 attached adds a test where we ensure first that we have non empty stats
> before triggering the invalidation.

Ah, that explains the extra poll.  What you have done here makes sense
to me, and the new test fails immediately if I add back the stats drop
in the invalidation code path.

That's a slight change in behavior, unfortunately, and it cannot be
called a bug as this improves the visibility of the stats after an
invalidation, so this is not something that can be backpatched.
--
Michael


signature.asc
Description: PGP signature


Re: Spurious pgstat_drop_replslot() call

2024-03-08 Thread Bertrand Drouvot
Hi,

On Fri, Mar 08, 2024 at 07:55:39PM +0900, Michael Paquier wrote:
> On Fri, Mar 08, 2024 at 10:19:11AM +, Bertrand Drouvot wrote:
> > Indeed, it does not seem appropriate to remove stats during slot 
> > invalidation as
> > one could still be interested to look at them.
> 
> Yeah, my take is that this can still be interesting to know, at least
> for debugging.  This would limit the stats to be dropped when the slot
> is dropped, and that looks like a sound idea.

Thanks for looking at it!

> > This spurious call has been introduced in be87200efd. I think that 
> > initially we
> > designed to drop slots when a recovery conflict occurred during logical 
> > decoding
> > on a standby. But we changed our mind to invalidate such a slot instead.
> > 
> > The spurious call is probably due to the initial design but has not been 
> > removed.
> 
> This is not a subject that has really been touched on the original
> thread mentioned on the commit, so it is a bit hard to be sure.  The
> only comment is that a previous version of the patch did the stats
> drop in the slot invalidation path at an incorrect location.

The switch in the patch from "drop" to "invalidation" is in [1], see:

"
Given the precedent of max_slot_wal_keep_size, I think it's wrong to
just drop
the logical slots. Instead we should just mark them as
invalid, 
like InvalidateObsoleteReplicationSlots().

Makes fully sense and done that way in the attached patch.
“

But yeah, hard to be sure why this call is there, at least I don't remember...

> > I don't think it's worth to add a test but can do if one feel the need.
> 
> Well, that would not be complicated while being cheap, no?  We have a
> few paths in the 035 test where we know that a slot has been
> invalidated so it is just a matter of querying once
> pg_stat_replication_slot and see if some data is still there.

We can not be 100% sure that the stats are up to date when the process holding
the active slot is killed. 

So v2 attached adds a test where we ensure first that we have non empty stats
before triggering the invalidation.

[1]: 
https://www.postgresql.org/message-id/26c6f320-98f0-253c-f8b5-df1e7c1f6315%40amazon.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From b97a42edd66ccb7f8f4d5d2fa24df0b02b6f4f68 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 8 Mar 2024 09:29:29 +
Subject: [PATCH v2] Remove spurious pgstat_drop_replslot() call

There is no need to remove stats for an invalidated slot as one could still
be interested to look at them.
---
 src/backend/replication/slot.c |  1 -
 .../recovery/t/035_standby_logical_decoding.pl | 18 ++
 2 files changed, 18 insertions(+), 1 deletion(-)
   3.5% src/backend/replication/
  96.4% src/test/recovery/t/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b8bf98b182..91ca397857 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1726,7 +1726,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			ReplicationSlotMarkDirty();
 			ReplicationSlotSave();
 			ReplicationSlotRelease();
-			pgstat_drop_replslot(s);
 
 			ReportSlotInvalidation(conflict, false, active_pid,
    slotname, restart_lsn,
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 0710bccc17..855f37016e 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -494,6 +494,8 @@ $node_subscriber->stop;
 ##
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
+# In passing, ensure that replication slot stats are not removed when the active
+# slot is invalidated.
 ##
 
 # One way to produce recovery conflict is to create/drop a relation and
@@ -502,6 +504,15 @@ $node_subscriber->stop;
 reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 	0, 1);
 
+# Ensure replication slot stats are not empty before triggering the conflict
+$node_primary->safe_psql('testdb',
+	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]
+);
+
+$node_standby->poll_query_until('testdb',
+	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
+) or die "slot stat total_txns not > 0 for vacuum_full_activeslot";
+
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -515,6 +526,13 @@ check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 # Verify conflict_reason is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
+# Ensure replication slot stats 

Re: Spurious pgstat_drop_replslot() call

2024-03-08 Thread Michael Paquier
On Fri, Mar 08, 2024 at 10:19:11AM +, Bertrand Drouvot wrote:
> Indeed, it does not seem appropriate to remove stats during slot invalidation 
> as
> one could still be interested to look at them.

Yeah, my take is that this can still be interesting to know, at least
for debugging.  This would limit the stats to be dropped when the slot
is dropped, and that looks like a sound idea.

> This spurious call has been introduced in be87200efd. I think that initially 
> we
> designed to drop slots when a recovery conflict occurred during logical 
> decoding
> on a standby. But we changed our mind to invalidate such a slot instead.
> 
> The spurious call is probably due to the initial design but has not been 
> removed.

This is not a subject that has really been touched on the original
thread mentioned on the commit, so it is a bit hard to be sure.  The
only comment is that a previous version of the patch did the stats
drop in the slot invalidation path at an incorrect location.

> I don't think it's worth to add a test but can do if one feel the need.

Well, that would not be complicated while being cheap, no?  We have a
few paths in the 035 test where we know that a slot has been
invalidated so it is just a matter of querying once
pg_stat_replication_slot and see if some data is still there.
--
Michael


signature.asc
Description: PGP signature