Re: BF mamba failure

2024-06-16 Thread Michael Paquier
On Fri, Jun 14, 2024 at 02:31:37PM +0900, Michael Paquier wrote:
> I don't think that this is going to fly far except if we introduce a
> concept of "generation" or "age" in the stats entries.  The idea is
> simple: when a stats entry is reinitialized because of a drop&create,
> increment a counter to tell that this is a new generation, and keep
> track of it in *both* PgStat_EntryRef (local backend reference to the
> shmem stats entry) *and* PgStatShared_HashEntry (the central one).
> When releasing an entry, if we know that the shared entry we are
> pointing at is not of the same generation as the local reference, it
> means that the entry has been reused for something else with the same
> hash key, so give up.  It should not be that invasive, still it means
> ABI breakage in the two pgstats internal structures I am mentioning,
> which is OK for a backpatch as this stuff is internal.  On top of
> that, this problem means that we can silently and randomly lose stats,
> which is not cool.
> 
> I'll try to give it a go on Monday.

Here you go, the patch introduces what I've named an "age" counter
attached to the shared entry references, and copied over to the local
references.  The countner is initialized at 0 and incremented each
time an entry is reused, then when attempting to drop an entry we
cross-check the version hold locally with the shared one.

While looking at the whole, this is close to a concept patch sent
previously, where a counter is used in the shared entry with a
cross-check done with the local reference, that was posted here
(noticed that today):
https://www.postgresql.org/message-id/20230603.063418.767871221863527769.horikyota@gmail.com

The logic is different though, as we don't need to care about the
contents of the local cache when cross-checking the "age" count when
retrieving the contents, just the case where a backend would attempt
to drop an entry it thinks is OK to operate on, that got reused
because of the effect of other backends doing creates and drops with
the same hash key.

This idea needs more eyes, so I am adding that to the next CF for now.
I've played with it for a few hours and concurrent replication slot
drops/creates, without breaking it.  I have not implemented an
isolation test for this case, as it depends on where we are going with
their integration with injection points.
--
Michael
From 3d9a8b6913c0ce61d800d3b31f392230580e99fc Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 17 Jun 2024 13:22:33 +0900
Subject: [PATCH] Introduce "agecount" for shared pgstats

This fixes a set of race conditions with cumulative statistics where a
shared stats entry could be dropped while it should still be valid
when it gets reused.  This can happen for the following case, causing
the same hash key to be used:
- replication slots that compute internally an index number.
- other object types, with a wraparound involved.

The age counter is tracked in the shared entries, initialized at 0 when
an entry is created and incremented when the same entry is reused, to
avoid concurrent turbulences on drop because of other backends still
holding a reference to it.
---
 src/include/utils/pgstat_internal.h   | 14 
 src/backend/utils/activity/pgstat_shmem.c | 40 ---
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index dbbca31602..2ea29b7638 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -93,6 +93,14 @@ typedef struct PgStatShared_HashEntry
 	 */
 	pg_atomic_uint32 refcount;
 
+	/*
+	 * Counter tracking the number of times the entry has been reused.
+	 *
+	 * Set to 0 when the entry is created, and incremented by one each time
+	 * the shared entry is reinitialized with pgstat_reinit_entry().
+	 */
+	pg_atomic_uint32 agecount;
+
 	/*
 	 * Pointer to shared stats. The stats entry always starts with
 	 * PgStatShared_Common, embedded in a larger struct containing the
@@ -132,6 +140,12 @@ typedef struct PgStat_EntryRef
 	 */
 	PgStatShared_Common *shared_stats;
 
+	/*
+	 * Copy of PgStatShared_HashEntry->agecount, keeping locally track of the
+	 * shared stats entry "age" retrieved (number of times reused).
+	 */
+	uint32		agecount;
+
 	/*
 	 * Pending statistics data that will need to be flushed to shared memory
 	 * stats eventually. Each stats kind utilizing pending data defines what
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index fb79c5e771..a9de1985a4 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -270,6 +270,11 @@ pgstat_init_entry(PgStat_Kind kind,
 	 * further if a longer lived reference is needed.
 	 */
 	pg_atomic_init_u32(&shhashent->refcount, 1);
+
+	/*
+	 * Initialize agecount to 0, as freshly initialized.
+	 */
+	pg_atomic_init_u32(&shhashent->agecount, 0);
 	shhashent->dropped = false;
 
 	

Re: BF mamba failure

2024-06-13 Thread Michael Paquier
On Wed, Jun 12, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> With extra logging added, I see the following events happening:
> 1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets
>   ReplicationSlotIndex(slot) = 0 and calls
>   pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0).
> 
> 2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...)
>   and then calls pgstat_gc_entry_refs on shmem_exit() ->
>   pgstat_shutdown_hook() ...;
>   with the sleep added inside pgstat_release_entry_ref, this backend waits
>   after decreasing entry_ref->shared_entry->refcount to 0.
> 
> 3) pgstat_rc_1.stop removes the replication slot.
> 
> 4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets
>   ReplicationSlotIndex(slot) = 0 and calls
>   pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0),
>   which leads to the call pgstat_reinit_entry(), which increases refcount
>   for the same shared_entry as in (1) and (2), and then to the call
>   pgstat_acquire_entry_ref(), which increases refcount once more.
> 
> 5) the backend 2 reaches
> Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0),
>   which fails due to refcount = 2.

Thanks for the details.

So this comes down to the point that we offer no guarantee that the
stats entry a backend sees at shutdown is the same as the one it wants
to clean up.  That's the same problem as what Floris has reported
here, with an OID wraparound and tables to get the same hash key.
That can happen for all stats kinds:
https://www.postgresql.org/message-id/b14ae28029f64757bb64613be2549...@optiver.com

I don't think that this is going to fly far except if we introduce a
concept of "generation" or "age" in the stats entries.  The idea is
simple: when a stats entry is reinitialized because of a drop&create,
increment a counter to tell that this is a new generation, and keep
track of it in *both* PgStat_EntryRef (local backend reference to the
shmem stats entry) *and* PgStatShared_HashEntry (the central one).
When releasing an entry, if we know that the shared entry we are
pointing at is not of the same generation as the local reference, it
means that the entry has been reused for something else with the same
hash key, so give up.  It should not be that invasive, still it means
ABI breakage in the two pgstats internal structures I am mentioning,
which is OK for a backpatch as this stuff is internal.  On top of
that, this problem means that we can silently and randomly lose stats,
which is not cool.

Note that Noah has been working on a better integration of injection
points with isolation tests.  We could reuse that here to have a test
case, with an injection point waiting around the pg_usleep() you have
hardcoded:
https://www.postgresql.org/message-id/20240614003549.c2.nmi...@google.com

I'll try to give it a go on Monday.
--
Michael


signature.asc
Description: PGP signature


Re: BF mamba failure

2024-06-12 Thread Alexander Lakhin

Hello hackers,

20.03.2023 09:10, Peter Smith wrote:


Using this I was also able to reproduce the problem. But test failures
were rare. The make check-world seemed OK, and indeed the
test_decoding tests would also appear to PASS around 14 out of 15
times.


I've stumbled upon this assertion failure again during testing following 
cd312adc5.

This time I've simplified the reproducer to the attached modification.
With this patch applied, `make -s check -C contrib/test_decoding` fails on 
master as below:
ok 1 - pgstat_rc_1    14 ms
not ok 2 - pgstat_rc_2  1351 ms


contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 562, 
PID: 1130928


With extra logging added, I see the following events happening:
1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets
  ReplicationSlotIndex(slot) = 0 and calls
  pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0).

2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...)
  and then calls pgstat_gc_entry_refs on shmem_exit() ->
  pgstat_shutdown_hook() ...;
  with the sleep added inside pgstat_release_entry_ref, this backend waits
  after decreasing entry_ref->shared_entry->refcount to 0.

3) pgstat_rc_1.stop removes the replication slot.

4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets
  ReplicationSlotIndex(slot) = 0 and calls
  pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0),
  which leads to the call pgstat_reinit_entry(), which increases refcount
  for the same shared_entry as in (1) and (2), and then to the call
  pgstat_acquire_entry_ref(), which increases refcount once more.

5) the backend 2 reaches
Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0),
  which fails due to refcount = 2.

Best regards,
Alexanderdiff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c7ce603706..660eea87eb 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -6,10 +6,12 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate stream stats twophase twophase_stream
-ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
+XISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot
 
+ISOLATION = pgstat_rc_1 pgstat_rc_2
+
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 
diff --git a/contrib/test_decoding/expected/pgstat_rc_1.out b/contrib/test_decoding/expected/pgstat_rc_1.out
new file mode 100644
index 00..850adf9930
--- /dev/null
+++ b/contrib/test_decoding/expected/pgstat_rc_1.out
@@ -0,0 +1,18 @@
+Parsed test spec with 1 sessions
+
+starting permutation: s0_get_changes
+?column?
+
+init
+(1 row)
+
+step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data
+
+(0 rows)
+
+?column?
+
+stop
+(1 row)
+
diff --git a/contrib/test_decoding/expected/pgstat_rc_2.out b/contrib/test_decoding/expected/pgstat_rc_2.out
new file mode 100644
index 00..f8d4ce83a6
--- /dev/null
+++ b/contrib/test_decoding/expected/pgstat_rc_2.out
@@ -0,0 +1,19 @@
+Parsed test spec with 1 sessions
+
+starting permutation: s0_sleep
+?column?
+
+init
+(1 row)
+
+step s0_sleep: SELECT pg_sleep(1);
+pg_sleep
+
+
+(1 row)
+
+?column?
+
+stop
+(1 row)
+
diff --git a/contrib/test_decoding/specs/pgstat_rc_1.spec b/contrib/test_decoding/specs/pgstat_rc_1.spec
new file mode 100644
index 00..7b08704a55
--- /dev/null
+++ b/contrib/test_decoding/specs/pgstat_rc_1.spec
@@ -0,0 +1,14 @@
+setup
+{
+SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact
+}
+
+teardown
+{
+SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s0"
+step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+permutation "s0_get_changes"
diff --git a/contrib/test_decoding/specs/pgstat_rc_2.spec b/contrib/test_decoding/specs/pgstat_rc_2.spec
new file mode 100644
index 00..4e454ae8e8
--- /dev/null
+++ b/contrib/test_decoding/specs/pgstat_rc_2.spec
@@ -0,0 +1,14 @@
+setup
+{
+SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact
+}
+
+teardown
+{
+	SELECT 

Re: BF mamba failure

2023-03-19 Thread Peter Smith
On Sun, Mar 19, 2023 at 2:00 AM Alexander Lakhin  wrote:
>
> Hi,
>
> 18.03.2023 07:26, Tom Lane wrote:
>
> Amit Kapila  writes:
>
> Peter Smith has recently reported a BF failure [1]. AFAICS, the call
> stack of failure [2] is as follows:
>
> Note the assertion report a few lines further up:
>
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) 
> == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004
>
>
> This assertion failure can be reproduced easily with the attached patch:
> == running regression test queries==
> test oldest_xmin  ... ok   55 ms
> test oldest_xmin  ... FAILED (test process exited with exit 
> code 1)  107 ms
> test oldest_xmin  ... FAILED (test process exited with exit 
> code 1)8 ms
> == shutting down postmaster   ==
>
> contrib/test_decoding/output_iso/log/postmaster.log contains:
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) 
> == 0"), File: "pgstat_shmem.c", Line: 561, PID: 456844
>
> With the sleep placed above Assert(entry_ref->shared_entry->dropped) this 
> Assert fails too.
>
> Best regards,
> Alexander

I used a slightly modified* patch of Alexander's [1] applied to the
latest HEAD code (but with my "toptxn" patch reverted).
--- the patch was modified in that I injected 'sleep' both above and
below the Assert(entry_ref->shared_entry->dropped).

Using this I was also able to reproduce the problem. But test failures
were rare. The make check-world seemed OK, and indeed the
test_decoding tests would also appear to PASS around 14 out of 15
times.

== running regression test queries==
test oldest_xmin  ... ok  342 ms
test oldest_xmin  ... ok  121 ms
test oldest_xmin  ... ok  283 ms
== shutting down postmaster   ==
== removing temporary instance==

=
 All 3 tests passed.
=

~~

Often (but not always) depite the test_decoding reported PASS all 3
tests as "ok", I still observed there was a TRAP in the logfile
(contrib/test_decoding/output_iso/log/postmaster.log).
TRAP: failed Assert("entry_ref->shared_entry->dropped")

~~

Occasionally (about 1 in 15 test runs) the test would fail the same
way as described by Alexander [1], with the accompanying TRAP.
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount)
== 0"), File: "pgstat_shmem.c", Line: 562, PID: 32013

== running regression test queries==
test oldest_xmin  ... ok  331 ms
test oldest_xmin  ... ok   91 ms
test oldest_xmin  ... FAILED  702 ms
== shutting down postmaster   ==

==
 1 of 3 tests failed.
==


~~


FWIW, the "toptxn" patch. whose push coincided with the build-farm
error I first reported [2], turns out to be an innocent party in this
TRAP. We know this because all of the above results were running using
HEAD code but with that "toptxn" patch reverted.

--
[1] 
https://www.postgresql.org/message-id/1941b7e2-be7c-9c4c-8505-c0fd05910e9a%40gmail.com
[2] 
https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: BF mamba failure

2023-03-18 Thread Alexander Lakhin

Hi,

18.03.2023 07:26, Tom Lane wrote:

Amit Kapila  writes:

Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:

Note the assertion report a few lines further up:

TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: 
"pgstat_shmem.c", Line: 560, PID: 25004


This assertion failure can be reproduced easily with the attached patch:
== running regression test queries    ==
test oldest_xmin  ... ok   55 ms
test oldest_xmin  ... FAILED (test process exited with exit 
code 1)  107 ms
test oldest_xmin  ... FAILED (test process exited with exit 
code 1)    8 ms
== shutting down postmaster   ==

contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 561, 
PID: 456844


With the sleep placed above Assert(entry_ref->shared_entry->dropped) this 
Assert fails too.

Best regards,
Alexanderdiff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index c7ce603706..9b3389a90e 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -9,6 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot slot_creation_error catalog_change_snapshot
+ISOLATION = oldest_xmin oldest_xmin oldest_xmin
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 09fffd0e82..5307116670 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -551,6 +551,7 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 			/* only dropped entries can reach a 0 refcount */
 			Assert(entry_ref->shared_entry->dropped);
 
+pg_usleep(10L);
 			shent = dshash_find(pgStatLocal.shared_hash,
 &entry_ref->shared_entry->key,
 true);


Re: BF mamba failure

2023-03-17 Thread Tom Lane
Amit Kapila  writes:
> Peter Smith has recently reported a BF failure [1]. AFAICS, the call
> stack of failure [2] is as follows:

Note the assertion report a few lines further up:

TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 
0"), File: "pgstat_shmem.c", Line: 560, PID: 25004

regards, tom lane




BF mamba failure

2023-03-17 Thread Amit Kapila
Hi,

Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:

0x1e66644  at postgres
0x1d0143c  at postgres
0x1d02534  at postgres
0x1cfb120  at postgres
0x1cfd590  at postgres
0x1cfbfc0  at postgres
0x1ca7b08  at postgres
0x1ca7c74  at postgres
0x1ca7d2c  at postgres
0x1cdf060  at postgres
0x1c203f4  at postgres
0x1c2161c  at postgres
0x1edcf94  at postgres

I couldn't correlate it to the recent commits. Any thoughts?

[1] - 
https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com
[2] -
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-03-17%2005%3A36%3A10

-- 
With Regards,
Amit Kapila.