Re: BF mamba failure
On Fri, Jun 14, 2024 at 02:31:37PM +0900, Michael Paquier wrote: > I don't think that this is going to fly far except if we introduce a > concept of "generation" or "age" in the stats entries. The idea is > simple: when a stats entry is reinitialized because of a drop&create, > increment a counter to tell that this is a new generation, and keep > track of it in *both* PgStat_EntryRef (local backend reference to the > shmem stats entry) *and* PgStatShared_HashEntry (the central one). > When releasing an entry, if we know that the shared entry we are > pointing at is not of the same generation as the local reference, it > means that the entry has been reused for something else with the same > hash key, so give up. It should not be that invasive, still it means > ABI breakage in the two pgstats internal structures I am mentioning, > which is OK for a backpatch as this stuff is internal. On top of > that, this problem means that we can silently and randomly lose stats, > which is not cool. > > I'll try to give it a go on Monday. Here you go, the patch introduces what I've named an "age" counter attached to the shared entry references, and copied over to the local references. The countner is initialized at 0 and incremented each time an entry is reused, then when attempting to drop an entry we cross-check the version hold locally with the shared one. While looking at the whole, this is close to a concept patch sent previously, where a counter is used in the shared entry with a cross-check done with the local reference, that was posted here (noticed that today): https://www.postgresql.org/message-id/20230603.063418.767871221863527769.horikyota@gmail.com The logic is different though, as we don't need to care about the contents of the local cache when cross-checking the "age" count when retrieving the contents, just the case where a backend would attempt to drop an entry it thinks is OK to operate on, that got reused because of the effect of other backends doing creates and drops with the same hash key. This idea needs more eyes, so I am adding that to the next CF for now. I've played with it for a few hours and concurrent replication slot drops/creates, without breaking it. I have not implemented an isolation test for this case, as it depends on where we are going with their integration with injection points. -- Michael From 3d9a8b6913c0ce61d800d3b31f392230580e99fc Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 17 Jun 2024 13:22:33 +0900 Subject: [PATCH] Introduce "agecount" for shared pgstats This fixes a set of race conditions with cumulative statistics where a shared stats entry could be dropped while it should still be valid when it gets reused. This can happen for the following case, causing the same hash key to be used: - replication slots that compute internally an index number. - other object types, with a wraparound involved. The age counter is tracked in the shared entries, initialized at 0 when an entry is created and incremented when the same entry is reused, to avoid concurrent turbulences on drop because of other backends still holding a reference to it. --- src/include/utils/pgstat_internal.h | 14 src/backend/utils/activity/pgstat_shmem.c | 40 --- 2 files changed, 49 insertions(+), 5 deletions(-) diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h index dbbca31602..2ea29b7638 100644 --- a/src/include/utils/pgstat_internal.h +++ b/src/include/utils/pgstat_internal.h @@ -93,6 +93,14 @@ typedef struct PgStatShared_HashEntry */ pg_atomic_uint32 refcount; + /* + * Counter tracking the number of times the entry has been reused. + * + * Set to 0 when the entry is created, and incremented by one each time + * the shared entry is reinitialized with pgstat_reinit_entry(). + */ + pg_atomic_uint32 agecount; + /* * Pointer to shared stats. The stats entry always starts with * PgStatShared_Common, embedded in a larger struct containing the @@ -132,6 +140,12 @@ typedef struct PgStat_EntryRef */ PgStatShared_Common *shared_stats; + /* + * Copy of PgStatShared_HashEntry->agecount, keeping locally track of the + * shared stats entry "age" retrieved (number of times reused). + */ + uint32 agecount; + /* * Pending statistics data that will need to be flushed to shared memory * stats eventually. Each stats kind utilizing pending data defines what diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c index fb79c5e771..a9de1985a4 100644 --- a/src/backend/utils/activity/pgstat_shmem.c +++ b/src/backend/utils/activity/pgstat_shmem.c @@ -270,6 +270,11 @@ pgstat_init_entry(PgStat_Kind kind, * further if a longer lived reference is needed. */ pg_atomic_init_u32(&shhashent->refcount, 1); + + /* + * Initialize agecount to 0, as freshly initialized. + */ + pg_atomic_init_u32(&shhashent->agecount, 0); shhashent->dropped = false;
Re: BF mamba failure
On Wed, Jun 12, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote: > With extra logging added, I see the following events happening: > 1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets > ReplicationSlotIndex(slot) = 0 and calls > pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0). > > 2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...) > and then calls pgstat_gc_entry_refs on shmem_exit() -> > pgstat_shutdown_hook() ...; > with the sleep added inside pgstat_release_entry_ref, this backend waits > after decreasing entry_ref->shared_entry->refcount to 0. > > 3) pgstat_rc_1.stop removes the replication slot. > > 4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets > ReplicationSlotIndex(slot) = 0 and calls > pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0), > which leads to the call pgstat_reinit_entry(), which increases refcount > for the same shared_entry as in (1) and (2), and then to the call > pgstat_acquire_entry_ref(), which increases refcount once more. > > 5) the backend 2 reaches > Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0), > which fails due to refcount = 2. Thanks for the details. So this comes down to the point that we offer no guarantee that the stats entry a backend sees at shutdown is the same as the one it wants to clean up. That's the same problem as what Floris has reported here, with an OID wraparound and tables to get the same hash key. That can happen for all stats kinds: https://www.postgresql.org/message-id/b14ae28029f64757bb64613be2549...@optiver.com I don't think that this is going to fly far except if we introduce a concept of "generation" or "age" in the stats entries. The idea is simple: when a stats entry is reinitialized because of a drop&create, increment a counter to tell that this is a new generation, and keep track of it in *both* PgStat_EntryRef (local backend reference to the shmem stats entry) *and* PgStatShared_HashEntry (the central one). When releasing an entry, if we know that the shared entry we are pointing at is not of the same generation as the local reference, it means that the entry has been reused for something else with the same hash key, so give up. It should not be that invasive, still it means ABI breakage in the two pgstats internal structures I am mentioning, which is OK for a backpatch as this stuff is internal. On top of that, this problem means that we can silently and randomly lose stats, which is not cool. Note that Noah has been working on a better integration of injection points with isolation tests. We could reuse that here to have a test case, with an injection point waiting around the pg_usleep() you have hardcoded: https://www.postgresql.org/message-id/20240614003549.c2.nmi...@google.com I'll try to give it a go on Monday. -- Michael signature.asc Description: PGP signature
Re: BF mamba failure
Hello hackers, 20.03.2023 09:10, Peter Smith wrote: Using this I was also able to reproduce the problem. But test failures were rare. The make check-world seemed OK, and indeed the test_decoding tests would also appear to PASS around 14 out of 15 times. I've stumbled upon this assertion failure again during testing following cd312adc5. This time I've simplified the reproducer to the attached modification. With this patch applied, `make -s check -C contrib/test_decoding` fails on master as below: ok 1 - pgstat_rc_1 14 ms not ok 2 - pgstat_rc_2 1351 ms contrib/test_decoding/output_iso/log/postmaster.log contains: TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 562, PID: 1130928 With extra logging added, I see the following events happening: 1) pgstat_rc_1.setup calls pgstat_create_replslot(), gets ReplicationSlotIndex(slot) = 0 and calls pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0). 2) pgstat_rc_1.s0_get_changes executes pg_logical_slot_get_changes(...) and then calls pgstat_gc_entry_refs on shmem_exit() -> pgstat_shutdown_hook() ...; with the sleep added inside pgstat_release_entry_ref, this backend waits after decreasing entry_ref->shared_entry->refcount to 0. 3) pgstat_rc_1.stop removes the replication slot. 4) pgstat_rc_2.setup calls pgstat_create_replslot(), gets ReplicationSlotIndex(slot) = 0 and calls pgstat_get_entry_ref_locked(PGSTAT_KIND_REPLSLOT, InvalidOid, 0, 0), which leads to the call pgstat_reinit_entry(), which increases refcount for the same shared_entry as in (1) and (2), and then to the call pgstat_acquire_entry_ref(), which increases refcount once more. 5) the backend 2 reaches Assert(pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0), which fails due to refcount = 2. Best regards, Alexanderdiff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index c7ce603706..660eea87eb 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -6,10 +6,12 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin" REGRESS = ddl xact rewrite toast permissions decoding_in_xact \ decoding_into_rel binary prepared replorigin time messages \ spill slot truncate stream stats twophase twophase_stream -ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \ +XISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \ oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \ twophase_snapshot slot_creation_error catalog_change_snapshot +ISOLATION = pgstat_rc_1 pgstat_rc_2 + REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf diff --git a/contrib/test_decoding/expected/pgstat_rc_1.out b/contrib/test_decoding/expected/pgstat_rc_1.out new file mode 100644 index 00..850adf9930 --- /dev/null +++ b/contrib/test_decoding/expected/pgstat_rc_1.out @@ -0,0 +1,18 @@ +Parsed test spec with 1 sessions + +starting permutation: s0_get_changes +?column? + +init +(1 row) + +step s0_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); +data + +(0 rows) + +?column? + +stop +(1 row) + diff --git a/contrib/test_decoding/expected/pgstat_rc_2.out b/contrib/test_decoding/expected/pgstat_rc_2.out new file mode 100644 index 00..f8d4ce83a6 --- /dev/null +++ b/contrib/test_decoding/expected/pgstat_rc_2.out @@ -0,0 +1,19 @@ +Parsed test spec with 1 sessions + +starting permutation: s0_sleep +?column? + +init +(1 row) + +step s0_sleep: SELECT pg_sleep(1); +pg_sleep + + +(1 row) + +?column? + +stop +(1 row) + diff --git a/contrib/test_decoding/specs/pgstat_rc_1.spec b/contrib/test_decoding/specs/pgstat_rc_1.spec new file mode 100644 index 00..7b08704a55 --- /dev/null +++ b/contrib/test_decoding/specs/pgstat_rc_1.spec @@ -0,0 +1,14 @@ +setup +{ +SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact +} + +teardown +{ +SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot'); +} + +session "s0" +step "s0_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); } + +permutation "s0_get_changes" diff --git a/contrib/test_decoding/specs/pgstat_rc_2.spec b/contrib/test_decoding/specs/pgstat_rc_2.spec new file mode 100644 index 00..4e454ae8e8 --- /dev/null +++ b/contrib/test_decoding/specs/pgstat_rc_2.spec @@ -0,0 +1,14 @@ +setup +{ +SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); -- must be first write in xact +} + +teardown +{ + SELECT
Re: BF mamba failure
On Sun, Mar 19, 2023 at 2:00 AM Alexander Lakhin wrote: > > Hi, > > 18.03.2023 07:26, Tom Lane wrote: > > Amit Kapila writes: > > Peter Smith has recently reported a BF failure [1]. AFAICS, the call > stack of failure [2] is as follows: > > Note the assertion report a few lines further up: > > TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) > == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004 > > > This assertion failure can be reproduced easily with the attached patch: > == running regression test queries== > test oldest_xmin ... ok 55 ms > test oldest_xmin ... FAILED (test process exited with exit > code 1) 107 ms > test oldest_xmin ... FAILED (test process exited with exit > code 1)8 ms > == shutting down postmaster == > > contrib/test_decoding/output_iso/log/postmaster.log contains: > TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) > == 0"), File: "pgstat_shmem.c", Line: 561, PID: 456844 > > With the sleep placed above Assert(entry_ref->shared_entry->dropped) this > Assert fails too. > > Best regards, > Alexander I used a slightly modified* patch of Alexander's [1] applied to the latest HEAD code (but with my "toptxn" patch reverted). --- the patch was modified in that I injected 'sleep' both above and below the Assert(entry_ref->shared_entry->dropped). Using this I was also able to reproduce the problem. But test failures were rare. The make check-world seemed OK, and indeed the test_decoding tests would also appear to PASS around 14 out of 15 times. == running regression test queries== test oldest_xmin ... ok 342 ms test oldest_xmin ... ok 121 ms test oldest_xmin ... ok 283 ms == shutting down postmaster == == removing temporary instance== = All 3 tests passed. = ~~ Often (but not always) depite the test_decoding reported PASS all 3 tests as "ok", I still observed there was a TRAP in the logfile (contrib/test_decoding/output_iso/log/postmaster.log). TRAP: failed Assert("entry_ref->shared_entry->dropped") ~~ Occasionally (about 1 in 15 test runs) the test would fail the same way as described by Alexander [1], with the accompanying TRAP. TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 562, PID: 32013 == running regression test queries== test oldest_xmin ... ok 331 ms test oldest_xmin ... ok 91 ms test oldest_xmin ... FAILED 702 ms == shutting down postmaster == == 1 of 3 tests failed. == ~~ FWIW, the "toptxn" patch. whose push coincided with the build-farm error I first reported [2], turns out to be an innocent party in this TRAP. We know this because all of the above results were running using HEAD code but with that "toptxn" patch reverted. -- [1] https://www.postgresql.org/message-id/1941b7e2-be7c-9c4c-8505-c0fd05910e9a%40gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: BF mamba failure
Hi, 18.03.2023 07:26, Tom Lane wrote: Amit Kapila writes: Peter Smith has recently reported a BF failure [1]. AFAICS, the call stack of failure [2] is as follows: Note the assertion report a few lines further up: TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004 This assertion failure can be reproduced easily with the attached patch: == running regression test queries == test oldest_xmin ... ok 55 ms test oldest_xmin ... FAILED (test process exited with exit code 1) 107 ms test oldest_xmin ... FAILED (test process exited with exit code 1) 8 ms == shutting down postmaster == contrib/test_decoding/output_iso/log/postmaster.log contains: TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 561, PID: 456844 With the sleep placed above Assert(entry_ref->shared_entry->dropped) this Assert fails too. Best regards, Alexanderdiff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index c7ce603706..9b3389a90e 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -9,6 +9,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \ ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \ oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \ twophase_snapshot slot_creation_error catalog_change_snapshot +ISOLATION = oldest_xmin oldest_xmin oldest_xmin REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c index 09fffd0e82..5307116670 100644 --- a/src/backend/utils/activity/pgstat_shmem.c +++ b/src/backend/utils/activity/pgstat_shmem.c @@ -551,6 +551,7 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref, /* only dropped entries can reach a 0 refcount */ Assert(entry_ref->shared_entry->dropped); +pg_usleep(10L); shent = dshash_find(pgStatLocal.shared_hash, &entry_ref->shared_entry->key, true);
Re: BF mamba failure
Amit Kapila writes: > Peter Smith has recently reported a BF failure [1]. AFAICS, the call > stack of failure [2] is as follows: Note the assertion report a few lines further up: TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004 regards, tom lane
BF mamba failure
Hi, Peter Smith has recently reported a BF failure [1]. AFAICS, the call stack of failure [2] is as follows: 0x1e66644 at postgres 0x1d0143c at postgres 0x1d02534 at postgres 0x1cfb120 at postgres 0x1cfd590 at postgres 0x1cfbfc0 at postgres 0x1ca7b08 at postgres 0x1ca7c74 at postgres 0x1ca7d2c at postgres 0x1cdf060 at postgres 0x1c203f4 at postgres 0x1c2161c at postgres 0x1edcf94 at postgres I couldn't correlate it to the recent commits. Any thoughts? [1] - https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com [2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-03-17%2005%3A36%3A10 -- With Regards, Amit Kapila.