On Sun, Mar 19, 2023 at 2:00 AM Alexander Lakhin <[email protected]> wrote:
>
> Hi,
>
> 18.03.2023 07:26, Tom Lane wrote:
>
> Amit Kapila <[email protected]> writes:
>
> Peter Smith has recently reported a BF failure [1]. AFAICS, the call
> stack of failure [2] is as follows:
>
> Note the assertion report a few lines further up:
>
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount)
> == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004
>
>
> This assertion failure can be reproduced easily with the attached patch:
> ============== running regression test queries ==============
> test oldest_xmin ... ok 55 ms
> test oldest_xmin ... FAILED (test process exited with exit
> code 1) 107 ms
> test oldest_xmin ... FAILED (test process exited with exit
> code 1) 8 ms
> ============== shutting down postmaster ==============
>
> contrib/test_decoding/output_iso/log/postmaster.log contains:
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount)
> == 0"), File: "pgstat_shmem.c", Line: 561, PID: 456844
>
> With the sleep placed above Assert(entry_ref->shared_entry->dropped) this
> Assert fails too.
>
> Best regards,
> Alexander
I applied a slightly modified* version of Alexander's patch [1] to the
latest HEAD code (but with my "toptxn" patch reverted).
* The modification: I injected a 'sleep' both above and
below the Assert(entry_ref->shared_entry->dropped).
Using this I was also able to reproduce the problem, but test failures
were rare. The make check-world run seemed OK, and the
test_decoding tests would also appear to PASS around 14 out of 15
times.
============== running regression test queries ==============
test oldest_xmin ... ok 342 ms
test oldest_xmin ... ok 121 ms
test oldest_xmin ... ok 283 ms
============== shutting down postmaster ==============
============== removing temporary instance ==============
=====================
All 3 tests passed.
=====================
~~
Often (but not always), despite test_decoding reporting all 3
tests as "ok", I still observed a TRAP in the logfile
(contrib/test_decoding/output_iso/log/postmaster.log).
TRAP: failed Assert("entry_ref->shared_entry->dropped")
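
Because the test summary can say "ok" while the server log still holds a
TRAP, it may help to scan the log directly after each run. The following is
only a minimal sketch: the function name find_traps is hypothetical, and it
assumes that every assertion report begins a log line with "TRAP:", as in
the excerpts quoted in this thread.

```python
def find_traps(log_path):
    """Return (line_number, text) for each log line that starts a TRAP report.

    Assumption: assertion reports begin a line with "TRAP:", as in the
    postmaster.log excerpts quoted in this thread.
    """
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if line.startswith("TRAP:"):
                hits.append((lineno, line.rstrip("\n")))
    return hits
```

Calling find_traps("contrib/test_decoding/output_iso/log/postmaster.log")
after a run returns an empty list when the log is clean, so a non-empty
result can be treated as a failure even if all 3 tests reported "ok".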
~~
Occasionally (about 1 in 15 test runs) the test would fail the same
way as described by Alexander [1], with the accompanying TRAP.
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount)
== 0"), File: "pgstat_shmem.c", Line: 562, PID: 32013
============== running regression test queries ==============
test oldest_xmin ... ok 331 ms
test oldest_xmin ... ok 91 ms
test oldest_xmin ... FAILED 702 ms
============== shutting down postmaster ==============
======================
1 of 3 tests failed.
======================
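
With a failure this rare, a handful of clean runs says little. As a rough
guide only, the sketch below estimates how many runs are needed before a
failure is likely to show up at least once. It assumes the runs are
independent and that the "1 in 15" rate observed above is representative;
neither is guaranteed for a timing-dependent failure. Both function names
are hypothetical.

```python
import math


def chance_of_at_least_one_failure(p_fail, runs):
    """Probability of seeing one or more failures in `runs` independent runs."""
    return 1.0 - (1.0 - p_fail) ** runs


def runs_needed(p_fail, confidence):
    """Smallest number of runs for which at least one failure is expected
    with probability >= `confidence`, assuming independent runs."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# Failure rate taken from the observation above (about 1 in 15 runs).
P_FAIL = 1.0 / 15.0
print(chance_of_at_least_one_failure(P_FAIL, 15))
print(runs_needed(P_FAIL, 0.95))
```

Under those assumptions, 15 runs catch the failure only about 64% of the
time, and roughly 44 consecutive clean runs are needed before one can be
95% confident that the failure is really gone.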
~~
FWIW, the "toptxn" patch, whose push coincided with the build-farm
error I first reported [2], turns out to be an innocent party in this
TRAP. We know this because all of the above results were obtained using
HEAD code with that "toptxn" patch reverted.
------
[1]
https://www.postgresql.org/message-id/1941b7e2-be7c-9c4c-8505-c0fd05910e9a%40gmail.com
[2]
https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com
Kind Regards,
Peter Smith.
Fujitsu Australia