Hello, I did stress testing on the v35 patches: a concurrency test using pgbench with 50 concurrent clients and 4 threads, running the pgbench script below (dual_chaos.sql) against the following table setup (setup.sql). I ran pgbench with 5M rows for 10 minutes and with 50M rows for ~45 minutes, multiple times. REPACK (concurrently) ran successfully except once (see below). I created a shadow/clone table to check correctness after the concurrency test, using 4 checks to verify that the data is intact and REPACK (concurrently) completed successfully:
1) Was the table's file OID (relfilenode) swapped?
2) Is the bloat gone? The victim relation's size should be less than
the shadow relation's size.
3) Data integrity, using FULL JOIN logic (borrowed from repack.spec, with a
small change) against the shadow table, which undergoes the same concurrent
operations as the victim table, i.e. dual writes (see dual_chaos.sql).
4) Physical index integrity via amcheck (borrowed from Mihail's tests).
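For reference, the four checks look roughly like this (a sketch only; stress_shadow, the id join column, and the :'old_filenode' psql variable captured before the REPACK are names I am using here for illustration, not the exact ones from my scripts):

```sql
-- 1) relfilenode swapped? compare against the value captured before REPACK
SELECT pg_relation_filenode('stress_victim') <> :'old_filenode' AS swapped;

-- 2) bloat gone? the repacked victim should now be smaller than the shadow
SELECT pg_relation_size('stress_victim')
     < pg_relation_size('stress_shadow') AS bloat_gone;

-- 3) data integrity: the FULL JOIN should return zero rows when both
--    tables hold the same data (the real check compares all columns)
SELECT v.id, s.id
FROM stress_victim v
FULL JOIN stress_shadow s USING (id)
WHERE v.id IS NULL OR s.id IS NULL;

-- 4) physical index integrity via amcheck, for every index on the victim
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check(i.indexrelid, true)
FROM pg_index i
WHERE i.indrelid = 'stress_victim'::regclass;
```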
The concurrency test failed once. I tried to reproduce the scenario below
but had no luck. I think the assert failure happened because, after a
speculative insert, there might be no spec CONFIRM or ABORT record, thoughts?
TRAP: failed Assert("!specinsert"), File: "reorderbuffer.c", Line: 2610,
PID: 3956168
postgres: REPACK decoding worker for relation "stress_victim"
(ExceptionalCondition+0x98)[0xaaaab1251188]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x67b1cc)[0xaaaab0f4b1cc]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x67b86c)[0xaaaab0f4b86c]
postgres: REPACK decoding worker for relation "stress_victim"
(ReorderBufferCommit+0x74)[0xaaaab0f4b8f0]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x66229c)[0xaaaab0f3229c]
postgres: REPACK decoding worker for relation "stress_victim"
(xact_decode+0x1a0)[0xaaaab0f312bc]
postgres: REPACK decoding worker for relation "stress_victim"
(LogicalDecodingProcessRecord+0xd4)[0xaaaab0f30e60]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x3372e4)[0xaaaab0c072e4]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x339634)[0xaaaab0c09634]
postgres: REPACK decoding worker for relation "stress_victim"
(RepackWorkerMain+0x1ac)[0xaaaab0c094e8]
postgres: REPACK decoding worker for relation "stress_victim"
(BackgroundWorkerMain+0x2b0)[0xaaaab0efc440]
postgres: REPACK decoding worker for relation "stress_victim"
(postmaster_child_launch+0x1f0)[0xaaaab0f00398]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x639ca4)[0xaaaab0f09ca4]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x639f94)[0xaaaab0f09f94]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x638714)[0xaaaab0f08714]
postgres: REPACK decoding worker for relation "stress_victim"
(+0x635978)[0xaaaab0f05978]
postgres: REPACK decoding worker for relation "stress_victim"
(PostmasterMain+0x160c)[0xaaaab0f050c8]
postgres: REPACK decoding worker for relation "stress_victim"
(main+0x3dc)[0xaaaab0d974d4]
/lib/aarch64-linux-gnu/libc.so.6(+0x284c4)[0xffff867584c4]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffff86758598]
postgres: REPACK decoding worker for relation "stress_victim"
(_start+0x30)[0xaaaab09bc1f0]
2026-02-19 18:20:56.088 IST [3905812] LOG: checkpoint starting: wal
2026-02-19 18:21:10.683 IST [3905808] LOG: background worker "REPACK
decoding worker" (PID 3956168) was terminated by signal 6: Aborted
Crash Test:
I did a crash test with a debugger, using a breakpoint inside
apply_concurrent_changes to simulate a crash while concurrent changes are
being applied. After a few concurrent changes were applied, I crashed the
server using "pg_ctl -m immediate stop" and then restarted it. I observed
that REPACK (concurrently) did not complete (expected), the files were not
swapped, and the data in the victim table is intact (checked using the FULL
JOIN against the shadow table). However, there are some leftovers of the
transient table used by REPACK (concurrently):
1) The transient table's relation files - these consume extra space. I think
this was already the case with VACUUM FULL, so they have to be removed
manually, but I think this time we have some "leverage" we can use to
reclaim the extra space.
2) The transient table's WAL - this is generated by applying the logically
decoded concurrent changes to the new transient table. I think this is not
a problem as long as the segments only get recycled, but if they get
archived they are of no use; they just consume extra space and time during
the archival process.
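As a side note, one rough way to spot such orphaned relation files after recovery is to compare the filenames in the database directory against pg_class. This is only a heuristic sketch, not part of the patch: it ignores segment suffixes (.1, _fsm, _vm) and mapped relations, so matches are candidates for inspection, not certainties:

```sql
-- On-disk relation files in the current database's directory that no
-- pg_class entry points to (candidate leftovers of the transient table).
SELECT fname
FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                           WHERE datname = current_database())) AS fname
WHERE fname ~ '^[0-9]+$'
  AND fname::oid NOT IN (SELECT relfilenode FROM pg_class
                         WHERE relfilenode <> 0);
```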
"Leverage" Idea:
I think we can re-use the transient table's relation files and WAL during
crash recovery, so that the user doesn't have to re-run REPACK
(concurrently) after the server has recovered. For this we might need to
write a WAL record for REPACK (concurrently) that lets the startup process
know a REPACK (concurrently) was in progress, setting a flag. By the end of
the startup process all the WAL for the transient table has already been
applied, so the transient table is complete at that point; then we can do
the swap (finish_heap_swap) after checking the flag. These are all my
initial thoughts on this idea of reusing the "residue" files of the
transient table.
I could be totally wrong :) Please correct me if I am.
I think we also need to update this statement in repack.sgml regarding wal_level:
<listitem>
<para>
The <link
linkend="guc-wal-level"><varname>wal_level</varname></link>
configuration parameter is less than <literal>logical</literal>.
</para>
</listitem>
because of this commit: "POC: enable logical decoding when wal_level =
'replica' without a server restart" (67c2097).
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
setup.sql
Description: Binary data
dual_chaos.sql
Description: Binary data
