Hello hackers,

I've investigated a recent buildfarm failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-05-02%2006%3A40%3A36

where the test failed due to a CRC error:
2024-05-02 17:08:18.401 ACST [3406:7] LOG:  incorrect resource manager data 
checksum in record at 0/F14D7A60

(Chipmunk produced similar errors as well:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-08-25%2019%3A40%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2024-03-22%2003%3A20%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-08-19%2006%3A43%3A20
)

and discovered that XLogRecordAssemble() calculates CRC over a buffer,
that might be modified by another process.

With the attached patch applied, the following test run:
echo "
autovacuum_naptime = 1
autovacuum_vacuum_threshold = 1

wal_consistency_checking = all
" >/tmp/temp.config

for ((i=1;i<=100;i++)); do echo "iteration $i"; TEMP_CONFIG=/tmp/temp.config TESTS="test_setup hash_index" make check-tests -s || break; done

fails for me on iterations 7, 10, 17:
ok 1         - test_setup                               2557 ms
not ok 2     - hash_index                              24719 ms
# (test process exited with exit code 2)

postmaster.log contains:
2024-05-10 12:46:44.320 UTC checkpointer[1881151] LOG:  checkpoint starting: 
immediate force wait
2024-05-10 12:46:44.365 UTC checkpointer[1881151] LOG:  checkpoint complete: wrote 41 buffers (0.3%); 0 WAL file(s) added, 0 removed, 26 recycled; write=0.001 s, sync=0.001 s, total=0.046 s; sync files=0, longest=0.000 s, average=0.000 s; distance=439134 kB, estimate=527137 kB; lsn=0/3CE131F0, redo lsn=0/3CE13198
TRAP: failed Assert("memcmp(block1_ptr, block1_copy, block1_len) == 0"), File: 
"xloginsert.c", Line: 949, PID: 1881271
ExceptionalCondition at assert.c:52:13
XLogRecordAssemble at xloginsert.c:953:1
XLogInsert at xloginsert.c:520:9
hashbucketcleanup at hash.c:844:14
hashbulkdelete at hash.c:558:3
index_bulk_delete at indexam.c:760:1
vac_bulkdel_one_index at vacuum.c:2498:10
lazy_vacuum_one_index at vacuumlazy.c:2443:10
lazy_vacuum_all_indexes at vacuumlazy.c:2026:26
lazy_vacuum at vacuumlazy.c:1944:10
lazy_scan_heap at vacuumlazy.c:1050:3
heap_vacuum_rel at vacuumlazy.c:503:2
vacuum_rel at vacuum.c:2214:2
vacuum at vacuum.c:622:8
autovacuum_do_vac_analyze at autovacuum.c:3102:2
do_autovacuum at autovacuum.c:2425:23
AutoVacWorkerMain at autovacuum.c:1569:3
pgarch_die at pgarch.c:846:1
StartChildProcess at postmaster.c:3929:5
StartAutovacuumWorker at postmaster.c:3997:12
process_pm_pmsignal at postmaster.c:3809:3
ServerLoop at postmaster.c:1667:5
BackgroundWorkerInitializeConnection at postmaster.c:4156:1
main at main.c:184:3
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc71a8d7e40]
postgres: autovacuum worker regression(_start+0x25)[0x556a8631a5e5]
2024-05-10 12:46:45.038 UTC checkpointer[1881151] LOG:  checkpoint starting: 
immediate force wait
2024-05-10 12:46:45.965 UTC autovacuum worker[1881275] LOG: automatic analyze of table 
"regression.pg_catalog.pg_attribute"
        avg read rate: 0.000 MB/s, avg write rate: 5.409 MB/s
        buffer usage: 1094 hits, 0 misses, 27 dirtied
        system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.03 s
2024-05-10 12:46:46.892 UTC postmaster[1881150] LOG:  server process (PID 
1881271) was terminated by signal 6: Aborted
2024-05-10 12:46:46.892 UTC postmaster[1881150] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE public.hash_cleanup_heap

(This can be reproduced with 027_stream_regress, of course, but it would
take more time.)

Best regards,
Alexander
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 9047601534..aeb462c137 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -902,8 +902,21 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	 */
 	INIT_CRC32C(rdata_crc);
 	COMP_CRC32C(rdata_crc, hdr_scratch + SizeOfXLogRecord, hdr_rdt.len - SizeOfXLogRecord);
+
+char block1_copy[BLCKSZ];
+char *block1_ptr = NULL;
+int block1_len = 0;
+
 	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
+{
+	if (!block1_ptr)
+	{
+		memcpy(block1_copy, rdt->data, rdt->len);
+		block1_ptr = rdt->data;
+		block1_len = rdt->len;
+	}
 		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
+}
 
 	/*
 	 * Ensure that the XLogRecord is not too large.
@@ -930,6 +943,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	rechdr->xl_prev = InvalidXLogRecPtr;
 	rechdr->xl_crc = rdata_crc;
 
+if (block1_ptr)
+{
+pg_usleep(100);
+Assert(memcmp(block1_ptr, block1_copy, block1_len) == 0);
+}
+
 	return &hdr_rdt;
 }
 

Reply via email to