Hi all, (Adding in CC relevant committer and author, Teodor and Alexander) I have been looking today at a crash of pg_waldump on one of the test builds keeping running in some internal environment. Luckily, I have been able to put my hands on a core file with 11.2 running. The backtrace is not that interesting: (gdb) bt #0 btree_desc (buf=0x0, record=0x22ce590) at nbtdesc.c:103 #1 0x000000000040419f in XLogDumpDisplayRecord (config=0x7fff1ccfd360, record=0x22ce590) at /build/mts/release/bora-13598999/vpostgres/src/postgres/src/bin/pg_waldump/pg_waldump.c:558 #2 main (argc=<optimized out>, argv=<optimized out>) at /build/mts/release/bora-13598999/vpostgres/src/postgres/src/bin/pg_waldump/pg_waldump.c:1170 (gdb) down 1 #0 btree_desc (buf=0x0, record=0x22ce590) at nbtdesc.c:103 103 nbtdesc.c: No such file or directory. (gdb) p record $1 = (XLogReaderState *) 0x22ce590 (gdb) p *record $2 = {wal_segment_size = 16777216, read_page = 0x405220 <XLogDumpReadPage>, system_identifier = 0, private_data = 0x7fff1ccfd380, ReadRecPtr = 67109592, EndRecPtr = 67109672, decoded_record = 0x22cf178, main_data = 0x0, main_data_len = 0, main_data_bufsz = 0, record_origin = 0, blocks = {{in_use = true, rnode = {spcNode = 16399, dbNode = 16386, relNode = 19907}, forknum = MAIN_FORKNUM, blkno = 0, flags = 96 '`', has_image = false, apply_image = false, bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, bimg_info = 0 '\000', has_data = true, data = 0x22db2c0 "\003", data_len = 32, data_bufsz = 8192}, {in_use = false, rnode = {spcNode = 0, dbNode = 0, relNode = 0}, forknum = MAIN_FORKNUM, blkno = 0, flags = 0 '\000', has_image = false, apply_image = false, bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, bimg_info = 0 '\000', has_data = false, data = 0x0, data_len = 0, data_bufsz = 0} <repeats 32 times>}, max_block_id = 0, readBuf = 0x22ceea0 "\230\320\a", readLen = 8192, readSegNo = 4, readOff = 0, readPageTLI = 0, latestPagePtr = 67108864, latestPageTLI = 1, currRecPtr = 67109592, currTLI = 0, currTLIValidUntil = 0, nextTLI = 0, readRecordBuf = 0x22d12b0 "L", readRecordBufSize = 40960, errormsg_buf = 0x22d0eb0 ""} (gdb) p xlrec $5 = (xl_btree_metadata *) 0x0
Anyway, after looking at the code relevant to XLOG_BTREE_META_CLEANUP, I have noticed that the meta-data associated to the first buffer is registered via XLogRegisterBufData() (this is correct because we want to associate this data to the metadata buffer). However, nbtdesc.c assumes that xl_btree_metadata is from the main record data, causing a crash because we have nothing in this case. I think that we could just patch nbtpage.c so as we fetch the metadata using XLogRecGetBlockData(), and log its data. The error comes from 3d92796, which is one year-old and has been visibly committed untested. I am surprised that we have not seen that complain yet. Attached is a patch, which looks right to me and should be back-patched down to v11. I have not taken the time to actually test it though. Thoughts? -- Michael
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c index 989c85ac08..a14eb792ec 100644 --- a/src/backend/access/rmgrdesc/nbtdesc.c +++ b/src/backend/access/rmgrdesc/nbtdesc.c @@ -97,8 +97,10 @@ btree_desc(StringInfo buf, XLogReaderState *record) } case XLOG_BTREE_META_CLEANUP: { - xl_btree_metadata *xlrec = (xl_btree_metadata *) rec; + xl_btree_metadata *xlrec; + xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0, + NULL); appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f", xlrec->oldest_btpo_xact, xlrec->last_cleanup_num_heap_tuples);
signature.asc
Description: PGP signature