On Fri, Jan 2, 2026 at 7:30 AM Amul Sul <[email protected]> wrote:
> Attached is the rebased version, includes some code comment improvements.
Regarding 0001, I don't think that passing the WAL segment size around
in a global variable is a good practice. It makes it hard to tell
whether the variable is guaranteed to have been set properly at all
the places where it is used. Of course, this patch set does not
introduce that problem, but I think it could avoid making it worse by
adding WalSegSz to XLogDumpPrivate instead of exporting the global
variable everywhere. That struct seems to be available everyplace
where we need the value. Actually, though, I'm inclined to remove the
file-level global as well. Patch for that attached. I'm not against
global variables in general, but I'm against global variables that are
just there to avoid passing the right stuff around via function
argument lists.
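To illustrate the direction I mean, roughly (the existing fields are the
ones pg_waldump.c already has; only the last one is new, and the name is
arbitrary):

typedef struct XLogDumpPrivate
{
	TimeLineID	timeline;
	XLogRecPtr	startptr;
	XLogRecPtr	endptr;
	bool		endptr_reached;
	int			WalSegSz;		/* WAL segment size, instead of a global */
} XLogDumpPrivate;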
0002 claims to be just a refactoring but if
(unlikely(private->endptr_reached)) return -1 is net new code, so I
object. I realize this contradicts the review comment from Chao Li,
but the principle that refactoring shouldn't change the logic is more
important than whatever value we might get out of adding this sanity
check here -- and I bet if we look into it we'll find that there are
lots of other places in PostgreSQL that would also need to be changed
if we wanted such a sanity check everywhere that we have code like
this. I'd also like to point out that the commit message isn't really
adequate to understand why this helps achieve the overall goal of the
patch set.
0003 looks functionally correct but notice that you've stolen some but
not all of the tests from the "# various ways of specifying WAL range"
section without copying the comment or explaining in the commit
message why you've moved some tests but not others.
Regarding 0004:
Instead of making ArchivedWAL_HTAB a global variable, can we just
store a pointer to it in XLogDumpPrivate? (I would change the name to
something else, something all lower case.) Or maybe the pointer should
go someplace else, but I'm skeptical of a global variable here for the
same reasons as 0001.
I think ArchivedWALFile would be a better name than ArchivedWALEntry,
and cur_wal_file or cur_file would be a better name than cur_wal.
I'm extremely happy with the way that the hash table looks and the new
logic in init_archive_reader() to make sure that we don't need to
reread the beginning of the WAL but can reuse the data already read.
That seems like a huge improvement to me.
+		if (recptr >= endPtr)
+		{
+			/* Discard the buffered data */
+			resetStringInfo(&entry->buf);
+			len = 0;
+
+			/*
+			 * Push back the partial page data for the current page to the
+			 * buffer, ensuring it remains full page available for re-reading
+			 * if requested.
+			 */
+			if (p > readBuff)
+			{
+				Assert((count - nbytes) > 0);
+				appendBinaryStringInfo(&entry->buf, readBuff, count - nbytes);
+			}
+		}
It feels pretty inelegant to me to pull data back into &entry->buf
from readBuff here. I think what you should do is just do this test
once, before entering the while (nbytes > 0) loop. Then you'll always
have p == readBuff, so the nested if statement is no longer required.
Also, the way you have it, if this triggers for the first loop
iteration, it will trigger for every subsequent loop iteration, which
seems wasteful.
+		if (len > 0 && recptr > startPtr)
+		{
+			int			skipBytes = 0;
+
+			/*
+			 * The required offset is not at the start of the buffer, so skip
+			 * bytes until reaching the desired offset of the target page.
+			 */
+			skipBytes = recptr - startPtr;
+
+			buf += skipBytes;
+			len -= skipBytes;
+		}
This logic seems pretty confusing. It looks as though len can go
negative, because what if skipBytes > len? I'm not sure that's really
possible here, but there's no comments explaining why it can't and no
assertion verifying that it doesn't. I think you should probably try
to unify this with the following if statement that is gated by (len >
0). In other places where we have code like this, what I've usually
found useful is to compute the number of bytes to copy as Min(bytes
available from source, bytes that will fit into the target), and I
think that technique might be useful here. For example, in
BlockRefTableRead, length is always the number of bytes that the
caller wants which we haven't yet read, and we compute bytes_to_copy =
Min(length, buffer->used - buffer->cursor). We then memcpy(data,
&buffer->data[buffer->cursor], bytes_to_copy) and then adjust data and
length. A similar bit of code not written by me can be found in
WALReadFromBuffers, which maintains nbytes as the remaining number of
bytes to be copied and then computes the size of the next copy as
npagebytes = Min(nbytes, XLOG_BLCKSZ - offset). I think your if (len >
0 && recptr > startPtr) block is trying to do something similar to
what the Min() is doing in those examples, but I find it hard to
understand whether it is doing it correctly. I think you need some
combination of clearer variable names, more comments, and/or rewriting
this to be easier to read and understand without help.
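Roughly, the pattern looks like this (a generic sketch using the
variable names from my description above, not your code):

	while (length > 0 && buffer->cursor < buffer->used)
	{
		int			bytes_to_copy;

		/* Copy no more than the caller wants and no more than we have. */
		bytes_to_copy = Min(length, buffer->used - buffer->cursor);
		memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);

		/* Advance both source and destination by the amount copied. */
		buffer->cursor += bytes_to_copy;
		data += bytes_to_copy;
		length -= bytes_to_copy;
	}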
+		entry = privateInfo->cur_wal;
+
+		/* Fetch more data */
+		if (read_archive_file(privateInfo, READ_CHUNK_SIZE) == 0)
+			break;				/* archive file ended */
Why do we set entry to privateInfo->cur_wal and then call
read_archive_file() afterwards? Doesn't read_archive_file() have the
potential to change cur_wal? If so, don't we want to look at the entry
that is current after we've called read_archive_file(), rather than
before? Also, to harp on the naming a bit more, stuff like entry =
privateInfo->cur_wal kind of shows that you haven't made the naming
real consistent. This could be more clear if it said something like
wal_file = privateInfo->cur_wal_file.
+		/* Found the required entry */
+		if (entry->segno == segno)
+			return entry;
+
+		/*
+		 * Ignore if the timeline is different or the current segment is not
+		 * the desired one.
+		 */
+		if (privateInfo->timeline != entry->timeline ||
+			privateInfo->startSegNo > entry->segno ||
+			privateInfo->endSegNo < entry->segno)
+		{
+			privateInfo->cur_wal = NULL;
+			continue;
+		}
+
I don't really understand what's happening here. One thing that's odd
is that the first of these two if statements considers the timeline
and the second does not. It seems unlikely that this is correct. I
suspect that the hash table needs to be keyed by (TLI, segno) rather
than just by segno, and that get_archive_wal_entry needs to take both a
TLI and a segment number as an argument. Otherwise, what could
possibly prevent the first if statement from returning the correct
segment from the wrong timeline?
(Likewise, I think stuff in 0005 like prepare_tmp_write() should work
on a TLI and a segno, not just a segno. There probably should be very
few places in the patch where you pass only a segno and no TLI.
get_tmp_walseg_path() is a pretty clear sign of the problem: If you
were including the TLI in the file name, you would just use
XLogFileName() here, but because you left out the TLI, you end up
having to invent your own file naming convention. That seems pretty
bad: wouldn't we like the temporary file names to have the same names
as the files inside the archive? What happens if the same WAL file
exists in the archive with two different TLIs?)
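For instance, if the TLI were part of the interface, the spill file
could be named with the standard macro instead of a custom convention,
something like the following, where tmp_dir stands in for whatever
directory the patch already uses:

	char		fname[MAXFNAMELEN];
	char		path[MAXPGPATH];

	/* Name the spill file exactly like the segment inside the archive. */
	XLogFileName(fname, tli, segno, WalSegSz);
	snprintf(path, MAXPGPATH, "%s/%s", tmp_dir, fname);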
Another thing that's odd is: why is this function resetting
privateInfo->cur_wal to NULL? I feel like it should be the job of
astreamer_waldump_content() to manage the value of
privateInfo->cur_wal, and this function doesn't seem to have any
business touching it. Even if it thinks that we want to ignore that
WAL file for purposes of *this* call to get_archive_wal_entry(),
surely it has no right to decide that the *next* call to this function
should ignore that entry as well.
Honestly, I don't really understand why we need to poke into
privateInfo at all here. I feel like the shape of this loop should be:
while (1) { if (read_archive_file(...) == 0) break; if (we have the
correct entry) return entry; }. It seems to me that it ought to be
the job of the archive streamer to handle everything else, including
buffering more data and spilling to files. This should just iterate,
repeatedly calling read_archive_file(), until we either get to the
file we want or run out of archive.
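Spelled out, something like this, where lookup_archived_wal() is just a
placeholder for however you decide to check whether the wanted (TLI,
segno) has shown up:

	for (;;)
	{
		ArchivedWALEntry *entry;

		/* Let the streamer buffer (or spill) more data from the archive. */
		if (read_archive_file(privateInfo, READ_CHUNK_SIZE) == 0)
			return NULL;		/* ran out of archive */

		/* Stop as soon as the file we're after has appeared. */
		entry = lookup_archived_wal(tli, segno);	/* placeholder lookup */
		if (entry != NULL)
			return entry;
	}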
+		/*
+		 * XXX: If the segment being read not the requested one, the data must
+		 * be buffered, as we currently lack the mechanism to write it to a
+		 * temporary file. This is a known limitation that will be fixed in
+		 * the next patch, as the buffer could grow up to the full WAL segment
+		 * size.
+		 */
+		if (segno > entry->segno)
+			continue;
I couldn't figure out what was going on here for a while. Then I
realized that this makes some sense: if (segno > entry->segno), that
means that the segment number that we've found in the archive is older
than the requested segment number. Because we don't back up, that
means we don't need the data. I think the idea is that we would fall
through here and buffer the data if this if-test fails, but patch 0005
actually removes these two lines of code, so I think something here is
kind of messed up. I think the result will be that with both patches
applied, we'll buffer data we don't actually need for anything.
Generally, I think that some details here should be revisited:
+typedef struct ArchivedWALEntry
+{
+	uint32		status;			/* hash status */
+	XLogSegNo	segno;			/* hash key: WAL segment number */
+	TimeLineID	timeline;		/* timeline of this wal file */
+
+	StringInfoData buf;
+	bool		tmpseg_exists;	/* spill file exists? */
+
+	int			total_read;		/* total read of archived WAL segment */
+} ArchivedWALEntry;
The first few members are great, except that timeline needs to be part
of the hash key as well, so you probably need to make a two-element
struct with a segno and a TLI and make that struct be the thing that
comes right after status. After that, I'm not so sure. Essentially, I
think what's happening here is that segno*file_size+total_read is the
end LSN for buf, and the start LSN value is that value minus buf.len.
And if tmpseg_exists is true, then the buffer might be empty, and you
can read however much of the segment we found from the temp file. I
find that confusing. If we're going to do something like this, I feel
like it would make more sense to store XLogRecPtr startptr instead of
int total_read, so that we explicitly mention the first LSN stored in
the buffer, and you infer the ending LSN from the length of the
buffer.
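Concretely, something in this direction (member names are only
placeholders):

typedef struct ArchivedWALKey
{
	TimeLineID	tli;			/* timeline of the archived file */
	XLogSegNo	segno;			/* WAL segment number */
} ArchivedWALKey;

typedef struct ArchivedWALEntry
{
	uint32		status;			/* hash status */
	ArchivedWALKey key;			/* hash key: (TLI, segment number) */

	StringInfoData buf;			/* buffered data, if any */
	XLogRecPtr	startptr;		/* LSN of the first byte present in buf */
	bool		tmpseg_exists;	/* spill file exists? */
} ArchivedWALEntry;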
But, stepping back a bit, why do we have a StringInfoData in every
ArchivedWALEntry? It makes sense to do it that way if we're going to
buffer all of the WAL data in memory. In that case, you need a
separate buffer for every ArchivedWALEntry, and so this is logical.
But if we're going to spill to disk, then there should only ever be
one buffer in use at any given time, so why have a buffer in every
ArchivedWALEntry instead of a single buffer inside of XLogDumpPrivate?
We could choose to treat the hash table as the set of previous
segments for which we have written data to disk, and the current
segment and any associated buffer would be tracked inside the
XLogDumpPrivate or astreamer_waldump, which would mean that only one
copy of them would exist. I'm tempted to say that the direction I'm
sketching here is the right way to go, but I'm not 100% certain that's
true. It could be right to have a buffer per WAL file, as you have
here, but if so, we should know why we're doing that and how it fits
into the design, and I'm not currently seeing a real reason for it.
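If we went that way, the in-memory state would all live in one place,
along these lines (again just a sketch, with made-up field names):

typedef struct XLogDumpPrivate
{
	TimeLineID	timeline;
	XLogRecPtr	startptr;
	XLogRecPtr	endptr;
	bool		endptr_reached;

	/* archive-reading state: a single buffer, for the current segment */
	ArchivedWALEntry *cur_wal_file; /* segment currently being read */
	StringInfoData archive_buf;		/* data buffered for that segment */
	XLogRecPtr	buf_startptr;		/* LSN of first byte in archive_buf */
} XLogDumpPrivate;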
+	if (strstr(member->pathname, "PaxHeaders."))
+		return false;
There is no way that a filename containing the string "PaxHeaders."
could ever pass the IsXLogFileName test just above. We shouldn't need
this.
--
Robert Haas
EDB: http://www.enterprisedb.com
0001-Remove-file-level-global-WalSegSz.patch.nocfbot
Description: Binary data
