On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote:
Out of curiosity I've tried the same with an uncompressed dump
(--compress=none). Surprisingly, the block size seems to be even smaller.
With my patched pg_restore I get only 4K reads and nothing else in the
strace output.
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
To clarify this output again, I have a huge uncompressed custom format
dump without TOC (because pg_dump was writing to stdout), and at this
point pg_restore is going through the whole archive to find the items it
needs. Allow me to explain what goes on at this point since I have
better insight now.
The code in question, in pg_backup_custom.c:
/*
 * Skip data from current file position.
 * Data blocks are formatted as an integer length, followed by data.
 * A zero length indicates the end of the block.
 */
static void
_skipData(ArchiveHandle *AH)
{
    lclContext *ctx = (lclContext *) AH->formatData;
    size_t      blkLen;
    char       *buf = NULL;
    int         buflen = 0;

    blkLen = ReadInt(AH);
    while (blkLen != 0)
    {
        /* Sequential access is usually faster, so avoid seeking if the
         * jump forward is shorter than 1MB. */
        if (ctx->hasSeek && blkLen > 1024 * 1024)
        {
            if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
                pg_fatal("error during file seek: %m");
        }
        else
        {
            if (blkLen > buflen)
            {
                free(buf);
                buf = (char *) pg_malloc(blkLen);
                buflen = blkLen;
            }
            if (fread(buf, 1, blkLen, AH->FH) != blkLen)
            {
                if (feof(AH->FH))
                    pg_fatal("could not read from input file: end of file");
                else
                    pg_fatal("could not read from input file: %m");
            }
        }

        blkLen = ReadInt(AH);
    }

    free(buf);
}
blkLen is almost always a number around 35 to 38.
So fread() is called over and over, each time reading only ~35 bytes.
Then ReadInt() calls getc() a few times to fetch the next length,
and the loop repeats.
libc buffers the stream in 4k chunks, and that's how we end up seeing so
many 4k reads. This also explains the ~80 lseek() calls between each 4k
read() on the unpatched version, mentioned in my previous email.
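
To show the access pattern in isolation, here is a minimal sketch of a
skip loop over tiny length-prefixed blocks. It is not the pg_restore code:
the function name skip_small_blocks() is hypothetical, and the format is
simplified to a single length byte per block, whereas the real archive
uses ReadInt(), which reads several bytes. Every fread() and getc() here
is served from the stdio buffer, so under strace one would expect to see
only buffer-sized read() calls, however small the blocks are.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Simplified stand-in for the skip loop (an assumption for illustration):
 * each block is one length byte followed by that many data bytes, and a
 * zero length byte marks the end.
 *
 * Returns the number of blocks skipped, or -1 if the input is truncated.
 */
long
skip_small_blocks(FILE *fp)
{
    char buf[255];          /* a one-byte length can never exceed 255 */
    long nblocks = 0;
    int  len;

    assert(fp != NULL);

    /* getc() plays the role of ReadInt(): a few bytes from the buffer */
    while ((len = getc(fp)) != EOF && len != 0)
    {
        /* tiny fread(), also satisfied from the stdio buffer */
        if (fread(buf, 1, (size_t) len, fp) != (size_t) len)
            return -1;
        nblocks++;
    }

    return (len == EOF) ? -1 : nblocks;
}
```

The number of read() system calls depends only on the file size and the
stream buffer size, not on the number of blocks, which is why the buffer
size is the knob worth turning.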
I've tried setvbuf() like Thomas Munro suggested and I saw a significant
improvement by allocating and using a 1MB buffer for libc stream
buffering.
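
For reference, a minimal sketch of the setvbuf() approach, assuming a
caller-allocated buffer. The BufferedInput struct and the buffered_open()
and buffered_close() helpers are hypothetical names for illustration and
are not what the patch looks like. The buffer is allocated explicitly
because an implementation is not required to honour the size argument
when a NULL buffer is passed to setvbuf().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: the buffer must outlive the stream. */
typedef struct BufferedInput
{
    FILE *fp;
    char *buf;
} BufferedInput;

/* Open path for reading with a stdio buffer of bufsize bytes.
 * Returns 0 on success, -1 on failure. */
int
buffered_open(BufferedInput *in, const char *path, size_t bufsize)
{
    assert(in != NULL);
    in->fp = NULL;
    in->buf = NULL;

    in->fp = fopen(path, "rb");
    if (in->fp == NULL)
        return -1;

    in->buf = malloc(bufsize);
    /* setvbuf() must be called before any other operation on the stream */
    if (in->buf == NULL ||
        setvbuf(in->fp, in->buf, _IOFBF, bufsize) != 0)
    {
        fclose(in->fp);
        free(in->buf);
        in->fp = NULL;
        in->buf = NULL;
        return -1;
    }
    return 0;
}

/* Close the stream first, then release the buffer it was using. */
void
buffered_close(BufferedInput *in)
{
    if (in->fp != NULL)
        fclose(in->fp);
    free(in->buf);
    in->fp = NULL;
    in->buf = NULL;
}
```

Called as buffered_open(&in, path, 1024 * 1024), the small fread() and
getc() calls stay the same, but the underlying read() calls should grow
to the buffer size.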
One question remains: where does pg_dump set this ~35-byte block length?
Is that easy to change without breaking old versions?
Thanks in advance,
Dimitris