On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote:

Out of curiosity, I tried the same with an uncompressed dump (--compress=none). Surprisingly, the read size seems to be even smaller.

With my patched pg_restore I get only 4K reads and nothing else in the strace output.

read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096


To clarify this output again, I have a huge uncompressed custom format dump without TOC (because pg_dump was writing to stdout), and at this point pg_restore is going through the whole archive to find the items it needs. Allow me to explain what goes on at this point since I have better insight now.

The code in question, in pg_backup_custom.c:


/*
 * Skip data from current file position.
 * Data blocks are formatted as an integer length, followed by data.
 * A zero length indicates the end of the block.
*/
static void
_skipData(ArchiveHandle *AH)
{
        lclContext *ctx = (lclContext *) AH->formatData;
        size_t          blkLen;
        char       *buf = NULL;
        int                     buflen = 0;

        blkLen = ReadInt(AH);
        while (blkLen != 0)
        {
                /* Sequential access is usually faster, so avoid seeking if the
                 * jump forward is shorter than 1MB. */
                if (ctx->hasSeek && blkLen > 1024 * 1024)
                {
                        if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
                                pg_fatal("error during file seek: %m");
                }
                else
                {
                        if (blkLen > buflen)
                        {
                                free(buf);
                                buf = (char *) pg_malloc(blkLen);
                                buflen = blkLen;
                        }
                        if (fread(buf, 1, blkLen, AH->FH) != blkLen)
                        {
                                if (feof(AH->FH))
                                        pg_fatal("could not read from input file: end of file");
                                else
                                        pg_fatal("could not read from input file: %m");
                        }
                }

                blkLen = ReadInt(AH);
        }

        free(buf);
}


blkLen is almost always a number between 35 and 38.
So fread() is called over and over, each time reading only about 35 bytes.
Then ReadInt() calls getc() a few times to fetch the next length.
And the loop repeats.
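
To make the access pattern concrete, here is a minimal standalone sketch (not PostgreSQL code) that writes many small length-prefixed blocks and then skips over them by reading, the same way the loop above does. The 4-byte little-endian length prefix and the helper names (read_len, write_len, write_blocks, skip_blocks) are assumptions made for illustration; the real ReadInt() encoding is different.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical simplified length prefix: 4 bytes, little-endian.
 * This is NOT the encoding used by pg_dump's ReadInt(); it only mimics
 * the "a few getc() calls per length" behaviour.
 */
static size_t
read_len(FILE *fp)
{
	size_t		len = 0;

	for (int i = 0; i < 4; i++)
	{
		int			c = getc(fp);

		if (c == EOF)
		{
			fprintf(stderr, "unexpected end of file\n");
			exit(1);
		}
		len |= (size_t) c << (8 * i);
	}
	return len;
}

static void
write_len(FILE *fp, size_t len)
{
	for (int i = 0; i < 4; i++)
		putc((int) ((len >> (8 * i)) & 0xFF), fp);
}

/* Write nblocks blocks of blksize bytes, then the zero-length terminator. */
static void
write_blocks(FILE *fp, size_t nblocks, size_t blksize)
{
	for (size_t i = 0; i < nblocks; i++)
	{
		write_len(fp, blksize);
		for (size_t j = 0; j < blksize; j++)
			putc('x', fp);
	}
	write_len(fp, 0);
}

/* Skip every block by reading it; return the number of fread() calls made. */
static size_t
skip_blocks(FILE *fp)
{
	char		buf[256];
	size_t		calls = 0;
	size_t		len = read_len(fp);

	while (len != 0)
	{
		assert(len <= sizeof(buf));	/* the sketch only handles tiny blocks */
		if (fread(buf, 1, len, fp) != len)
		{
			fprintf(stderr, "could not read from input file\n");
			exit(1);
		}
		calls++;
		len = read_len(fp);
	}
	return calls;
}
```

For example, after write_blocks(fp, 1000, 36) and rewind(fp), skip_blocks(fp) makes 1000 fread() calls plus 4004 getc() calls to get through about 40 kB of file, which a libc with a 4K stream buffer would serve with roughly ten read() system calls.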

Libc buffers the stream in 4K chunks, and that is why we end up seeing so many 4K reads. This also explains the ~80 lseek() calls between each 4K read() on the unpatched version, mentioned in my previous email.

I've tried setvbuf() as Thomas Munro suggested, and I saw a significant improvement from allocating a 1MB buffer and using it for libc stream buffering.

The question that remains: where does pg_dump set this ~35-byte block length? Is that easy to change without breaking old versions?
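
For reference, a minimal sketch of what such a setvbuf() call can look like in isolation. This is not the actual change to pg_restore; the helper name open_with_big_buffer and the STREAM_BUF_SIZE constant are made up for illustration, and only the 1MB size comes from the experiment described above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define STREAM_BUF_SIZE (1024 * 1024)	/* 1MB stdio buffer */

/*
 * Open a file for reading with a large, fully buffered stdio buffer.
 * Returns NULL on failure.  On success, *bufp receives the buffer, which
 * must stay allocated until after fclose() and then be freed by the caller.
 */
static FILE *
open_with_big_buffer(const char *path, char **bufp)
{
	FILE	   *fp;
	char	   *buf;

	assert(path != NULL && bufp != NULL);

	fp = fopen(path, "rb");
	if (fp == NULL)
		return NULL;

	buf = malloc(STREAM_BUF_SIZE);
	if (buf == NULL)
	{
		fclose(fp);
		return NULL;
	}

	/* setvbuf() must be called before any other operation on the stream. */
	if (setvbuf(fp, buf, _IOFBF, STREAM_BUF_SIZE) != 0)
	{
		fclose(fp);
		free(buf);
		return NULL;
	}

	*bufp = buf;
	return fp;
}
```

With a buffer like this, the many small fread() and getc() calls are still made, but they are served from memory and the underlying read() system calls should become far fewer and much larger.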


Thanks in advance,
Dimitris


