On 31/08/2023 07:00, Thomas Munro wrote:
Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that.  This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, as evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:

 13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

 4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped that this would simplify it. I didn't look closely at that patch, so maybe it's simpler even though it's more code.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.

All this makes sense. Some random comments on the patches:

+       /* Avoid a slightly more expensive kernel call if there is no benefit. 
*/
+       if (iovcnt == 1)
+               returnCode = pg_pread(vfdP->fd,
+                                                         iov[0].iov_base,
+                                                         iov[0].iov_len,
+                                                         offset);
+       else
+               returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself? pg_readv() is currently just a macro if the system provides preadv(), but it could be a "static inline" that does the above dance. I think that optimization is platform-dependent anyway, pread() might not be any faster on some OSs. In particular, if the system doesn't provide preadv() and we use the implementation in src/port/preadv.c, it's the same kernel call anyway.

v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

/*
 * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
 * the returned buffer can be used immediately.  Otherwise, a physical read
 * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
 * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
 * caller has the opportunity to coalesce reads of neighboring blocks into one
 * CompleteReadBuffers() call.
 *
 * *found is set to true for a hit, and false for a miss.
 *
 * *allocated is set to true for a miss that allocates a buffer for the first
 * time.  If there are multiple calls to PrepareReadBuffer() for the same
 * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
 * read, then only the first such call will receive *allocated == true, which
 * the caller might use to issue just one prefetch hint.
 */
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
                                  ForkNumber forkNum,
                                  BlockNumber blockNum,
                                  BufferAccessStrategy strategy,
                                  bool *found,
                                  bool *allocated)


If you decide you don't want to perform the read, after all, is there a way to abort it without calling CompleteReadBuffers()? Looking at the later patch that introduces the streaming read API, seems that it finishes all the reads, so I suppose we don't need an abort function. Does it all get cleaned up correctly on error?

/*
 * Convert an array of buffer address into an array of iovec objects, and
 * return the number that were required.  'iov' must have enough space for up
 * to PG_IOV_MAX elements.
 */
static int
buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
The comment is a bit inaccurate. There's an assertion that If nblocks <= PG_IOV_MAX, so while it's true that 'iov' must have enough space for up to PG_IOV_MAX elements, that's only because we also assume that nblocks <= PG_IOV_MAX.

I don't see anything in the callers (mdreadv() and mdwritev()) to prevent them from passing nblocks > PG_IOV_MAX.

in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
        uintptr_t pgsr_private,
        void *per_io_private,
        BufferManagerRelation *bmr,
        ForkNumber *forkNum,
        BlockNumber *blockNum,
        ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on each read. I see that you used that in the WAL replay prefetching, so I guess that makes sense.

extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void 
**per_io_private);
extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL replay prefetching patch, and only after calling pg_streaming_read_reset(). Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly? Is there any need to "reset" the stream, without also starting prefetching?

In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

@@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 
block_id,
  * If the WAL record contains a block reference with the given ID, *rlocator,
  * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
  * returns true.  Otherwise returns false.
+ *
+ * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
+ * of the pin is transferred to the caller.
  */
 bool
 XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
@@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, 
uint8 block_id,
        if (blknum)
                *blknum = bkpb->blkno;
        if (prefetch_buffer)
+       {
                *prefetch_buffer = bkpb->prefetch_buffer;
+
+               /*
+                * Clear this flag is so that we can assert that redo records 
take
+                * ownership of all buffers pinned by xlogprefetcher.c.
+                */
+               bkpb->prefetch_buffer = InvalidBuffer;
+       }
        return true;
 }

Could these changes be committed independently of all the other changes?

--
Heikki Linnakangas
Neon (https://neon.tech)



Reply via email to