O (WIP)

Heikki Linnakangas Wed, 27 Sep 2023 20:12:52 -0700

On 31/08/2023 07:00, Thomas Munro wrote:

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that.  This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, asevidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:


 13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised byv1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:


 4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so Ihoped that this would simplify it. I didn't look closely at that patch,so maybe it's simpler even though it's more code.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.


All this makes sense. Some random comments on the patches:

+       /* Avoid a slightly more expensive kernel call if there is no benefit. 
*/
+       if (iovcnt == 1)
+               returnCode = pg_pread(vfdP->fd,
+                                                         iov[0].iov_base,
+                                                         iov[0].iov_len,
+                                                         offset);
+       else
+               returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself?pg_readv() is currently just a macro if the system provides preadv(),but it could be a "static inline" that does the above dance. I thinkthat optimization is platform-dependent anyway, pread() might not be anyfaster on some OSs. In particular, if the system doesn't providepreadv() and we use the implementation in src/port/preadv.c, it's thesame kernel call anyway.

v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch


No smgrextendv()? I guess none of the patches here needed it.

/*
 * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
 * the returned buffer can be used immediately.  Otherwise, a physical read
 * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
 * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
 * caller has the opportunity to coalesce reads of neighboring blocks into one
 * CompleteReadBuffers() call.
 *
 * *found is set to true for a hit, and false for a miss.
 *
 * *allocated is set to true for a miss that allocates a buffer for the first
 * time.  If there are multiple calls to PrepareReadBuffer() for the same
 * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
 * read, then only the first such call will receive *allocated == true, which
 * the caller might use to issue just one prefetch hint.
 */
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
                                  ForkNumber forkNum,
                                  BlockNumber blockNum,
                                  BufferAccessStrategy strategy,
                                  bool *found,
                                  bool *allocated)

If you decide you don't want to perform the read, after all, is there away to abort it without calling CompleteReadBuffers()? Looking at thelater patch that introduces the streaming read API, seems that itfinishes all the reads, so I suppose we don't need an abort function.Does it all get cleaned up correctly on error?

/*
 * Convert an array of buffer address into an array of iovec objects, and
 * return the number that were required.  'iov' must have enough space for up
 * to PG_IOV_MAX elements.
 */
static int
buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)

The comment is a bit inaccurate. There's an assertion that If nblocks<= PG_IOV_MAX, so while it's true that 'iov' must have enough space forup to PG_IOV_MAX elements, that's only because we also assume thatnblocks <= PG_IOV_MAX.

I don't see anything in the callers (mdreadv() and mdwritev()) toprevent them from passing nblocks > PG_IOV_MAX.


in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
        uintptr_t pgsr_private,
        void *per_io_private,
        BufferManagerRelation *bmr,
        ForkNumber *forkNum,
        BlockNumber *blockNum,
        ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately oneach read. I see that you used that in the WAL replay prefetching, so Iguess that makes sense.

extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void 
**per_io_private);
extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in theWAL replay prefetching patch, and only after callingpg_streaming_read_reset(). Could pg_streaming_read_reset() callpg_streaming_read_prefetch() directly? Is there any need to "reset" thestream, without also starting prefetching?


In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

@@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 
block_id,
  * If the WAL record contains a block reference with the given ID, *rlocator,
  * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
  * returns true.  Otherwise returns false.
+ *
+ * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
+ * of the pin is transferred to the caller.
  */
 bool
 XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
@@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, 
uint8 block_id,
        if (blknum)
                *blknum = bkpb->blkno;
        if (prefetch_buffer)
+       {
                *prefetch_buffer = bkpb->prefetch_buffer;
+
+               /*
+                * Clear this flag is so that we can assert that redo records 
take
+                * ownership of all buffers pinned by xlogprefetcher.c.
+                */
+               bkpb->prefetch_buffer = InvalidBuffer;
+       }
        return true;
 }


Could these changes be committed independently of all the other changes?

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Streaming I/O, vectored I/O (WIP)

Reply via email to