On 31/08/2023 07:00, Thomas Munro wrote:
Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that. This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".
I love this idea! Makes it a lot easier to perform prefetch, as
evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:
13 files changed, 289 insertions(+), 637 deletions(-)
I'm a bit disappointed and surprised by
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:
4 files changed, 244 insertions(+), 78 deletions(-)
The current prefetching logic in vacuumlazy.c is pretty hairy, so I
hoped that this would simplify it. I didn't look closely at that patch,
so maybe it's simpler even though it's more code.
There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.
All this makes sense. Some random comments on the patches:
+ /* Avoid a slightly more expensive kernel call if there is no benefit.
*/
+ if (iovcnt == 1)
+ returnCode = pg_pread(vfdP->fd,
+ iov[0].iov_base,
+ iov[0].iov_len,
+ offset);
+ else
+ returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
How about pushing down this optimization to pg_preadv() itself?
pg_readv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.
v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
No smgrextendv()? I guess none of the patches here needed it.
/*
* Prepare to read a block. The buffer is pinned. If this is a 'hit', then
* the returned buffer can be used immediately. Otherwise, a physical read
* should be completed with CompleteReadBuffers(). PrepareReadBuffer()
* followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
* caller has the opportunity to coalesce reads of neighboring blocks into one
* CompleteReadBuffers() call.
*
* *found is set to true for a hit, and false for a miss.
*
* *allocated is set to true for a miss that allocates a buffer for the first
* time. If there are multiple calls to PrepareReadBuffer() for the same
* block before CompleteReadBuffers() or ReadBuffer_common() finishes the
* read, then only the first such call will receive *allocated == true, which
* the caller might use to issue just one prefetch hint.
*/
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *found,
bool *allocated)
If you decide you don't want to perform the read, after all, is there a
way to abort it without calling CompleteReadBuffers()? Looking at the
later patch that introduces the streaming read API, seems that it
finishes all the reads, so I suppose we don't need an abort function.
Does it all get cleaned up correctly on error?
/*
* Convert an array of buffer address into an array of iovec objects, and
* return the number that were required. 'iov' must have enough space for up
* to PG_IOV_MAX elements.
*/
static int
buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
The comment is a bit inaccurate. There's an assertion that If nblocks
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for
up to PG_IOV_MAX elements, that's only because we also assume that
nblocks <= PG_IOV_MAX.
I don't see anything in the callers (mdreadv() and mdwritev()) to
prevent them from passing nblocks > PG_IOV_MAX.
in streaming_read.h:
typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);
I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.
extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void
**per_io_private);
extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
extern void pg_streaming_read_free(PgStreamingRead *pgsr);
Do we need to expose pg_streaming_read_prefetch()? It's only used in the
WAL replay prefetching patch, and only after calling
pg_streaming_read_reset(). Could pg_streaming_read_reset() call
pg_streaming_read_prefetch() directly? Is there any need to "reset" the
stream, without also starting prefetching?
In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:
@@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8
block_id,
* If the WAL record contains a block reference with the given ID, *rlocator,
* *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
* returns true. Otherwise returns false.
+ *
+ * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
+ * of the pin is transferred to the caller.
*/
bool
XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
@@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record,
uint8 block_id,
if (blknum)
*blknum = bkpb->blkno;
if (prefetch_buffer)
+ {
*prefetch_buffer = bkpb->prefetch_buffer;
+
+ /*
+ * Clear this flag is so that we can assert that redo records
take
+ * ownership of all buffers pinned by xlogprefetcher.c.
+ */
+ bkpb->prefetch_buffer = InvalidBuffer;
+ }
return true;
}
Could these changes be committed independently of all the other changes?
--
Heikki Linnakangas
Neon (https://neon.tech)