Re: Vectored I/O in bulk_write.c
On Tue, Apr 09, 2024 at 04:51:52PM +1200, Thomas Munro wrote:
> Here's a rebase. I decided against committing this for v17 in the
> end. There's not much wrong with it AFAIK, except perhaps an
> unprincipled chopping up of writes with large io_combine_limit due to
> simplistic flow control, and I liked the idea of having a decent user
> of smgrwritev() in the tree, and it probably makes CREATE INDEX a bit
> faster, but... I'd like to try something more ambitious that
> streamifies this and also the "main" writeback paths.

I see this in the CF as Needs Review since 2024-03-10, but this 2024-04-09 message sounds like you were abandoning it. Are you still soliciting reviews of this thread's patches, or is the CF entry obsolete?
Re: Vectored I/O in bulk_write.c
Here's a rebase. I decided against committing this for v17 in the end. There's not much wrong with it AFAIK, except perhaps an unprincipled chopping up of writes with large io_combine_limit due to simplistic flow control, and I liked the idea of having a decent user of smgrwritev() in the tree, and it probably makes CREATE INDEX a bit faster, but... I'd like to try something more ambitious that streamifies this and also the "main" writeback paths. I shared some patches for that that are counterparts to this, over at[1]. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK1in4FiWtisXZ%2BJo-cNSbWjmBcPww3w3DBM%2BwhJTABXA%40mail.gmail.com From 7ee50aae3d4eba0df5bce05c196f411abb0bd9ab Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 8 Apr 2024 18:19:41 +1200 Subject: [PATCH v7 1/3] Use smgrwritev() for both overwriting and extending. Since mdwrite() and mdextend() were very similar and both needed vectored variants, merge them into a single interface. This reduces duplication and fills out the set of operations. We still want to be able to assert that callers know the difference between overwriting and extending, and to activate slightly different behavior during recovery, so add a new "flags" argument to control that. The goal here is to provide the missing vectored interface without which the new bulk write facility from commit 8af25652 can't write to the file system in bulk. A following commit will connect them together. Like smgrwrite(), the traditional single-block smgrextend() function with skipFsync boolean argument is now translated to smgrwritev() by an inlinable wrapper function. The smgrzeroextend() interface remains distinct; the idea was floated of merging that too, but so far without consensus. 
Reviewed-by: Heikki Linnakangas Reviewed-by: Andres Freund Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com --- src/backend/storage/buffer/bufmgr.c | 10 +-- src/backend/storage/smgr/md.c | 120 src/backend/storage/smgr/smgr.c | 100 ++- src/include/storage/md.h| 4 +- src/include/storage/smgr.h | 19 - 5 files changed, 94 insertions(+), 159 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 44836751b71..5166a839da8 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4011,11 +4011,11 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, * * In recovery, we cache the value returned by the first lseek(SEEK_END) * and the future writes keeps the cached value up-to-date. See - * smgrextend. It is possible that the value of the first lseek is smaller - * than the actual number of existing blocks in the file due to buggy - * Linux kernels that might not have accounted for the recent write. But - * that should be fine because there must not be any buffers after that - * file size. + * smgrzeroextend and smgrwritev. It is possible that the value of the + * first lseek is smaller than the actual number of existing blocks in the + * file due to buggy Linux kernels that might not have accounted for the + * recent write. But that should be fine because there must not be any + * buffers after that file size. */ for (i = 0; i < nforks; i++) { diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index bf0f3ca76d1..73d077ca3ea 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -447,79 +447,10 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo) pfree(path); } -/* - * mdextend() -- Add a block to the specified relation. - * - * The semantics are nearly the same as mdwrite(): write at the - * specified position. 
However, this is to be used for the case of - * extending a relation (i.e., blocknum is at or beyond the current - * EOF). Note that we assume writing a block beyond current EOF - * causes intervening file space to become filled with zeroes. - */ -void -mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, - const void *buffer, bool skipFsync) -{ - off_t seekpos; - int nbytes; - MdfdVec*v; - - /* If this build supports direct I/O, the buffer must be I/O aligned. */ - if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ) - Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)); - - /* This assert is too expensive to have on normally ... */ -#ifdef CHECK_WRITE_VS_EXTEND - Assert(blocknum >= mdnblocks(reln, forknum)); -#endif - - /* - * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any - * more --- we mustn't create a block whose number actually is - * InvalidBlockNumber. (Note that this failure should be unreachable - * because of upstream checks in bufmgr.c.) - */ - if (blocknum == InvalidBlockNumber) - ereport(ERROR, -(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
Re: Vectored I/O in bulk_write.c
Then I would make the trivial change to respect the new io_combine_limit GUC that I'm gearing up to commit in another thread. As attached.

From 7993cede8939cad9172867ccc690a44ea25d1ad6 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 29 Mar 2024 00:22:53 +1300
Subject: [PATCH] fixup: respect io_combine_limit in bulk_write.c

---
 src/backend/storage/smgr/bulk_write.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index 848c3054f5..612a9a23b3 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -45,12 +45,6 @@
 #define MAX_PENDING_WRITES XLR_MAX_BLOCK_ID
 
-/*
- * How many blocks to send to smgrwritev() at a time. Arbitrary value for
- * now.
- */
-#define MAX_BLOCKS_PER_WRITE ((128 * 1024) / BLCKSZ)
-
 static const PGIOAlignedBlock zero_buffer = {{0}};	/* worth BLCKSZ */
 
 typedef struct PendingWrite
@@ -232,7 +226,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 	for (int i = 0; i < npending; i++)
 	{
 		Page		page;
-		const void *pages[MAX_BLOCKS_PER_WRITE];
+		const void *pages[MAX_IO_COMBINE_LIMIT];
 		BlockNumber blkno;
 		int			nblocks;
 		int			max_nblocks;
@@ -266,14 +260,14 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 			 * We're overwriting. Clamp at the existing size, because we
 			 * can't mix writing and extending in a single operation.
 			 */
-			max_nblocks = Min(lengthof(pages),
+			max_nblocks = Min(io_combine_limit,
 							  bulkstate->pages_written - blkno);
 		}
 		else
 		{
 			/* We're extending. */
 			Assert(blkno == bulkstate->pages_written);
-			max_nblocks = lengthof(pages);
+			max_nblocks = io_combine_limit;
 		}
 
 		/* Find as many consecutive blocks as we can. */
-- 
2.39.3 (Apple Git-146)
Re: Vectored I/O in bulk_write.c
On Sun, Mar 17, 2024 at 8:10 AM Andres Freund wrote:
> I don't think zeroextend on the one hand and a normal write or extend on
> the other hand are really the same operation. In the former case the
> content is hard-coded; in the latter it's caller-provided. Sure, we can
> deal with that by special-casing NULL content - but at that point, what's
> the benefit of combining the two operations? I'm not dead-set against
> this, just not really convinced that it's a good idea to combine the
> operations.

I liked some things about that, but I'm happy to drop that part. Here's a version that leaves smgrzeroextend() alone, and I also gave up on moving errors and assertions up into the smgr layer for now to minimise the change.

So to summarise, this gives us smgrwritev(..., flags), where flags can include _SKIP_FSYNC and _EXTEND. This way we don't have to invent smgrextendv(). The old single-block smgrextend() still exists as a static inline wrapper.

v6-0001-Use-smgrwritev-for-both-overwriting-and-extending.patch Description: Binary data
v6-0002-Use-vectored-I-O-for-bulk-writes.patch Description: Binary data
v6-0003-Improve-bulk_write.c-memory-management.patch Description: Binary data
Re: Vectored I/O in bulk_write.c
Hi,

On 2024-03-16 12:27:15 +1300, Thomas Munro wrote:
> I canvassed Andres off-list since smgrzeroextend() is his invention,
> and he wondered if it was a good idea to blur the distinction between
> the different zero-extension strategies like that. Good question. My
> take is that it's fine:
>
> mdzeroextend() already uses fallocate() only for nblocks > 8, but
> otherwise writes out zeroes, because that was seen to interact better
> with file system heuristics on common systems. That preserves current
> behaviour for callers of plain-old smgrextend(), which becomes a
> wrapper for smgrwrite(..., 1, _ZERO | _EXTEND). If some hypothetical
> future caller wants to be able to call smgrwritev(..., NULL, 9 blocks,
> _ZERO | _EXTEND) directly without triggering the fallocate() strategy
> for some well researched reason, we could add a new flag to say so, ie
> adjust that gating.
>
> In other words, we have already blurred the semantics. To me, it
> seems nicer to have a single high level interface for the same logical
> operation, and flags to select strategies more explicitly if that is
> eventually necessary.

I don't think zeroextend on the one hand and a normal write or extend on the other hand are really the same operation. In the former case the content is hard-coded; in the latter it's caller-provided. Sure, we can deal with that by special-casing NULL content - but at that point, what's the benefit of combining the two operations? I'm not dead-set against this, just not really convinced that it's a good idea to combine the operations.

Greetings,

Andres Freund
Re: Vectored I/O in bulk_write.c
I canvassed Andres off-list since smgrzeroextend() is his invention, and he wondered if it was a good idea to blur the distinction between the different zero-extension strategies like that. Good question. My take is that it's fine:

mdzeroextend() already uses fallocate() only for nblocks > 8, but otherwise writes out zeroes, because that was seen to interact better with file system heuristics on common systems. That preserves current behaviour for callers of plain-old smgrextend(), which becomes a wrapper for smgrwrite(..., 1, _ZERO | _EXTEND). If some hypothetical future caller wants to be able to call smgrwritev(..., NULL, 9 blocks, _ZERO | _EXTEND) directly without triggering the fallocate() strategy for some well-researched reason, we could add a new flag to say so, ie adjust that gating.

In other words, we have already blurred the semantics. To me, it seems nicer to have a single high-level interface for the same logical operation, and flags to select strategies more explicitly if that is eventually necessary.
Re: Vectored I/O in bulk_write.c
On Thu, Mar 14, 2024 at 10:49 AM Heikki Linnakangas wrote: > I tried to say that smgr implementation might have better ways to assert > that than calling smgrnblocks(), so it would be better to leave it to > the implementation. But what bothered me most was that smgrwrite() had a > different signature than mdwrite(). I'm happy with the way you have it > in the v4 patch. Looking for other things that can be hoisted up to smgr.c level because they are part of the contract or would surely need to be duplicated by any implementation: I think the check that you don't exceed the maximum possible block number could be there too, no? Here's a version like that. Does anyone else want to object to doing this for 17? Probably still needs some cleanup eg comments etc that may be lurking around the place and another round of testing. As for the overall idea, I'll leave it a few days and see if others have feedback. My take is that this is what bulk_write.c was clearly intended to do, it's just that smgr let it down by not allowing vectored extension yet. It's a fairly mechanical simplification, generalisation, and net code reduction to do so by merging, like this. From 61b351b60d22060e5fc082645cdfc19188ac4841 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 9 Mar 2024 16:04:21 +1300 Subject: [PATCH v5 1/3] Use smgrwritev() for both overwriting and extending. Since mdwrite() and mdextend() were basically the same and both need vectored variants, merge them into a single interface to reduce duplication. We still want to be able to assert that callers know the difference between overwriting and extending and activate slightly different behavior during recovery, so use a new flags argument to control that. Likewise for the zero-extending variant, which is has much in common at the interface level, except it doesn't deal in buffers. 
The traditional single-block smgrwrite() and smgrextend() functions with skipFsync boolean argument are translated to smgrwritev() by inlinable wrapper functions, for low-overhead backwards-compatibility. Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com --- src/backend/storage/buffer/bufmgr.c | 15 +-- src/backend/storage/buffer/localbuf.c | 3 +- src/backend/storage/smgr/md.c | 136 src/backend/storage/smgr/smgr.c | 145 +++--- src/include/storage/md.h | 7 +- src/include/storage/smgr.h| 22 ++-- 6 files changed, 110 insertions(+), 218 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f0f8d4259c..6caf35a9d6 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -2064,7 +2064,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, io_start = pgstat_prepare_io_time(track_io_timing); /* - * Note: if smgrzeroextend fails, we will end up with buffers that are + * Note: if smgrwritev fails, we will end up with buffers that are * allocated but not marked BM_VALID. The next relation extension will * still select the same block number (because the relation didn't get any * longer on disk) and so future attempts to extend the relation will find @@ -2073,7 +2073,8 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, * * We don't need to set checksum for all-zero pages. */ - smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false); + smgrwritev(bmr.smgr, fork, first_block, NULL, extend_by, + SMGR_WRITE_EXTEND | SMGR_WRITE_ZERO); /* * Release the file-extension lock; it's now OK for someone else to extend @@ -3720,11 +3721,11 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, * * In recovery, we cache the value returned by the first lseek(SEEK_END) * and the future writes keeps the cached value up-to-date. See - * smgrextend. 
It is possible that the value of the first lseek is smaller - * than the actual number of existing blocks in the file due to buggy - * Linux kernels that might not have accounted for the recent write. But - * that should be fine because there must not be any buffers after that - * file size. + * smgrwritev(). It is possible that the value of the first lseek is + * smaller than the actual number of existing blocks in the file due to + * buggy Linux kernels that might not have accounted for the recent write. + * But that should be fine because there must not be any buffers after + * that file size. */ for (i = 0; i < nforks; i++) { diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index fcfac335a5..5b2b0fe9f4 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -416,7 +416,8 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr, io_start = pgstat_prepare_io_time(track_io_timing); /* actually extend relation */ - smgrzeroe
Re: Vectored I/O in bulk_write.c
On 13/03/2024 23:12, Thomas Munro wrote: Alright, here is a first attempt at merging all three interfaces as you suggested. I like it! I especially like the way it removes lots of duplication. I don't understand your argument about the location of the write-vs-extent assertions. It seems to me that these are assertions about what the *public* smgrnblocks() function returns. In other words, we assert that the caller is aware of the current relation size (and has some kind of interlocking scheme for that to be possible), according to the smgr implementation's public interface. That's not an assertion about internal details of the smgr implementation, it's part of the "contract" for the API. I tried to say that smgr implementation might have better ways to assert that than calling smgrnblocks(), so it would be better to leave it to the implementation. But what bothered me most was that smgrwrite() had a different signature than mdwrite(). I'm happy with the way you have it in the v4 patch. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Vectored I/O in bulk_write.c
Alright, here is a first attempt at merging all three interfaces as you suggested. I like it! I especially like the way it removes lots of duplication. I don't understand your argument about the location of the write-vs-extent assertions. It seems to me that these are assertions about what the *public* smgrnblocks() function returns. In other words, we assert that the caller is aware of the current relation size (and has some kind of interlocking scheme for that to be possible), according to the smgr implementation's public interface. That's not an assertion about internal details of the smgr implementation, it's part of the "contract" for the API. From 0a57274e29369e61712941e379c24f7db1dec068 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 9 Mar 2024 16:04:21 +1300 Subject: [PATCH v4 1/3] Merge smgrzeroextend() and smgrextend() with smgrwritev(). Since mdwrite() and mdextend() were basically the same and both need vectored variants, merge them into a single interface. We still want to be able to assert that callers know the difference between overwriting and extending and activate slightly difference behavior during recovery, so use flags to control that. Likewise for the zero-extending variant, which is has much in common at the interface level, except it doesn't deal in buffers. The traditional single-block smgrwrite() and smgrextend() functions with skipFsync boolean argument are translated to smgrwritev() by inlinable wrapper functions, for low-overhead backwards-compatibility. 
Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com --- src/backend/storage/buffer/bufmgr.c | 7 +- src/backend/storage/buffer/localbuf.c | 3 +- src/backend/storage/smgr/md.c | 119 +--- src/backend/storage/smgr/smgr.c | 127 +- src/include/storage/md.h | 7 +- src/include/storage/smgr.h| 22 +++-- 6 files changed, 91 insertions(+), 194 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f0f8d4259c..52bbdff336 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -2064,7 +2064,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, io_start = pgstat_prepare_io_time(track_io_timing); /* - * Note: if smgrzeroextend fails, we will end up with buffers that are + * Note: if smgrwritev fails, we will end up with buffers that are * allocated but not marked BM_VALID. The next relation extension will * still select the same block number (because the relation didn't get any * longer on disk) and so future attempts to extend the relation will find @@ -2073,7 +2073,8 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, * * We don't need to set checksum for all-zero pages. */ - smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false); + smgrwritev(bmr.smgr, fork, first_block, NULL, extend_by, + SMGR_WRITE_EXTEND | SMGR_WRITE_ZERO); /* * Release the file-extension lock; it's now OK for someone else to extend @@ -3720,7 +3721,7 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, * * In recovery, we cache the value returned by the first lseek(SEEK_END) * and the future writes keeps the cached value up-to-date. See - * smgrextend. It is possible that the value of the first lseek is smaller + * smgrwritev(). 
It is possible that the value of the first lseek is smaller * than the actual number of existing blocks in the file due to buggy * Linux kernels that might not have accounted for the recent write. But * that should be fine because there must not be any buffers after that diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index fcfac335a5..5b2b0fe9f4 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -416,7 +416,8 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr, io_start = pgstat_prepare_io_time(track_io_timing); /* actually extend relation */ - smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false); + smgrwritev(bmr.smgr, fork, first_block, NULL, extend_by, + SMGR_WRITE_EXTEND | SMGR_WRITE_ZERO); pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND, io_start, extend_by); diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index bf0f3ca76d..0d560393e5 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -447,83 +447,14 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo) pfree(path); } -/* - * mdextend() -- Add a block to the specified relation. - * - * The semantics are nearly the same as mdwrite(): write at the - * specified position. However, this is to be used for the case of - * extending a relation (i.e., blocknum is at or beyond the curren
Re: Vectored I/O in bulk_write.c
On 13/03/2024 12:18, Thomas Munro wrote:
> On Wed, Mar 13, 2024 at 9:57 PM Heikki Linnakangas wrote:
>>> Here also is a first attempt at improving the memory allocation and
>>> memory layout.
>> ...
>>> +typedef union BufferSlot
>>> +{
>>> +	PGIOAlignedBlock buffer;
>>> +	dlist_node	freelist_node;
>>> +} BufferSlot;
>>> +
>>
>> If you allocated the buffers in one large contiguous chunk, you could
>> often do one large write() instead of a gathered writev() of multiple
>> blocks. That should be even better, although I don't know how much of a
>> difference it makes. The above layout wastes a fair amount of memory
>> too, because 'buffer' is I/O aligned.
>
> The patch I posted has an array of buffers with the properties you
> describe, so you get a pwrite() (no 'v') sometimes, and a pwritev() with
> a small iovcnt when it wraps around:

Oh I missed that it was "union BufferSlot". I thought it was a struct. Never mind then.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: Vectored I/O in bulk_write.c
On Wed, Mar 13, 2024 at 9:57 PM Heikki Linnakangas wrote:
> Let's bite the bullet and merge the smgrwrite and smgrextend functions
> at the smgr level too. I propose the following signature:
>
> #define SWF_SKIP_FSYNC 0x01
> #define SWF_EXTEND 0x02
> #define SWF_ZERO 0x04
>
> void smgrwritev(SMgrRelation reln, ForkNumber forknum,
>                 BlockNumber blocknum,
>                 const void **buffer, BlockNumber nblocks,
>                 int flags);
>
> This would replace smgrwrite, smgrextend, and smgrzeroextend. The

That sounds pretty good to me.

> > Here also is a first attempt at improving the memory allocation and
> > memory layout.
> > ...
> > +typedef union BufferSlot
> > +{
> > +	PGIOAlignedBlock buffer;
> > +	dlist_node	freelist_node;
> > +} BufferSlot;
> > +
>
> If you allocated the buffers in one large contiguous chunk, you could
> often do one large write() instead of a gathered writev() of multiple
> blocks. That should be even better, although I don't know how much of a
> difference it makes. The above layout wastes a fair amount of memory
> too, because 'buffer' is I/O aligned.

The patch I posted has an array of buffers with the properties you describe, so you get a pwrite() (no 'v') sometimes, and a pwritev() with a small iovcnt when it wraps around:

pwrite(...) = 131072 (0x20000)
pwritev(...,3,...) = 131072 (0x20000)
pwrite(...) = 131072 (0x20000)
pwritev(...,3,...) = 131072 (0x20000)
pwrite(...) = 131072 (0x20000)

Hmm, I expected pwrite() alternating with pwritev(iovcnt=2), the latter for when it wraps around the buffer array, so I'm not sure why it's 3. I guess the btree code isn't writing them strictly monotonically or something... I don't believe it wastes any memory on padding (except a few bytes wasted by palloc_aligned() before BulkWriteState):

(gdb) p &bulkstate->buffer_slots[0]
$4 = (BufferSlot *) 0x15c731cb4000
(gdb) p &bulkstate->buffer_slots[1]
$5 = (BufferSlot *) 0x15c731cb6000
(gdb) p sizeof(bulkstate->buffer_slots[0])
$6 = 8192
Re: Vectored I/O in bulk_write.c
(Replying to all your messages in this thread together)

> This made me wonder why smgrwritev() and smgrextendv() shouldn't be
> backed by the same implementation, since they are essentially the same
> operation.

+1 to the general idea of merging the write and extend functions.

> The differences are some assertions which might as well be moved up to
> the smgr.c level as they must surely apply to any future smgr
> implementation too, right?, and the segment file creation policy which
> can be controlled with a new argument. I tried that here.

Currently, smgrwrite/smgrextend just calls through to mdwrite/mdextend. I'd like to keep it that way. Otherwise we need to guess what a hypothetical smgr implementation might look like. For example this assertion:

	/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
	Assert(blocknum >= mdnblocks(reln, forknum));
#endif

I think that should continue to be md.c's internal affair. For example, imagine that you had a syscall like write() but which always writes to the end of the file and also returns the offset that the data was written to. You would want to assert the returned offset instead of the above. So -1 on moving the assertions up to smgr.c.

Let's bite the bullet and merge the smgrwrite and smgrextend functions at the smgr level too. I propose the following signature:

#define SWF_SKIP_FSYNC 0x01
#define SWF_EXTEND 0x02
#define SWF_ZERO 0x04

void smgrwritev(SMgrRelation reln, ForkNumber forknum,
                BlockNumber blocknum,
                const void **buffer, BlockNumber nblocks,
                int flags);

This would replace smgrwrite, smgrextend, and smgrzeroextend. The mdwritev() function would have the same signature. A single 'flags' arg looks better in the callers than booleans, because you don't need to remember what 'true' or 'false' means.
The callers would look like this:

/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, false) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, 0);

/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, SWF_SKIP_FSYNC);

/* like old smgrextend(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, SWF_EXTEND | SWF_SKIP_FSYNC);

/* like old smgrzeroextend(reln, MAIN_FORKNUM, 123, 10, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, NULL, 10, SWF_EXTEND | SWF_ZERO | SWF_SKIP_FSYNC);

> While thinking about that I realised that an existing write-or-extend
> assertion in master is wrong because it doesn't add nblocks. Hmm, it's
> a bit weird that we have nblocks as int or BlockNumber in various
> places, which I think should probably be fixed.

+1

> Here also is a first attempt at improving the memory allocation and
> memory layout.
> ...
> +typedef union BufferSlot
> +{
> +	PGIOAlignedBlock buffer;
> +	dlist_node	freelist_node;
> +} BufferSlot;
> +

If you allocated the buffers in one large contiguous chunk, you could often do one large write() instead of a gathered writev() of multiple blocks. That should be even better, although I don't know how much of a difference it makes. The above layout wastes a fair amount of memory too, because 'buffer' is I/O aligned.

> I wonder if bulk logging should trigger larger WAL writes in the "Have
> to write it ourselves" case in AdvanceXLInsertBuffer(), since writing
> 8kB of WAL at a time seems like an unnecessarily degraded level of
> performance, especially with wal_sync_method=open_datasync. Of course
> the real answer is "make sure wal_buffers is high enough for your
> workload" (usually indirectly by automatically scaling from
> shared_buffers), but this problem jumps out when tracing bulk_writes.c
> with default settings. We write out the index 128kB at a time, but the
> WAL 8kB at a time.

Makes sense.

On 12/03/2024 23:38, Thomas Munro wrote:
> One more observation while I'm thinking about bulk_write.c... hmm, it
> writes the data out and asks the checkpointer to fsync it, but doesn't
> call smgrwriteback(). I assume that means that on Linux the physical
> writeback sometimes won't happen until the checkpointer eventually
> calls fsync() sequentially, one segment file at a time. I see that it's
> difficult to decide how to do that though; unlike checkpoints, which
> have rate control/spreading, bulk writes could more easily flood the
> I/O subsystem in a burst. I expect most non-Linux systems' write-behind
> heuristics to fire up for bulk sequential writes, but on Linux where
> most users live, there is no write-behind heuristic AFAIK (on the most
> common file systems anyway), so you have to crank that handle if you
> want it to wake up and smell the coffee before it hits internal limits,
> but then you have to decide how fast to crank it. This problem will
> come into closer focus when we start talking about streaming writes.
> For the current non-streaming bulk_write.c coding, I don't have
Re: Vectored I/O in bulk_write.c
One more observation while I'm thinking about bulk_write.c... hmm, it writes the data out and asks the checkpointer to fsync it, but doesn't call smgrwriteback(). I assume that means that on Linux the physical writeback sometimes won't happen until the checkpointer eventually calls fsync() sequentially, one segment file at a time. I see that it's difficult to decide how to do that though; unlike checkpoints, which have rate control/spreading, bulk writes could more easily flood the I/O subsystem in a burst. I expect most non-Linux systems' write-behind heuristics to fire up for bulk sequential writes, but on Linux where most users live, there is no write-behind heuristic AFAIK (on the most common file systems anyway), so you have to crank that handle if you want it to wake up and smell the coffee before it hits internal limits, but then you have to decide how fast to crank it. This problem will come into closer focus when we start talking about streaming writes. For the current non-streaming bulk_write.c coding, I don't have any particular idea of what to do about that, so I'm just noting the observation here. Sorry for the sudden wall of text/monologue; this is all a sort of reaction to bulk_write.c that I should perhaps have sent to the bulk_write.c thread, triggered by a couple of debugging sessions.
Re: Vectored I/O in bulk_write.c
Here also is a first attempt at improving the memory allocation and memory layout.

I wonder if bulk logging should trigger larger WAL writes in the "Have to write it ourselves" case in AdvanceXLInsertBuffer(), since writing 8kB of WAL at a time seems like an unnecessarily degraded level of performance, especially with wal_sync_method=open_datasync. Of course the real answer is "make sure wal_buffers is high enough for your workload" (usually indirectly by automatically scaling from shared_buffers), but this problem jumps out when tracing bulk_write.c with default settings. We write out the index 128kB at a time, but the WAL 8kB at a time.

From 793caf2db6c00314f5bd8f7146d4797508f2f627 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 9 Mar 2024 16:04:21 +1300
Subject: [PATCH v3 1/3] Provide vectored variant of smgrextend().

Since mdwrite() and mdextend() were basically the same, merge them.
They had different assertions, but those would surely apply to any
implementation of smgr, so move them up to the smgr.c wrapper level.
The other difference was the policy on segment creation, but that can
be captured by having smgrwritev() and smgrextendv() call a single
mdwritev() function with a new "extend" flag.

Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com
---
 src/backend/storage/smgr/md.c   | 106 +---
 src/backend/storage/smgr/smgr.c |  32 ++
 src/include/storage/md.h        |   3 +-
 src/include/storage/smgr.h      |  12 +++-
 4 files changed, 47 insertions(+), 106 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bf0f3ca76d1..9b125841226 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -447,79 +447,10 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 	pfree(path);
 }
 
-/*
- * mdextend() -- Add a block to the specified relation.
- *
- * The semantics are nearly the same as mdwrite(): write at the
- * specified position.  However, this is to be used for the case of
- * extending a relation (i.e., blocknum is at or beyond the current
- * EOF).  Note that we assume writing a block beyond current EOF
- * causes intervening file space to become filled with zeroes.
- */
-void
-mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void *buffer, bool skipFsync)
-{
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
-	/* This assert is too expensive to have on normally ... */
-#ifdef CHECK_WRITE_VS_EXTEND
-	Assert(blocknum >= mdnblocks(reln, forknum));
-#endif
-
-	/*
-	 * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
-	 * more --- we mustn't create a block whose number actually is
-	 * InvalidBlockNumber.  (Note that this failure should be unreachable
-	 * because of upstream checks in bufmgr.c.)
-	 */
-	if (blocknum == InvalidBlockNumber)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("cannot extend file \"%s\" beyond %u blocks",
-						relpath(reln->smgr_rlocator, forknum),
-						InvalidBlockNumber)));
-
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
-
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
-
-	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
-	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not extend file \"%s\": %m",
-							FilePathName(v->mdfd_vfd)),
-					 errhint("Check free disk space.")));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not extend file \"%s\": wrote only %d of %d bytes at block %u",
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ, blocknum),
-				 errhint("Check free disk space.")));
-	}
-
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
-
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
-}
-
 /*
  * mdzeroextend() -- Add new zeroed out blocks to the specified relation.
  *
- * Similar to mdextend(), except the relation can be extended by multiple
- * blocks at once and the added blocks will be filled with zeroes.
+ * The added blocks will be filled with zeroes.
  */
 void
 mdzeroextend(SMgrRelation reln, ForkNumber forknum,
@@ -595,13 +526,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		{
 			int			ret;
 
-			/*
-			 * Even if we don't want to use fallocate, we can still extend a
-			 * bit more efficiently than writing each 8kB block individually.
-			 * pg_pwrite_zeros() (via FileZero()) uses pg_pwritev_with_retry()
-			 * to avoid multiple wri
Re: Vectored I/O in bulk_write.c
Slightly better version, adjusting a few obsoleted comments, adjusting the error message to distinguish write/extend, fixing a thinko in smgr_cached_nblocks maintenance.

From c786f979b0c38364775e32b9403b79303507d37b Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 9 Mar 2024 16:04:21 +1300
Subject: [PATCH v2 1/2] Provide vectored variant of smgrextend().

Since mdwrite() and mdextend() were basically the same, merge them.
They had different assertions, but those would surely apply to any
implementation of smgr, so move them up to the smgr.c wrapper level.
The other difference was the policy on segment creation, but that can
be captured by having smgrwritev() and smgrextendv() call a single
mdwritev() function with a new "extend" flag.
---
 src/backend/storage/smgr/md.c   | 106 +---
 src/backend/storage/smgr/smgr.c |  32 ++
 src/include/storage/md.h        |   3 +-
 src/include/storage/smgr.h      |  12 +++-
 4 files changed, 47 insertions(+), 106 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bf0f3ca76d..9b12584122 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -447,79 +447,10 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 	pfree(path);
 }
 
-/*
- * mdextend() -- Add a block to the specified relation.
- *
- * The semantics are nearly the same as mdwrite(): write at the
- * specified position.  However, this is to be used for the case of
- * extending a relation (i.e., blocknum is at or beyond the current
- * EOF).  Note that we assume writing a block beyond current EOF
- * causes intervening file space to become filled with zeroes.
- */
-void
-mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void *buffer, bool skipFsync)
-{
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
-	/* This assert is too expensive to have on normally ... */
-#ifdef CHECK_WRITE_VS_EXTEND
-	Assert(blocknum >= mdnblocks(reln, forknum));
-#endif
-
-	/*
-	 * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
-	 * more --- we mustn't create a block whose number actually is
-	 * InvalidBlockNumber.  (Note that this failure should be unreachable
-	 * because of upstream checks in bufmgr.c.)
-	 */
-	if (blocknum == InvalidBlockNumber)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("cannot extend file \"%s\" beyond %u blocks",
-						relpath(reln->smgr_rlocator, forknum),
-						InvalidBlockNumber)));
-
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
-
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
-
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
-
-	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
-	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not extend file \"%s\": %m",
-							FilePathName(v->mdfd_vfd)),
-					 errhint("Check free disk space.")));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not extend file \"%s\": wrote only %d of %d bytes at block %u",
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ, blocknum),
-				 errhint("Check free disk space.")));
-	}
-
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
-
-	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
-}
-
 /*
  * mdzeroextend() -- Add new zeroed out blocks to the specified relation.
  *
- * Similar to mdextend(), except the relation can be extended by multiple
- * blocks at once and the added blocks will be filled with zeroes.
+ * The added blocks will be filled with zeroes.
 */
 void
 mdzeroextend(SMgrRelation reln, ForkNumber forknum,
@@ -595,13 +526,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 		{
 			int			ret;
 
-			/*
-			 * Even if we don't want to use fallocate, we can still extend a
-			 * bit more efficiently than writing each 8kB block individually.
-			 * pg_pwrite_zeros() (via FileZero()) uses pg_pwritev_with_retry()
-			 * to avoid multiple writes or needing a zeroed buffer for the
-			 * whole length of the extension.
-			 */
+			/* Fall back to writing out zeroes. */
 			ret = FileZero(v->mdfd_vfd, seekpos,
 						   (off_t) BLCKSZ * numblocks,
 						   WAIT_EVENT_DATA_FILE_EXTEND);
@@ -920,19 +845,14 @@ mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 /*
  * mdwritev() -- Write the supplied blocks at the appropriate location.
  *
- * This is to be used only for updating already-existing blocks of a
- * relation (ie, those before the current EOF).  To extend a relation,
- * use mdextend().
+ * Note tha