Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-03-07 15:44:46 +1300, Thomas Munro wrote: > On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro wrote: > > Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you > > initialised that struct. I guess it wants {{0}} instead of {0}. > > Apparently old GCC was wrong about that warning[1], but that system > > doesn't have the back-patched fixes? Not sure. > > Oh, you already pushed a fix. But now I'm wondering if it's useful to > have old buggy compilers set to run with -Werror. I think it's actively harmful to do so. Avoiding warnings on a > 10 year old compiler a) is a waste of time b) unnecessarily requires making our code uglier. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On 2023-Mar-07, Julien Rouhaud wrote: > I registered lapwing as a 32b Debian 7 so I thought it would be expected to > keep it as-is rather than upgrading to all newer major Debian versions, > especially since there were newer debian animal registered (no 32b though > AFAICS). I'm not opposed to upgrading it but I think there's still value in > having somewhat old packages versions being tested, especially since there > isn't much 32b coverage of those. I would be happy to register a newer 32b > version, or even sid, if needed but the -m32 part on the CI makes me think > there isn't much value doing that now. I think a pure 32bit animal running contemporary Debian would be better than just ditching the animal completely, as would appear to be the alternative, precisely because we have no other 32bit machine running x86 Linux. Maybe you can have *two* animals on the same machine: one running the old Debian without -Werror, and the one with new Debian and that flag kept. I think CI is not a replacement for the buildfarm. It helps catche some problems earlier, but we shouldn't think that we no longer need some buildfarm animals because CI runs those configs. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Mar 7, 2023 at 7:47 PM Julien Rouhaud wrote: > I registered lapwing as a 32b Debian 7 so I thought it would be expected to > keep it as-is rather than upgrading to all newer major Debian versions, > especially since there were newer debian animal registered (no 32b though > AFAICS). Animals do get upgraded: see the "w. e. f." ("with effect from") line in https://buildfarm.postgresql.org/cgi-bin/show_members.pl which comes from people running something like ./update_personality.pl --os-version "11" so that it shows up on the website. > I'm not opposed to upgrading it but I think there's still value in > having somewhat old packages versions being tested, especially since there > isn't much 32b coverage of those. I would be happy to register a newer 32b > version, or even sid, if needed but the -m32 part on the CI makes me think > there isn't much value doing that now. Totally up to you as an animal zoo keeper but in my humble opinion the interesting range of Debian releases currently is 11-13, or maybe 10 if you really want to test the LTS/old-stable release (and CI is testing 11). > I think this is the first time that a problem raised by -Werror on that old > animal is actually a false positive, while there were many times it reported > useful stuff. Now this has been up for years before we got better CI tooling, > especially with -m32 support, so there might not be any value to have it > anymore. As I mentioned at [1] I don't mind removing it or just work on > upgrading any dependency (or removing known buggy compiler flags) to keep it > without being annoying. In any case I'm usually quite fast at reacting to any > problem/complaint on that animal, so you don't have to worry about the > buildfarm being red too long if it came to that. Yeah, it's given us lots of useful data, thanks. Personally I would upgrade it so it keeps telling us useful things but I feel like I've said enough about that so I'll shut up now :-) Re: being red too long... yeah that reminds me, I really need to fix seawasp ASAP...
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Mar 07, 2023 at 07:14:51PM +1300, Thomas Munro wrote: > On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier wrote: > > On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote: > > > Oh, you already pushed a fix. But now I'm wondering if it's useful to > > > have old buggy compilers set to run with -Werror. > > > > Yes, as far as I can see when investigating the issue, this is an old > > bug of gcc when detecting where the initialization needs to be > > applied. And at the same time the fix is deadly simple, so the > > current statu-quo does not sound that bad to me. Note that lapwing is > > one of the only animals testing 32b builds, and it has saved from > > quite few bugs over the years. > > Yeah, but I'm just wondering, why not run a current release on it[1]? > Debian is one of the few distributions still supporting 32 bit > kernels, and it's good to test rare things, but AFAIK the primary > reason we finish up with EOL'd OSes in the 'farm is because they have > been forgotten (the secondary reason is because they couldn't be > upgraded because the OS dropped the [micro]architecture). I registered lapwing as a 32b Debian 7 so I thought it would be expected to keep it as-is rather than upgrading to all newer major Debian versions, especially since there were newer debian animal registered (no 32b though AFAICS). I'm not opposed to upgrading it but I think there's still value in having somewhat old packages versions being tested, especially since there isn't much 32b coverage of those. I would be happy to register a newer 32b version, or even sid, if needed but the -m32 part on the CI makes me think there isn't much value doing that now. Now about the -Werror: > BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit > kernel, which probably doesn't change much at the level we care about, > so maybe this doesn't matter much... just sharing an observation that > we're wasting time thinking about an OS release that gave up the ghost > in 2016, because it is running with -Werror. *shrug* I think this is the first time that a problem raised by -Werror on that old animal is actually a false positive, while there were many times it reported useful stuff. Now this has been up for years before we got better CI tooling, especially with -m32 support, so there might not be any value to have it anymore. As I mentioned at [1] I don't mind removing it or just work on upgrading any dependency (or removing known buggy compiler flags) to keep it without being annoying. In any case I'm usually quite fast at reacting to any problem/complaint on that animal, so you don't have to worry about the buildfarm being red too long if it came to that. [1] https://www.postgresql.org/message-id/20220921155025.wdixzbrt2uzbi6vz%40jrouhaud
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier wrote: > On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote: > > Oh, you already pushed a fix. But now I'm wondering if it's useful to > > have old buggy compilers set to run with -Werror. > > Yes, as far as I can see when investigating the issue, this is an old > bug of gcc when detecting where the initialization needs to be > applied. And at the same time the fix is deadly simple, so the > current statu-quo does not sound that bad to me. Note that lapwing is > one of the only animals testing 32b builds, and it has saved from > quite few bugs over the years. Yeah, but I'm just wondering, why not run a current release on it[1]? Debian is one of the few distributions still supporting 32 bit kernels, and it's good to test rare things, but AFAIK the primary reason we finish up with EOL'd OSes in the 'farm is because they have been forgotten (the secondary reason is because they couldn't be upgraded because the OS dropped the [micro]architecture). Unlike vintage SPARC, actual users might plausibly be running a current release on a 32 bit Intel system, I guess (maybe on a Quark microcontroller?)? BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit kernel, which probably doesn't change much at the level we care about, so maybe this doesn't matter much... just sharing an observation that we're wasting time thinking about an OS release that gave up the ghost in 2016, because it is running with -Werror. *shrug* [1] https://wiki.debian.org/DebianReleases
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Mar 06, 2023 at 02:29:50PM -0800, Andres Freund wrote: > Thanks. Sure, no problem. If there is anything else needed for this thread, feel free to ping me here. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote: > On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro wrote: >> Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you >> initialised that struct. I guess it wants {{0}} instead of {0}. >> Apparently old GCC was wrong about that warning[1], but that system >> doesn't have the back-patched fixes? Not sure. 6392f2a was one such case. > Oh, you already pushed a fix. But now I'm wondering if it's useful to > have old buggy compilers set to run with -Werror. Yes, as far as I can see when investigating the issue, this is an old bug of gcc when detecting where the initialization needs to be applied. And at the same time the fix is deadly simple, so the current statu-quo does not sound that bad to me. Note that lapwing is one of the only animals testing 32b builds, and it has saved from quite few bugs over the years. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro wrote: > Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you > initialised that struct. I guess it wants {{0}} instead of {0}. > Apparently old GCC was wrong about that warning[1], but that system > doesn't have the back-patched fixes? Not sure. Oh, you already pushed a fix. But now I'm wondering if it's useful to have old buggy compilers set to run with -Werror.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Mar 6, 2023 at 5:30 PM Michael Paquier wrote: > On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote: > > I ran some tests on my dev system [1] and I don't see much difference > > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from > > Andres to keep the code simple and elegant. > > This thread has stalled for a couple of weeks, so I have gone back to > it. Testing on a tmpfs I am not seeing a difference if performance > for any of the approaches discussed. At the end, as I am the one > behind the introduction of pg_pwrite_zeros(), I have applied v3 after > switches the size and offset parameters to be the same way as in v4. Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you initialised that struct. I guess it wants {{0}} instead of {0}. Apparently old GCC was wrong about that warning[1], but that system doesn't have the back-patched fixes? Not sure. [1] https://stackoverflow.com/questions/63355760/how-standard-is-the-0-initializer-in-c89
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-03-06 13:29:50 +0900, Michael Paquier wrote: > On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote: > > I ran some tests on my dev system [1] and I don't see much difference > > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from > > Andres to keep the code simple and elegant. > > This thread has stalled for a couple of weeks, so I have gone back to > it. Testing on a tmpfs I am not seeing a difference if performance > for any of the approaches discussed. At the end, as I am the one > behind the introduction of pg_pwrite_zeros(), I have applied v3 after > switches the size and offset parameters to be the same way as in v4. Thanks. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote: > I ran some tests on my dev system [1] and I don't see much difference > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from > Andres to keep the code simple and elegant. This thread has stalled for a couple of weeks, so I have gone back to it. Testing on a tmpfs I am not seeing a difference if performance for any of the approaches discussed. At the end, as I am the one behind the introduction of pg_pwrite_zeros(), I have applied v3 after switches the size and offset parameters to be the same way as in v4. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Feb 20, 2023 at 11:03 AM Michael Paquier wrote: > > > I'm inclined to go with my version, with the argument order swapped to > > Bharath's order. > > Okay. That's fine by me. I ran some tests on my dev system [1] and I don't see much difference between v3 and v4. So, +1 for v3 patch (+ argument order swap) from Andres to keep the code simple and elegant. [1] HEAD: 16MB (12.231 ms), 8190 Bytes (0.199 ms), 8192 Bytes (0.176 ms), 1GB (603.668 ms), 10GB (21184.936 ms (00:21.185)) v3 patch: 16MB (12.632 ms), 8190 Bytes (0.183 ms), 8192 Bytes (0.166 ms), 1GB (610.428 ms), 10GB (22647.308 ms (00:22.647)) v4 patch: 16MB (12.044 ms), 8190 Bytes (0.167 ms), 8192 Bytes (0.139 ms), 1GB (603.848 ms), 10GB (21225.331 ms (00:21.225)) -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Feb 17, 2023 at 09:31:14AM -0800, Andres Freund wrote: > On 2023-02-17 16:19:46 +0900, Michael Paquier wrote: >> But it looks like I misunderstood what this quote meant compared to >> what v3 does. It is true that v3 sets iov_len and iov_base more than >> needed when writing sizes larger than BLCKSZ. > > I don't think it does for writes larger than BLCKSZ, it just does more for > writes larger than PG_IKOV_MAX * BLCKSZ. But in those cases CPU time is going > to be spent elsewhere. Yep. >> Seems like you think that it is not really going to matter much to track >> which iovecs have been already initialized during the first loop on >> pg_pwritev_with_retry() to keep the code shorter? > > Yes. I'd bet that, in the unlikely case you're going to see any difference at > all, unconditionally initializing is going to win. > > Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every > call. Even if we could do some further optimizations of what I did in the > patch, you can initialize needed IOVs repeatedly a *lot* of times, before it > shows up... > > I'm inclined to go with my version, with the argument order swapped to > Bharath's order. Okay. That's fine by me. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-17 16:19:46 +0900, Michael Paquier wrote: > But it looks like I misunderstood what this quote meant compared to > what v3 does. It is true that v3 sets iov_len and iov_base more than > needed when writing sizes larger than BLCKSZ. I don't think it does for writes larger than BLCKSZ, it just does more for writes larger than PG_IKOV_MAX * BLCKSZ. But in those cases CPU time is going to be spent elsewhere. > Seems like you think that it is not really going to matter much to track > which iovecs have been already initialized during the first loop on > pg_pwritev_with_retry() to keep the code shorter? Yes. I'd bet that, in the unlikely case you're going to see any difference at all, unconditionally initializing is going to win. Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every call. Even if we could do some further optimizations of what I did in the patch, you can initialize needed IOVs repeatedly a *lot* of times, before it shows up... I'm inclined to go with my version, with the argument order swapped to Bharath's order. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Feb 16, 2023 at 11:00:20AM -0800, Andres Freund wrote: > I don't really understand this bit? As of this message, I saw this quote: https://www.postgresql.org/message-id/fcalj2acxebwy_bm3kmzekypcxsm+ygitpyhi4fdt6msk6yrt...@mail.gmail.com "However, it increases the number of iovec initialization with zerobuf for the cases when pg_pwrite_zeros is called for sizes far greater than BLCKSZ (for instance, WAL file initialization)." But it looks like I misunderstood what this quote meant compared to what v3 does. It is true that v3 sets iov_len and iov_base more than needed when writing sizes larger than BLCKSZ. Seems like you think that it is not really going to matter much to track which iovecs have been already initialized during the first loop on pg_pwritev_with_retry() to keep the code shorter? -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-16 16:58:23 +0900, Michael Paquier wrote: > On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote: > > The v3 patch reduces initialization of iovec array elements which is a > > clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ > > many times (I assume this is what is needed for the relation extension > > lock improvements feature). However, it increases the number of iovec > > initialization with zerobuf for the cases when pg_pwrite_zeros is > > called for sizes far greater than BLCKSZ (for instance, WAL file > > initialization). In those cases the cost of initializing the IOV doesn't matter, relative to the other costs. The important point is to not initialize a lot of elements if they're not even needed. Because we need to overwrite the trailing iov element, it doesn't seem worth to try to "pre-initialize" iov. Referencing a static variable is more expensive than accessing an on-stack variable. Having a first-call check is more expensive than not having it. Thus making the iov and zbuf_sz static isn't helpful. Particularly the latter seems like a bad idea, because it's a compiler constant. > It seems to me that v3 would do extra initializations only if > pg_pwritev_with_retry() does *not* retry its writes, but that's not > the case as it retries on a partial write as per its name. The number > of iov buffers is stricly capped by remaining_size. I don't really understand this bit? Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote: > The v3 patch reduces initialization of iovec array elements which is a > clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ > many times (I assume this is what is needed for the relation extension > lock improvements feature). However, it increases the number of iovec > initialization with zerobuf for the cases when pg_pwrite_zeros is > called for sizes far greater than BLCKSZ (for instance, WAL file > initialization). It seems to me that v3 would do extra initializations only if pg_pwritev_with_retry() does *not* retry its writes, but that's not the case as it retries on a partial write as per its name. The number of iov buffers is stricly capped by remaining_size. FWIW, I find v3 proposed more elegant. > FWIW, I attached v4 patch, a simplified version of the v2 - it > initializes all the iovec array elements if the total blocks to be > written crosses lengthof(iovec array), otherwise it initializes only > the needed blocks. + static size_t zbuf_sz = BLCKSZ; In v4, what's the advantage of marking that as static? It could actually be dangerous if this is carelessly updated. Well, that's not the case, still.. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Wed, Feb 15, 2023 at 6:25 AM Andres Freund wrote: > > > Done, PSA v2 patch. > > This feels way too complicated to me. How about something more like the > attached? Thanks. I kind of did cost analysis of v2 and v3: Input: zero-fill a file of size 256*8K. v2 patch: iovec initialization with zerobuf for loop - 1 time zero-fill 32 blocks at once - 8 times stack memory - sizeof(PGAlignedBlock) + sizeof(struct iovec) * PG_IOV_MAX v3 patch: iovec initialization with zerobuf for loop - 8 times (7 times more than v2 patch) zero-fill 32 blocks at once - 8 times (no change from v2 patch) stack memory - sizeof(PGAlignedBlock) + sizeof(struct iovec) * PG_IOV_MAX (no change from v2 patch) The v3 patch reduces initialization of iovec array elements which is a clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ many times (I assume this is what is needed for the relation extension lock improvements feature). However, it increases the number of iovec initialization with zerobuf for the cases when pg_pwrite_zeros is called for sizes far greater than BLCKSZ (for instance, WAL file initialization). FWIW, I attached v4 patch, a simplified version of the v2 - it initializes all the iovec array elements if the total blocks to be written crosses lengthof(iovec array), otherwise it initializes only the needed blocks. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From a941ea3de79cbe890196fa16f60d1e0797a9b119 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 15 Feb 2023 05:17:22 + Subject: [PATCH v4] Refactor pg_pwrite_zeros() This commit refactors pg_pwrite_zeros() as follows: - add offset parameter - avoid memset() of zerobuf on every call - don't initialize the whole IOV array unnecessarily - don't handle the smaller trailing write in a separate write call Suggested-by: Andres Freund Author: Bharath Rupireddy Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.d --- src/backend/access/transam/xlog.c | 2 +- src/bin/pg_basebackup/walmethods.c | 2 +- src/common/file_utils.c| 102 +++-- src/include/common/file_utils.h| 2 +- 4 files changed, 55 insertions(+), 53 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f9f0f6db8d..786b26054c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2981,7 +2981,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ - rc = pg_pwrite_zeros(fd, wal_segment_size); + rc = pg_pwrite_zeros(fd, wal_segment_size, 0); if (rc < 0) save_errno = errno; diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index 54014e2b84..6d14b988cb 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -222,7 +222,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, { ssize_t rc; - rc = pg_pwrite_zeros(fd, pad_to_size); + rc = pg_pwrite_zeros(fd, pad_to_size, 0); if (rc < 0) { diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 4dae534152..303bb07543 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -531,72 +531,74 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) /* * pg_pwrite_zeros * - * Writes zeros to file worth "size" bytes, using vectored I/O. + * Writes zeros to file worth "size" bytes from starting file offset 'offset', + * using vectored I/O. * * Returns the total amount of data written. On failure, a negative value * is returned with errno set. + * + * Note that pg_pwrite(), which gets called from this function, may change the + * file position (see win32pwrite.c). If needed, callers must deal with it. */ ssize_t -pg_pwrite_zeros(int fd, size_t size) +pg_pwrite_zeros(int fd, size_t size, off_t offset) { - PGAlignedBlock zbuffer; /* worth BLCKSZ */ - size_t zbuffer_sz; - struct iovec iov[PG_IOV_MAX]; - int blocks; - size_t remaining_size = 0; - int i; - ssize_t written; - ssize_t total_written = 0; - - zbuffer_sz = sizeof(zbuffer.data); - - /* Zero-fill the buffer. */ - memset(zbuffer.data, 0, zbuffer_sz); - - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (i = 0; i < lengthof(iov); ++i) + /* + * Zero-filled buffer worth BLCKSZ. We use static variable to avoid zeroing + * the buffer on every call. Also, with the const qualifier, it is ensured + * that the buffer is placed in read-only section of the process's memory. + */ + static const PGAlignedBlock zbuf = {0}; + static size_t zbuf_sz = BLCKSZ; + struct iovec iov[PG_IOV_MAX]; + int blocks; + int init_upto; + int i; + ssize_t ntotalwritten = 0; + size_t
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi On 2023-02-15 10:28:37 +0900, Kyotaro Horiguchi wrote: > At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund wrote > in > > Hi, > > > > On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote: > > > Done, PSA v2 patch. > > > > This feels way too complicated to me. How about something more like the > > attached? > > I like this one, but the parameters offset and size are in a different > order from pwrite(fd, buf, count, offset). I perfer the arrangement > suggested by Bharath. Yes, it probably is better. Not sure why I went with that order. > And isn't it better to use Min(remaining_size, BLCKSZ) instead of a bare if > statement? I really can't make myself care about which version is better :) Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund wrote in > Hi, > > On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote: > > Done, PSA v2 patch. > > This feels way too complicated to me. How about something more like the > attached? I like this one, but the parameters offset and size are in a different order from pwrite(fd, buf, count, offset). I perfer the arrangement suggested by Bharath. And isn't it better to use Min(remaining_size, BLCKSZ) instead of a bare if statement? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Feb 14, 2023 at 04:46:07PM -0800, Andres Freund wrote: > Then it really shouldn't have been named pg_pwrite_zeros(). The point of the > p{write,read}{,v} family of functions is to be able to specify the offset to > read/write at. I assume the p is for position, but I'm not sure. 'p' could stand for POSIX, though both read() and pread() are in it. Anyway, it looks that your guess may be right: https://stackoverflow.com/questions/17877556/what-does-p-stand-for-in-function-names-pwrite-and-pread Even there, people don't seem completely sure. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote: > On Mon, Feb 13, 2023 at 11:09 PM Andres Freund wrote: > > > > > Later code assigns iov[0].iov_len thus we need to provide a separate > > > iov non-const variable, or can we use pwrite instead there? (I didn't > > > find pg_pwrite_with_retry(), though) > > > > Given that we need to do that, and given that we already need to loop to > > handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not > > worth avoiding iov initialization. > > > > But I think it's worth limiting the initialization to blocks. > > We can still optimize away the for loop by using a single iovec for > remaining size, like the attached v2 patch. > > > I'd also try to combine the first pg_writev_* with the second one. > > Done, PSA v2 patch. This feels way too complicated to me. How about something more like the attached? > 2) A small test module passing in a file with the size to write isn't > multiple of block size, meaning, the code we have in the function to > write last remaining bytes (less than BLCKSZ) gets covered which isn't > covered right now - FWIW, I tested this locally by just specifying a smaller size than BLCKSZ for the write size. Greetings, Andres Freund >From ba0a8697c639870b76e92a285644a87d36cf4496 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 13 Feb 2023 17:11:29 -0800 Subject: [PATCH v3 01/12] Revise pg_pwrite_zeros() - add offset parameter - avoid memset() of zerobuf on every call - don't initialize the whole IOV array unnecessarily - don't handle the smaller trailing write in a separate write call --- src/include/common/file_utils.h| 2 +- src/common/file_utils.c| 67 ++ src/backend/access/transam/xlog.c | 2 +- src/bin/pg_basebackup/walmethods.c | 2 +- 4 files changed, 26 insertions(+), 47 deletions(-) diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h index bda6d3a5413..95b5cc8b3a1 100644 --- a/src/include/common/file_utils.h +++ b/src/include/common/file_utils.h @@ -44,6 +44,6 @@ extern ssize_t pg_pwritev_with_retry(int fd, int iovcnt, off_t offset); -extern ssize_t pg_pwrite_zeros(int fd, size_t size); +extern ssize_t pg_pwrite_zeros(int fd, off_t offset, size_t size); #endif /* FILE_UTILS_H */ diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 4dae534152f..a161b1c7bbe 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -537,62 +537,41 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) * is returned with errno set. */ ssize_t -pg_pwrite_zeros(int fd, size_t size) +pg_pwrite_zeros(int fd, off_t offset, size_t size) { - PGAlignedBlock zbuffer; /* worth BLCKSZ */ - size_t zbuffer_sz; + const static PGAlignedBlock zbuffer = {0}; /* worth BLCKSZ */ + void *zerobuf_addr = unconstify(PGAlignedBlock *, )->data; struct iovec iov[PG_IOV_MAX]; - int blocks; - size_t remaining_size = 0; - int i; - ssize_t written; + size_t remaining_size = size; ssize_t total_written = 0; - zbuffer_sz = sizeof(zbuffer.data); - - /* Zero-fill the buffer. */ - memset(zbuffer.data, 0, zbuffer_sz); - - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (i = 0; i < lengthof(iov); ++i) - { - iov[i].iov_base = zbuffer.data; - iov[i].iov_len = zbuffer_sz; - } - /* Loop, writing as many blocks as we can for each system call. */ - blocks = size / zbuffer_sz; - remaining_size = size % zbuffer_sz; - for (i = 0; i < blocks;) + while (remaining_size > 0) { - int iovcnt = Min(blocks - i, lengthof(iov)); - off_t offset = i * zbuffer_sz; - - written = pg_pwritev_with_retry(fd, iov, iovcnt, offset); - - if (written < 0) - return written; - - i += iovcnt; - total_written += written; - } - - /* Now, write the remaining size, if any, of the file with zeros. */ - if (remaining_size > 0) - { - /* We'll never write more than one block here */ - int iovcnt = 1; - - /* Jump on to the end of previously written blocks */ - off_t offset = i * zbuffer_sz; - - iov[0].iov_len = remaining_size; + int iovcnt = 0; + ssize_t written; + + for (; iovcnt < PG_IOV_MAX && remaining_size > 0; iovcnt++) + { + size_t this_iov_size; + + iov[iovcnt].iov_base = zerobuf_addr; + + if (remaining_size < BLCKSZ) +this_iov_size = remaining_size; + else +this_iov_size = BLCKSZ; + + iov[iovcnt].iov_len = this_iov_size; + remaining_size -= this_iov_size; + } written = pg_pwritev_with_retry(fd, iov, iovcnt, offset); if (written < 0) return written; + offset += written; total_written += written; } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f9f0f6db8d1..cffe72e2386 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2981,7 +2981,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-14 16:06:24 +0900, Michael Paquier wrote: > On Mon, Feb 13, 2023 at 05:10:56PM -0800, Andres Freund wrote: > > I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have > > an > > offset parameter. Huh, what lead to the function being so constrained? > > Its current set of uses cases, where we only use it now to initialize > with zeros with WAL segments. If you have a case that plans to use > that stuff with an offset, no problem with me. Then it really shouldn't have been named pg_pwrite_zeros(). The point of the p{write,read}{,v} family of functions is to be able to specify the offset to read/write at. I assume the p is for position, but I'm not sure. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Feb 13, 2023 at 11:09 PM Andres Freund wrote: > > > Later code assigns iov[0].iov_len thus we need to provide a separate > > iov non-const variable, or can we use pwrite instead there? (I didn't > > find pg_pwrite_with_retry(), though) > > Given that we need to do that, and given that we already need to loop to > handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not > worth avoiding iov initialization. > > But I think it's worth limiting the initialization to blocks. We can still optimize away the for loop by using a single iovec for remaining size, like the attached v2 patch. > I'd also try to combine the first pg_writev_* with the second one. Done, PSA v2 patch. On Tue, Feb 14, 2023 at 6:40 AM Andres Freund wrote: > > Hi, > > On 2023-02-12 09:31:36 -0800, Andres Freund wrote: > > Another thing: I think we should either avoid iterating over all the IOVs if > > we don't need them, or, even better, initialize the array as a constant, > > once. > > I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an > offset parameter. Huh, what lead to the function being so constrained? Done, PSA v2 patch. We could do few more things, but honestly I feel they're unnecessary: 1) An assert-only code that checks if the asked file contents are zeroed at the end of pg_pwrite_zeros (to be more defensive, but reading 16MB files and checking if it's zero-filled will surely slowdown the Assert builds). 2) A small test module passing in a file with the size to write isn't multiple of block size, meaning, the code we have in the function to write last remaining bytes (less than BLCKSZ) gets covered which isn't covered right now - https://coverage.postgresql.org/src/common/file_utils.c.gcov.html. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 4fd58060fa3fc301dff2b5fb38c105ad91bc8be4 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Tue, 14 Feb 2023 10:12:34 + Subject: [PATCH v2] Refactor pg_pwrite_zeros() This commit changes pg_pwrite_zeros() as follows: 1. Optimize zero buffer handling in pg_pwrite_zeros() by using static const zero-filled buffer to avoid memset-ing on every call. 2. Optimize zero buffer handling in pg_pwrite_zeros() by using static iovec and point the zero-filled buffer for the first time through to avoid for loop initializing iovec on every call. 3. Add offset parameter to pg_pwrite_zeros() to let callers specify the file position from where to zero-fill. 4. Combine the logic writing the last remaining bytes (less than one full block) into main for loop to avoid code duplication. Suggested-by: Andres Freund Author: Bharath Rupireddy Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.d --- src/backend/access/transam/xlog.c | 2 +- src/bin/pg_basebackup/walmethods.c | 2 +- src/common/file_utils.c| 125 ++--- src/include/common/file_utils.h| 2 +- 4 files changed, 82 insertions(+), 49 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f9f0f6db8d..786b26054c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2981,7 +2981,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ - rc = pg_pwrite_zeros(fd, wal_segment_size); + rc = pg_pwrite_zeros(fd, wal_segment_size, 0); if (rc < 0) save_errno = errno; diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index 54014e2b84..6d14b988cb 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -222,7 +222,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, { ssize_t rc; - rc = pg_pwrite_zeros(fd, pad_to_size); + rc = pg_pwrite_zeros(fd, pad_to_size, 0); if (rc < 0) { diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 4dae534152..7faa7a72f6 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -531,72 +531,105 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) /* * pg_pwrite_zeros * - * Writes zeros to file worth "size" bytes, using vectored I/O. + * Writes zeros to file worth "size" bytes from starting file offset 'offset', + * using vectored I/O. * * Returns the total amount of data written. On failure, a negative value * is returned with errno set. + * + * Note that pg_pwrite(), which gets called from this function, may change the + * file position (see win32pwrite.c). If needed, callers must deal with it. */ ssize_t -pg_pwrite_zeros(int fd, size_t size) +pg_pwrite_zeros(int fd, size_t size, off_t offset) { - PGAlignedBlock zbuffer; /* worth BLCKSZ */ - size_t zbuffer_sz; - struct
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Feb 13, 2023 at 05:10:56PM -0800, Andres Freund wrote: > I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an > offset parameter. Huh, what lead to the function being so constrained? Its current set of uses cases, where we only use it now to initialize with zeros with WAL segments. If you have a case that plans to use that stuff with an offset, no problem with me. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-12 09:31:36 -0800, Andres Freund wrote: > Another thing: I think we should either avoid iterating over all the IOVs if > we don't need them, or, even better, initialize the array as a constant, once. I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an offset parameter. Huh, what lead to the function being so constrained? - Andres
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On 2023-02-13 18:33:34 +0900, Kyotaro Horiguchi wrote: > At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy > wrote in > > On Sun, Feb 12, 2023 at 11:01 PM Andres Freund wrote: > > > > > > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote: > > > > /* Prepare to write out a lot of copies of our zero buffer at > > > > once. */ > > > > for (i = 0; i < lengthof(iov); ++i) > > > > { > > > > - iov[i].iov_base = zbuffer.data; > > > > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, > > > > )->data); > > > > iov[i].iov_len = zbuffer_sz; > > > > } > > > > > > Another thing: I think we should either avoid iterating over all the IOVs > > > if > > > we don't need them, or, even better, initialize the array as a constant, > > > once. > > FWIW, I tried to use the "{[start .. end] = {}}" trick (GNU extension? > [*1]) for constant array initialization, but individual members don't > accept assigning a const value, thus I did deconstify as the follows. > > > static const struct iovec iov[PG_IOV_MAX] = > > {[0 ... PG_IOV_MAX - 1] = > > { > > .iov_base = (void *), > > .iov_len = BLCKSZ > > } > > }; > > I didn't checked the actual mapping, but if I tried an assignment > "iov[0].iov_base = NULL", it failed as "assignment of member > ‘iov_base’ in read-only object", so is it successfully placed in a > read-only segment? > > Later code assigns iov[0].iov_len thus we need to provide a separate > iov non-const variable, or can we use pwrite instead there? (I didn't > find pg_pwrite_with_retry(), though) Given that we need to do that, and given that we already need to loop to handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not worth avoiding iov initialization. But I think it's worth limiting the initialization to blocks. I'd also try to combine the first pg_writev_* with the second one. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy wrote in > On Sun, Feb 12, 2023 at 11:01 PM Andres Freund wrote: > > > > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote: > > > /* Prepare to write out a lot of copies of our zero buffer at once. > > > */ > > > for (i = 0; i < lengthof(iov); ++i) > > > { > > > - iov[i].iov_base = zbuffer.data; > > > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, > > > )->data); > > > iov[i].iov_len = zbuffer_sz; > > > } > > > > Another thing: I think we should either avoid iterating over all the IOVs if > > we don't need them, or, even better, initialize the array as a constant, > > once. FWIW, I tried to use the "{[start .. end] = {}}" trick (GNU extension? [*1]) for constant array initialization, but individual members don't accept assigning a const value, thus I did deconstify as the follows. > static const struct iovec iov[PG_IOV_MAX] = > {[0 ... PG_IOV_MAX - 1] = >{ >.iov_base = (void *), >.iov_len = BLCKSZ >} > }; I didn't checked the actual mapping, but if I tried an assignment "iov[0].iov_base = NULL", it failed as "assignment of member ‘iov_base’ in read-only object", so is it successfully placed in a read-only segment? Later code assigns iov[0].iov_len thus we need to provide a separate iov non-const variable, or can we use pwrite instead there? (I didn't find pg_pwrite_with_retry(), though) > How about like the attached patch? It makes the iovec static variable > and points the zero buffer only once/for the first time to iovec. This > avoids for-loop on every call. As the patch itself, it seems forgetting to reset iov[0].iov_len after writing a partial block. retards. *1: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Designated-Inits.html -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Feb 12, 2023 at 11:01 PM Andres Freund wrote: > > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote: > > Thanks for looking at it. We know that we don't change the zbuffer in > > the function, so can we avoid static const and have just a static > > variable, like the attached > > v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do > > you see any problem with it? > > Making it static const is better, because it allows the memory for the > variable to be put in a readonly section. Okay. > > /* Prepare to write out a lot of copies of our zero buffer at once. */ > > for (i = 0; i < lengthof(iov); ++i) > > { > > - iov[i].iov_base = zbuffer.data; > > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, > > )->data); > > iov[i].iov_len = zbuffer_sz; > > } > > Another thing: I think we should either avoid iterating over all the IOVs if > we don't need them, or, even better, initialize the array as a constant, once. How about like the attached patch? It makes the iovec static variable and points the zero buffer only once/for the first time to iovec. This avoids for-loop on every call. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 550b606affb614980a4b814d9e44e73d5489 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Mon, 13 Feb 2023 04:25:34 + Subject: [PATCH v1] Optimize zero buffer handling in pg_pwrite_zeros() 1. Use static const zero-filled buffer to avoid memset-ing on every call. 2. Use static iovec and point the zero-filled buffer for the first time through to avoid for loop initializing iovec on every call. Reported-by: Andres Freund Author: Bharath Rupireddy Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.de --- src/common/file_utils.c | 34 ++ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 4dae534152..08688eb1b1 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -539,25 +539,35 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) ssize_t pg_pwrite_zeros(int fd, size_t size) { - PGAlignedBlock zbuffer; /* worth BLCKSZ */ - size_t zbuffer_sz; - struct iovec iov[PG_IOV_MAX]; + /* + * Zero-filled buffer worth BLCKSZ. We use static variable to avoid zeroing + * the buffer on every call. Also, with the const qualifier, it is ensured + * that the buffer is placed in read-only section of the process's memory. + */ + static const PGAlignedBlock zbuffer = {0}; + static size_t zbuffer_sz = BLCKSZ; + static bool first_time = true; + static struct iovec iov[PG_IOV_MAX] = {0}; int blocks; size_t remaining_size = 0; int i; ssize_t written; ssize_t total_written = 0; - zbuffer_sz = sizeof(zbuffer.data); - - /* Zero-fill the buffer. */ - memset(zbuffer.data, 0, zbuffer_sz); - - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (i = 0; i < lengthof(iov); ++i) + /* + * If first time through, point the zero buffer to iovec structure. This + * avoids for loop on every call. + */ + if (first_time) { - iov[i].iov_base = zbuffer.data; - iov[i].iov_len = zbuffer_sz; + /* Prepare to write out a lot of copies of our zero buffer at once. */ + for (i = 0; i < lengthof(iov); ++i) + { + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, )->data); + iov[i].iov_len = zbuffer_sz; + } + + first_time = false; } /* Loop, writing as many blocks as we can for each system call. */ -- 2.34.1
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Feb 12, 2023 at 09:31:36AM -0800, Andres Freund wrote: > Another thing: I think we should either avoid iterating over all the IOVs if > we don't need them, or, even better, initialize the array as a constant, once. Where you imply that we'd still use memset() once on iov[PG_IOV_MAX], right? -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote: > Thanks for looking at it. We know that we don't change the zbuffer in > the function, so can we avoid static const and have just a static > variable, like the attached > v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do > you see any problem with it? Making it static const is better, because it allows the memory for the variable to be put in a readonly section. > /* Prepare to write out a lot of copies of our zero buffer at once. */ > for (i = 0; i < lengthof(iov); ++i) > { > - iov[i].iov_base = zbuffer.data; > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, > )->data); > iov[i].iov_len = zbuffer_sz; > } Another thing: I think we should either avoid iterating over all the IOVs if we don't need them, or, even better, initialize the array as a constant, once. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Feb 12, 2023 at 4:14 AM Andres Freund wrote: > > > +ssize_t > > +pg_pwrite_zeros(int fd, size_t size) > > +{ > > + PGAlignedBlock zbuffer; > > + size_t zbuffer_sz; > > + struct ioveciov[PG_IOV_MAX]; > > + int blocks; > > + size_t remaining_size = 0; > > + int i; > > + ssize_t written; > > + ssize_t total_written = 0; > > + > > + zbuffer_sz = sizeof(zbuffer.data); > > + > > + /* Zero-fill the buffer. */ > > + memset(zbuffer.data, 0, zbuffer_sz); > > I previously commented on this - why are we memseting a buffer on every call > to this? That's not at all free. > > Something like > static const PGAlignedBlock zerobuf = {0}; > would do the trick. You do need to cast the const away, to assign to > iov_base, but that's not too ugly. Thanks for looking at it. We know that we don't change the zbuffer in the function, so can we avoid static const and have just a static variable, like the attached v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do you see any problem with it? FWIW, it comes out like the attached v1-0001-Use-static-const-variable-to-avoid-memset-calls-i.patch with static const. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 2527c6083bbdf006900be1a1dbf15ed815184d2b Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Sun, 12 Feb 2023 10:58:33 + Subject: [PATCH v1] Use static variable to avoid memset() calls in pg_pwrite_zeros() --- src/common/file_utils.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 4dae534152..0d59488f60 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) ssize_t pg_pwrite_zeros(int fd, size_t size) { - PGAlignedBlock zbuffer; /* worth BLCKSZ */ + /* Zero-filled buffer worth BLCKSZ. */ + static PGAlignedBlock zbuffer = {0}; size_t zbuffer_sz; struct iovec iov[PG_IOV_MAX]; int blocks; @@ -548,10 +549,7 @@ pg_pwrite_zeros(int fd, size_t size) ssize_t written; ssize_t total_written = 0; - zbuffer_sz = sizeof(zbuffer.data); - - /* Zero-fill the buffer. */ - memset(zbuffer.data, 0, zbuffer_sz); + zbuffer_sz = BLCKSZ; /* Prepare to write out a lot of copies of our zero buffer at once. */ for (i = 0; i < lengthof(iov); ++i) -- 2.34.1 From adc5ea83fb6726a5d8a1f3b87d6b3b03550c7fcb Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Sun, 12 Feb 2023 10:55:09 + Subject: [PATCH v1] Use static const variable to avoid memset calls in pg_pwrite_zeros --- src/common/file_utils.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 4dae534152..cd15cc2fdb 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) ssize_t pg_pwrite_zeros(int fd, size_t size) { - PGAlignedBlock zbuffer; /* worth BLCKSZ */ + /* Zero-filled buffer worth BLCKSZ. */ + static const PGAlignedBlock zbuffer = {0}; size_t zbuffer_sz; struct iovec iov[PG_IOV_MAX]; int blocks; @@ -548,15 +549,12 @@ pg_pwrite_zeros(int fd, size_t size) ssize_t written; ssize_t total_written = 0; - zbuffer_sz = sizeof(zbuffer.data); - - /* Zero-fill the buffer. */ - memset(zbuffer.data, 0, zbuffer_sz); + zbuffer_sz = BLCKSZ; /* Prepare to write out a lot of copies of our zero buffer at once. */ for (i = 0; i < lengthof(iov); ++i) { - iov[i].iov_base = zbuffer.data; + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, )->data); iov[i].iov_len = zbuffer_sz; } -- 2.34.1
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2022-11-01 08:32:48 +0530, Bharath Rupireddy wrote: > +/* > + * pg_pwrite_zeros > + * > + * Writes zeros to a given file. Input parameters are "fd" (file descriptor > of > + * the file), "size" (size of the file in bytes). > + * > + * On failure, a negative value is returned and errno is set appropriately so > + * that the caller can use it accordingly. > + */ > +ssize_t > +pg_pwrite_zeros(int fd, size_t size) > +{ > + PGAlignedBlock zbuffer; > + size_t zbuffer_sz; > + struct ioveciov[PG_IOV_MAX]; > + int blocks; > + size_t remaining_size = 0; > + int i; > + ssize_t written; > + ssize_t total_written = 0; > + > + zbuffer_sz = sizeof(zbuffer.data); > + > + /* Zero-fill the buffer. */ > + memset(zbuffer.data, 0, zbuffer_sz); I previously commented on this - why are we memseting a buffer on every call to this? That's not at all free. Something like static const PGAlignedBlock zerobuf = {0}; would do the trick. You do need to cast the const away, to assign to iov_base, but that's not too ugly. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Nov 11, 2022 at 2:12 PM Michael Paquier wrote: > > On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote: > > On Fri, Nov 11, 2022 at 11:14 AM John Naylor < john.nay...@enterprisedb.com> wrote: > >> Was there supposed to be an attachment here? > > > > Nope. The patches have already been committed - > > 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and > > 28cc2976a9cf0ed661dbc55f49f669192cce1c89. > > The committed patches are pretty much the same as the last version > sent on this thread, except that the changes have been split across > the files they locally impact, with a few simplifications tweaks to > the comments. Hencem I did not see any need to send a new version for > this case. Ah, I missed that -- sorry for the noise. -- John Naylor EDB: http://www.enterprisedb.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote: > On Fri, Nov 11, 2022 at 11:14 AM John Naylor > wrote: >> Was there supposed to be an attachment here? > > Nope. The patches have already been committed - > 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and > 28cc2976a9cf0ed661dbc55f49f669192cce1c89. The committed patches are pretty much the same as the last version sent on this thread, except that the changes have been split across the files they locally impact, with a few simplifications tweaks to the comments. Hencem I did not see any need to send a new version for this case. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Nov 11, 2022 at 11:14 AM John Naylor wrote: > > On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier wrote: > > > > So, I have looked at that, and at the end concluded that Andres' > > suggestion to use PGAlignedBlock in pg_write_zeros() will serve better > > in the long run. Thomas has mentioned upthread that some of the > > comments don't need to be that long, so I have tweaked these to be > > minimal, and updated a few more areas. Note that this has been split > > into two commits: one to introduce the new routine in file_utils.c and > > a second for the switch in walmethods.c. > > Was there supposed to be an attachment here? Nope. The patches have already been committed - 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and 28cc2976a9cf0ed661dbc55f49f669192cce1c89. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier wrote: > > So, I have looked at that, and at the end concluded that Andres' > suggestion to use PGAlignedBlock in pg_write_zeros() will serve better > in the long run. Thomas has mentioned upthread that some of the > comments don't need to be that long, so I have tweaked these to be > minimal, and updated a few more areas. Note that this has been split > into two commits: one to introduce the new routine in file_utils.c and > a second for the switch in walmethods.c. Was there supposed to be an attachment here? -- John Naylor EDB: http://www.enterprisedb.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Nov 01, 2022 at 08:32:48AM +0530, Bharath Rupireddy wrote: > I'm attaching the v9 patch from upthread here again for further review > and to make CF bot happy. So, I have looked at that, and at the end concluded that Andres' suggestion to use PGAlignedBlock in pg_write_zeros() will serve better in the long run. Thomas has mentioned upthread that some of the comments don't need to be that long, so I have tweaked these to be minimal, and updated a few more areas. Note that this has been split into two commits: one to introduce the new routine in file_utils.c and a second for the switch in walmethods.c. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Oct 31, 2022 at 11:50 AM Bharath Rupireddy wrote: > > On Mon, Oct 31, 2022 at 5:01 AM Michael Paquier wrote: > > > > On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote: > > > So I'm kinda proposing that we only do the forward struct initialization > > > dance when it really saves on things -- in particular, when it helps > > > avoid or reduce massive indirect header inclusion. > > > > Sure. > > I don't think including pg_iovec.h in file_utils.h is a good idea. I > agree that pg_iovec.h is fairly a small header file but file_utils.h > is included in 21 .c files, as of today and the file_utils.h footprint > might increase in future. Therefore, I'd vote for forward struct > initialization as it is on HEAD today. I'm attaching the v9 patch from upthread here again for further review and to make CF bot happy. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v9-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch Description: Binary data
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Oct 31, 2022 at 5:01 AM Michael Paquier wrote: > > On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote: > > So I'm kinda proposing that we only do the forward struct initialization > > dance when it really saves on things -- in particular, when it helps > > avoid or reduce massive indirect header inclusion. > > Sure. I don't think including pg_iovec.h in file_utils.h is a good idea. I agree that pg_iovec.h is fairly a small header file but file_utils.h is included in 21 .c files, as of today and the file_utils.h footprint might increase in future. Therefore, I'd vote for forward struct initialization as it is on HEAD today. > > extern ssize_t pg_pwritev_with_retry(int fd, > > - const struct iovec *iov, > > + const iovec *iov, > > int iovcnt, > > off_t offset); > > However this still needs to be defined as a struct, no? Yes, we need a struct there because we haven't typedef'ed struct iovec. Also, the patch forgets to remove "port/pg_iovec.h" from file_utils.c -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote: > So I'm kinda proposing that we only do the forward struct initialization > dance when it really saves on things -- in particular, when it helps > avoid or reduce massive indirect header inclusion. Sure. > extern ssize_t pg_pwritev_with_retry(int fd, > - const struct iovec *iov, > + const iovec *iov, > int iovcnt, > off_t offset); However this still needs to be defined as a struct, no? -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On 2022-Oct-27, Michael Paquier wrote: > On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote: > > Looks reasonable to me. > > 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so > applied. Maybe something a bit useless, but while perusing the commits I noticed a forward struct declaration was moved from one file to another; this is claimed to avoid including port/pg_iovec.h in common/file_utils.h. We do that kind of thing in a few places, but in this particular case it seems a bit of a pointless exercise, since pg_iovec.h doesn't include anything else and it's a quite simple header. So I'm kinda proposing that we only do the forward struct initialization dance when it really saves on things -- in particular, when it helps avoid or reduce massive indirect header inclusion. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ >From c170367e5ac7dca2dcd11b5cbb6d351b1b205842 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Sun, 30 Oct 2022 15:22:24 +0100 Subject: [PATCH] No need to avoid including pg_iovec.h --- src/include/common/file_utils.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h index 2c5dbcb0b1..20fe5806fb 100644 --- a/src/include/common/file_utils.h +++ b/src/include/common/file_utils.h @@ -15,6 +15,8 @@ #include +#include "port/pg_iovec.h" + typedef enum PGFileType { PGFILETYPE_ERROR, @@ -24,7 +26,6 @@ typedef enum PGFileType PGFILETYPE_LNK } PGFileType; -struct iovec; /* avoid including port/pg_iovec.h here */ #ifdef FRONTEND extern int fsync_fname(const char *fname, bool isdir); @@ -40,7 +41,7 @@ extern PGFileType get_dirent_type(const char *path, int elevel); extern ssize_t pg_pwritev_with_retry(int fd, - const struct iovec *iov, + const iovec *iov, int iovcnt, off_t offset); -- 2.30.2
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sat, Oct 29, 2022 at 3:37 AM Andres Freund wrote: > > Hi, > > On 2022-10-28 11:09:38 +0900, Michael Paquier wrote: > > On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote: > > > The block sizes don't need to match, do they? As long as the block is > > > properly > > > aligned, we can change the iov_len of the final iov to match whatever the > > > size > > > is being passed in, no? > > > > Hmm. Based on what Bharath has written upthread, it does not seem to > > matter if the size of the aligned block changes, either: > > https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com > > > > I am honestly not sure whether it is a good idea to make file_utils.c > > depend on one of the compile-time page sizes in this routine, be it > > the page size of the WAL page size, as pg_write_zeros() would be used > > for some rather low-level operations. But we could as well just use a > > locally-defined structure with a buffer at 4kB or 8kB and call it a > > day? > > Shrug. I don't think we gain much by having yet another PGAlignedXYZBlock. Hm. I tend to agree with Andres here, i.e. using PGAlignedBlock is sufficient. It seems like we are using PGAlignedBlock for heap, index, history file, fsm, visibility map file pages as well. Please see the attached v9 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 16ab6ae0b00ebcd6af017299a70bb687c83b2885 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Sat, 29 Oct 2022 06:15:36 + Subject: [PATCH v9] Use pg_pwritev_with_retry() instead of write() in walmethods.c Use pg_pwritev_with_retry() while prepadding a WAL segment instead of write() in walmethods.c dir_open_for_write() to avoid partial writes. As the pg_pwritev_with_retry() function uses pwritev, we can avoid explicit lseek() on non-Windows platforms to seek to the beginning of the WAL segment. It looks like on Windows, we need an explicit lseek() call here despite using pwrite() implementation from win32pwrite.c. Otherwise an error occurs. These changes are inline with how core postgres initializes the WAL segment in XLogFileInitInternal(). Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Michael Paquier Reviewed-by: Thomas Munro, Andres Freund Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com --- src/backend/access/transam/xlog.c | 33 ++-- src/bin/pg_basebackup/walmethods.c | 26 +- src/common/file_utils.c| 80 ++ src/include/common/file_utils.h| 3 ++ 4 files changed, 102 insertions(+), 40 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f10effe3a..15dd5ce834 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, bool *added, char *path) { char tmppath[MAXPGPATH]; - PGAlignedXLogBlock zbuffer; XLogSegNo installed_segno; XLogSegNo max_segno; int fd; @@ -2965,14 +2964,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); - memset(zbuffer.data, 0, XLOG_BLCKSZ); - pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); save_errno = 0; if (wal_init_zero) { - struct iovec iov[PG_IOV_MAX]; - int blocks; + ssize_t rc; /* * Zero-fill the file. With this setting, we do this the hard way to @@ -2983,29 +2979,10 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ + rc = pg_pwrite_zeros(fd, wal_segment_size); - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (int i = 0; i < lengthof(iov); ++i) - { - iov[i].iov_base = zbuffer.data; - iov[i].iov_len = XLOG_BLCKSZ; - } - - /* Loop, writing as many blocks as we can for each system call. */ - blocks = wal_segment_size / XLOG_BLCKSZ; - for (int i = 0; i < blocks;) - { - int iovcnt = Min(blocks - i, lengthof(iov)); - off_t offset = i * XLOG_BLCKSZ; - - if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0) - { -save_errno = errno; -break; - } - - i += iovcnt; - } + if (rc < 0) + save_errno = errno; } else { @@ -3014,7 +2991,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * enough. */ errno = 0; - if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1) + if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1) { /* if write didn't set errno, assume no disk space */ save_errno = errno ? errno : ENOSPC; diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2022-10-28 11:09:38 +0900, Michael Paquier wrote: > On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote: > > The block sizes don't need to match, do they? As long as the block is > > properly > > aligned, we can change the iov_len of the final iov to match whatever the > > size > > is being passed in, no? > > Hmm. Based on what Bharath has written upthread, it does not seem to > matter if the size of the aligned block changes, either: > https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com > > I am honestly not sure whether it is a good idea to make file_utils.c > depend on one of the compile-time page sizes in this routine, be it > the page size of the WAL page size, as pg_write_zeros() would be used > for some rather low-level operations. But we could as well just use a > locally-defined structure with a buffer at 4kB or 8kB and call it a > day? Shrug. I don't think we gain much by having yet another PGAlignedXYZBlock. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Oct 28, 2022 at 11:38:51AM +0530, Bharath Rupireddy wrote: > +1. Please see the attached v8 patch. + chardata[PG_WRITE_BLCKSZ]; + double force_align_d; + int64 force_align_i64; +} PGAlignedWriteBlock; I have not checked in details, but that should do the job. Andres, Thomas, Nathan, perhaps you have a different view on the matter? -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Oct 28, 2022 at 7:39 AM Michael Paquier wrote: > > On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote: > > The block sizes don't need to match, do they? As long as the block is properly > > aligned, we can change the iov_len of the final iov to match whatever the size > > is being passed in, no? > > Hmm. Based on what Bharath has written upthread, it does not seem to > matter if the size of the aligned block changes, either: > https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com > > I am honestly not sure whether it is a good idea to make file_utils.c > depend on one of the compile-time page sizes in this routine, be it > the page size of the WAL page size, as pg_write_zeros() would be used > for some rather low-level operations. But we could as well just use a > locally-defined structure with a buffer at 4kB or 8kB and call it a > day? +1. Please see the attached v8 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 88d67c6e3e92a62ac4c783a137759b2f3795cb21 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Fri, 28 Oct 2022 04:44:41 + Subject: [PATCH v8] Use pg_pwritev_with_retry() instead of write() in walmethods.c Use pg_pwritev_with_retry() while prepadding a WAL segment instead of write() in walmethods.c dir_open_for_write() to avoid partial writes. As the pg_pwritev_with_retry() function uses pwritev, we can avoid explicit lseek() on non-Windows platforms to seek to the beginning of the WAL segment. It looks like on Windows, we need an explicit lseek() call here despite using pwrite() implementation from win32pwrite.c. Otherwise an error occurs. These changes are inline with how core postgres initializes the WAL segment in XLogFileInitInternal(). Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Michael Paquier Reviewed-by: Thomas Munro, Andres Freund Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com --- src/backend/access/transam/xlog.c | 33 ++- src/bin/pg_basebackup/walmethods.c | 26 + src/common/file_utils.c| 92 ++ src/include/common/file_utils.h| 3 + 4 files changed, 114 insertions(+), 40 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f10effe3a..15dd5ce834 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, bool *added, char *path) { char tmppath[MAXPGPATH]; - PGAlignedXLogBlock zbuffer; XLogSegNo installed_segno; XLogSegNo max_segno; int fd; @@ -2965,14 +2964,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); - memset(zbuffer.data, 0, XLOG_BLCKSZ); - pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); save_errno = 0; if (wal_init_zero) { - struct iovec iov[PG_IOV_MAX]; - int blocks; + ssize_t rc; /* * Zero-fill the file. With this setting, we do this the hard way to @@ -2983,29 +2979,10 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ + rc = pg_pwrite_zeros(fd, wal_segment_size); - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (int i = 0; i < lengthof(iov); ++i) - { - iov[i].iov_base = zbuffer.data; - iov[i].iov_len = XLOG_BLCKSZ; - } - - /* Loop, writing as many blocks as we can for each system call. */ - blocks = wal_segment_size / XLOG_BLCKSZ; - for (int i = 0; i < blocks;) - { - int iovcnt = Min(blocks - i, lengthof(iov)); - off_t offset = i * XLOG_BLCKSZ; - - if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0) - { -save_errno = errno; -break; - } - - i += iovcnt; - } + if (rc < 0) + save_errno = errno; } else { @@ -3014,7 +2991,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * enough. */ errno = 0; - if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1) + if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1) { /* if write didn't set errno, assume no disk space */ save_errno = errno ? errno : ENOSPC; diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index bc2e83d02b..18bad52c96 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -220,22 +220,24 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, /* Do pre-padding on non-compressed files */ if (pad_to_size && wwmethod->compression_algorithm == PG_COMPRESSION_NONE) { - PGAlignedXLogBlock zerobuf; - int bytes; + ssize_t rc;
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote: > The block sizes don't need to match, do they? As long as the block is properly > aligned, we can change the iov_len of the final iov to match whatever the size > is being passed in, no? Hmm. Based on what Bharath has written upthread, it does not seem to matter if the size of the aligned block changes, either: https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com I am honestly not sure whether it is a good idea to make file_utils.c depend on one of the compile-time page sizes in this routine, be it the page size of the WAL page size, as pg_write_zeros() would be used for some rather low-level operations. But we could as well just use a locally-defined structure with a buffer at 4kB or 8kB and call it a day? -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, Interestingly, I also needed something like pg_pwrite_zeros() today. Exposed via smgr, for more efficient relation extensions. On 2022-10-27 14:54:00 +0900, Michael Paquier wrote: > Regarding 0002, using pg_pwrite_zeros() as a routine name, as > suggested by Thomas, sounds good to me. However, I am not really a > fan of its dependency with PGAlignedXLogBlock, because it should be > able to work with any buffers of any sizes, as long as the input > buffer is aligned, shouldn't it? For example, what about > PGAlignedBlock? So, should we make this more extensible? My guess > would be the addition of the block size and the block pointer to the > arguments of pg_pwrite_zeros(), in combination with a check to make > sure that the input buffer is MAXALIGN()'d (with an Assert() rather > than just an elog/pg_log_error?). I don't like passing in the buffer. That leads to code like in Bharat's latest version, where we now zero that buffer on every invocation of pg_pwrite_zeros() - not at all cheap. And every caller has to have provisions to provide that buffer. The block sizes don't need to match, do they? As long as the block is properly aligned, we can change the iov_len of the final iov to match whatever the size is being passed in, no? Why don't we define a static PGAlignedBlock source_of_zeroes; in file_utils.c, and use that in pg_pwrite_zeros(), being careful to set the iov_len arguments correctly? Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Oct 27, 2022 at 11:24 AM Michael Paquier wrote: > > On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote: > > Looks reasonable to me. > > 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so > applied. Thanks. > Regarding 0002, using pg_pwrite_zeros() as a routine name, as > suggested by Thomas, sounds good to me. Changed. > However, I am not really a > fan of its dependency with PGAlignedXLogBlock, because it should be > able to work with any buffers of any sizes, as long as the input > buffer is aligned, shouldn't it? For example, what about > PGAlignedBlock? So, should we make this more extensible? My guess > would be the addition of the block size and the block pointer to the > arguments of pg_pwrite_zeros(), in combination with a check to make > sure that the input buffer is MAXALIGN()'d (with an Assert() rather > than just an elog/pg_log_error?). +1 to pass in the aligned buffer, its size and an assertion on the buffer size. Please see the attached v7 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 1d81929de5f5b56d39fc7f1273cdb4c1e36337bb Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Thu, 27 Oct 2022 08:51:07 + Subject: [PATCH v7] Use pg_pwritev_with_retry() instead of write() in walmethods.c Use pg_pwritev_with_retry() while prepadding a WAL segment instead of write() in walmethods.c dir_open_for_write() to avoid partial writes. As the pg_pwritev_with_retry() function uses pwritev, we can avoid explicit lseek() on non-Windows platforms to seek to the beginning of the WAL segment. It looks like on Windows, we need an explicit lseek() call here despite using pwrite() implementation from win32pwrite.c. Otherwise an error occurs. These changes are inline with how core postgres initializes the WAL segment in XLogFileInitInternal(). Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Michael Paquier Reviewed-by: Thomas Munro Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com --- src/backend/access/transam/xlog.c | 35 +++--- src/bin/pg_basebackup/walmethods.c | 28 ++- src/common/file_utils.c| 77 ++ src/include/common/file_utils.h| 5 ++ 4 files changed, 105 insertions(+), 40 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f10effe3a..638d6d448b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, bool *added, char *path) { char tmppath[MAXPGPATH]; - PGAlignedXLogBlock zbuffer; XLogSegNo installed_segno; XLogSegNo max_segno; int fd; @@ -2965,14 +2964,12 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); - memset(zbuffer.data, 0, XLOG_BLCKSZ); - pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); save_errno = 0; if (wal_init_zero) { - struct iovec iov[PG_IOV_MAX]; - int blocks; + PGAlignedXLogBlock zbuffer; + ssize_t rc; /* * Zero-fill the file. With this setting, we do this the hard way to @@ -2983,29 +2980,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ + rc = pg_pwrite_zeros(fd, wal_segment_size, zbuffer.data, + sizeof(zbuffer.data)); - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (int i = 0; i < lengthof(iov); ++i) - { - iov[i].iov_base = zbuffer.data; - iov[i].iov_len = XLOG_BLCKSZ; - } - - /* Loop, writing as many blocks as we can for each system call. */ - blocks = wal_segment_size / XLOG_BLCKSZ; - for (int i = 0; i < blocks;) - { - int iovcnt = Min(blocks - i, lengthof(iov)); - off_t offset = i * XLOG_BLCKSZ; - - if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0) - { -save_errno = errno; -break; - } - - i += iovcnt; - } + if (rc < 0) + save_errno = errno; } else { @@ -3014,7 +2993,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * enough. */ errno = 0; - if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1) + if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1) { /* if write didn't set errno, assume no disk space */ save_errno = errno ? errno : ENOSPC; diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index bc2e83d02b..f7af73c01b 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -220,22 +220,26 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, /* Do pre-padding on non-compressed files */ if (pad_to_size
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote: > Looks reasonable to me. 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so applied. Regarding 0002, using pg_pwrite_zeros() as a routine name, as suggested by Thomas, sounds good to me. However, I am not really a fan of its dependency with PGAlignedXLogBlock, because it should be able to work with any buffers of any sizes, as long as the input buffer is aligned, shouldn't it? For example, what about PGAlignedBlock? So, should we make this more extensible? My guess would be the addition of the block size and the block pointer to the arguments of pg_pwrite_zeros(), in combination with a check to make sure that the input buffer is MAXALIGN()'d (with an Assert() rather than just an elog/pg_log_error?). -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Sep 30, 2022 at 08:09:04AM +0530, Bharath Rupireddy wrote: > I'm attaching the v6 patch set, please review it further. Looks reasonable to me. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Sep 30, 2022 at 6:46 AM Thomas Munro wrote: > > +1 for just doing it always, with a one-liner comment like > "pg_pwritev*() might move the file position". No reason to spam the > source tree with more explanations of the exact reason. +1 for resetting the file position in a platform-independent manner. But, a description there won't hurt IMO and it saves time for the hackers who spend time there and think why it's that way. > If someone > ever comes up with another case where p- and non-p- I/O functions are > intermixed and it's really worth saving a system call (don't get me > wrong, we call lseek() an obscene amount elsewhere and I'd like to fix > that, but this case isn't hot?) then I like your idea of a macro to > tell you whether you need to. I don't think we go that route as the code isn't a hot path and an extra system call wouldn't hurt performance much, a comment there should work. > Earlier I wondered why we'd want to include "pg_pwritev" in the name > of this zero-filling function (pwritev being an internal > implementation detail), but now it seems like maybe a good idea > because it highlights the file position portability problem by being a > member of that family of similarly-named functions. Hm. On Thu, Sep 29, 2022 at 10:57 PM Nathan Bossart wrote: > > +iov[0].iov_base = zbuffer.data; > > This seems superfluous, but I don't think it's hurting anything. Yes, I removed it. Adding a comment, something like [1], would make it more verbose, hence I've not added. I'm attaching the v6 patch set, please review it further. [1] /* * Use the first vector buffer to write the remaining size. Note that * zero buffer was already pointed to it above, hence just specifying * the size is enough here. */ -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 7ae3f77c0444e6c5383e13a502457813b850fd08 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Fri, 30 Sep 2022 01:26:03 + Subject: [PATCH v6] Move pg_pwritev_with_retry() to file_utils.c Move pg_pwritev_with_retry() from file/fd.c to common/file_utils.c so that non-backend code or FRONTEND cases can also use it. Author: Bharath Rupireddy Reviewed-by: Thomas Munro Reviewed-by: Nathan Bossart Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw%40mail.gmail.com --- src/backend/storage/file/fd.c | 65 - src/common/file_utils.c | 65 + src/include/common/file_utils.h | 7 src/include/storage/fd.h| 6 --- 4 files changed, 72 insertions(+), 71 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index e4d954578c..4151cafec5 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -93,7 +93,6 @@ #include "common/pg_prng.h" #include "miscadmin.h" #include "pgstat.h" -#include "port/pg_iovec.h" #include "portability/mem.h" #include "postmaster/startup.h" #include "storage/fd.h" @@ -3738,67 +3737,3 @@ data_sync_elevel(int elevel) { return data_sync_retry ? elevel : PANIC; } - -/* - * A convenience wrapper for pg_pwritev() that retries on partial write. If an - * error is returned, it is unspecified how much has been written. - */ -ssize_t -pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) -{ - struct iovec iov_copy[PG_IOV_MAX]; - ssize_t sum = 0; - ssize_t part; - - /* We'd better have space to make a copy, in case we need to retry. */ - if (iovcnt > PG_IOV_MAX) - { - errno = EINVAL; - return -1; - } - - for (;;) - { - /* Write as much as we can. */ - part = pg_pwritev(fd, iov, iovcnt, offset); - if (part < 0) - return -1; - -#ifdef SIMULATE_SHORT_WRITE - part = Min(part, 4096); -#endif - - /* Count our progress. */ - sum += part; - offset += part; - - /* Step over iovecs that are done. */ - while (iovcnt > 0 && iov->iov_len <= part) - { - part -= iov->iov_len; - ++iov; - --iovcnt; - } - - /* Are they all done? */ - if (iovcnt == 0) - { - /* We don't expect the kernel to write more than requested. */ - Assert(part == 0); - break; - } - - /* - * Move whatever's left to the front of our mutable copy and adjust - * the leading iovec. - */ - Assert(iovcnt > 0); - memmove(iov_copy, iov, sizeof(*iov) * iovcnt); - Assert(iov->iov_len > part); - iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part; - iov_copy[0].iov_len -= part; - iov = iov_copy; - } - - return sum; -} diff --git a/src/common/file_utils.c b/src/common/file_utils.c index df4d6d240c..4a4bdc31c4 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -28,6 +28,7 @@ #ifdef FRONTEND #include "common/logging.h" #endif +#include "port/pg_iovec.h" #ifdef FRONTEND @@ -460,3 +461,67 @@ get_dirent_type(const char *path, return result; } + +/* + * A
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Sep 30, 2022 at 6:27 AM Nathan Bossart wrote: > On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote: > > On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart > > wrote: > >> > >> +PGAlignedXLogBlock zbuffer; > >> + > >> +memset(zbuffer.data, 0, XLOG_BLCKSZ); > >> > >> This seems excessive for only writing a single byte. > > > > Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1, > > wal_segment_size - 1). > > I don't think removing the use of PGAlignedXLogBlock here introduces any > sort of alignment risk, so this should be alright. > > +#ifdef WIN32 > +/* > + * WriteFile() on Windows changes the current file position, hence we > + * need an explicit lseek() here. See pg_pwrite() implementation in > + * win32pwrite.c for more details. > + */ > > Should we really surround this with a WIN32 check, or should we just > unconditionally lseek() here? I understand that this helps avoid an extra > system call on many platforms, but in theory another platform introduced in > the future could have the same problem, and this seems like something that > could easily be missed. Presumably we could do something fancier to > indicate pg_pwrite()'s behavior in this regard, but I don't know if that > sort of complexity is really worth it in order to save an lseek(). +1 for just doing it always, with a one-liner comment like "pg_pwritev*() might move the file position". No reason to spam the source tree with more explanations of the exact reason. If someone ever comes up with another case where p- and non-p- I/O functions are intermixed and it's really worth saving a system call (don't get me wrong, we call lseek() an obscene amount elsewhere and I'd like to fix that, but this case isn't hot?) then I like your idea of a macro to tell you whether you need to. Earlier I wondered why we'd want to include "pg_pwritev" in the name of this zero-filling function (pwritev being an internal implementation detail), but now it seems like maybe a good idea because it highlights the file position portability problem by being a member of that family of similarly-named functions.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote: > On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart > wrote: >> >> +PGAlignedXLogBlock zbuffer; >> + >> +memset(zbuffer.data, 0, XLOG_BLCKSZ); >> >> This seems excessive for only writing a single byte. > > Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1, > wal_segment_size - 1). I don't think removing the use of PGAlignedXLogBlock here introduces any sort of alignment risk, so this should be alright. +#ifdef WIN32 +/* + * WriteFile() on Windows changes the current file position, hence we + * need an explicit lseek() here. See pg_pwrite() implementation in + * win32pwrite.c for more details. + */ Should we really surround this with a WIN32 check, or should we just unconditionally lseek() here? I understand that this helps avoid an extra system call on many platforms, but in theory another platform introduced in the future could have the same problem, and this seems like something that could easily be missed. Presumably we could do something fancier to indicate pg_pwrite()'s behavior in this regard, but I don't know if that sort of complexity is really worth it in order to save an lseek(). +iov[0].iov_base = zbuffer.data; This seems superfluous, but I don't think it's hurting anything. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart wrote: > > +PGAlignedXLogBlock zbuffer; > + > +memset(zbuffer.data, 0, XLOG_BLCKSZ); > > This seems excessive for only writing a single byte. Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1, wal_segment_size - 1). > +#ifdef WIN32 > +/* > + * XXX: It looks like on Windows, we need an explicit lseek() call > here > + * despite using pwrite() implementation from win32pwrite.c. > Otherwise > + * an error occurs. > + */ > > I think this comment is too vague. Can we describe the error in more > detail? Or better yet, can we fix it as a prerequisite to this patch set? The commit b6d8a60aba322678585ebe11dab072a37ac32905 brings back pg_pwrite() and its friends. This puts the responsibility of doing lseek(SEEK_SET) on the callers if they wish to. Please see the v5 patch set and review it further. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v5-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patch Description: Binary data v5-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch Description: Binary data
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Wed, Sep 28, 2022 at 5:43 PM Bharath Rupireddy wrote: > On Wed, Sep 28, 2022 at 12:32 AM Thomas Munro wrote: > > FWIW I doubt the OS itself will change released > > behaviour like that, but I did complain about the straight-up > > misleading documentation and they agreed and fixed it[1]. > > > > [1] https://github.com/MicrosoftDocs/sdk-api/pull/1309 > > Great! Is there any plan to request for change in WriteFile() to not > alter file position? Not from me. I stick to open source problems. Reporting bugs in documentation is legitimate self defence, though.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Wed, Sep 28, 2022 at 12:32 AM Thomas Munro wrote: > > On Wed, Sep 28, 2022 at 1:03 AM Bharath Rupireddy > wrote:> > > On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro wrote: > > > Something like the attached. > > > > Isn't it also better to add notes in win32pread.c and win32pwrite.c > > about the callers doing lseek(SEEK_SET) if they wish to and Windows > > implementations changing file position? We can also add a TODO item > > about replacing pg_ versions with pread and friends someday when > > Windows fixes the issue? Having it in the commit and include/port.h is > > good, but the comments in win32pread.c and win32pwrite.c make life > > easier IMO. Otherwise, the patch LGTM. > > Thanks, will do. I'm looking forward to getting it in. > FWIW I doubt the OS itself will change released > behaviour like that, but I did complain about the straight-up > misleading documentation and they agreed and fixed it[1]. > > [1] https://github.com/MicrosoftDocs/sdk-api/pull/1309 Great! Is there any plan to request for change in WriteFile() to not alter file position? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Wed, Sep 28, 2022 at 1:03 AM Bharath Rupireddy wrote:> > On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro wrote: > > Something like the attached. > > Isn't it also better to add notes in win32pread.c and win32pwrite.c > about the callers doing lseek(SEEK_SET) if they wish to and Windows > implementations changing file position? We can also add a TODO item > about replacing pg_ versions with pread and friends someday when > Windows fixes the issue? Having it in the commit and include/port.h is > good, but the comments in win32pread.c and win32pwrite.c make life > easier IMO. Otherwise, the patch LGTM. Thanks, will do. FWIW I doubt the OS itself will change released behaviour like that, but I did complain about the straight-up misleading documentation and they agreed and fixed it[1]. After some looking around, the only way I could find to avoid this behaviour is to switch to async handles, which do behave as documented in this respect, but then you can't use plain read/write at all unless you write replacement wrappers for those too, and AFAICT that adds more system calls (start write, wait for write to finish) and complexity/weirdness without any real payoff so it seems like the wrong direction, at least without more research that I'm not pursuing currently. (FWIW in AIO porting experiments a while back we used async handles to get IOs running concurrently with CPU work and possibly other IOs so it was actually worth doing, but that was only for smgr and the main wal writing code where there's no intermixed plain read/write calls as you have here, so the problem doesn't even come up.) [1] https://github.com/MicrosoftDocs/sdk-api/pull/1309
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro wrote: > > Something like the attached. Isn't it also better to add notes in win32pread.c and win32pwrite.c about the callers doing lseek(SEEK_SET) if they wish to and Windows implementations changing file position? We can also add a TODO item about replacing pg_ versions with pread and friends someday when Windows fixes the issue? Having it in the commit and include/port.h is good, but the comments in win32pread.c and win32pwrite.c make life easier IMO. Otherwise, the patch LGTM. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 27, 2022 at 6:43 PM Bharath Rupireddy wrote: > On Tue, Sep 27, 2022 at 3:08 AM Thomas Munro wrote: > > > > I don't think so, that's an extra kernel call. I think I'll just have > > to revert part of my recent change that removed the pg_ prefix from > > those function names in our code, and restore the comment that warns > > you about the portability hazard (I thought it went away with HP-UX > > 10, where we were literally calling lseek() before every write()). > > The majority of users of these functions don't intermix them with > > calls to read()/write(), so they don't care about the file position, > > so I think it's just something we'll have to continue to be mindful of > > in the places that do. > > Yes, all of the existing pwrite() callers don't care about the file > position, but the new callers such as the actual idea and patch > proposed here in this thread cares. > > Is this the commit cf112c122060568aa06efe4e6e6fb9b2dd4f1090 part of > which [1] you're planning to revert? Yeah, just the renaming parts of that. The lseek()-based emulation is definitely not coming back. Something like the attached. From fc72be32668f0c37b899fd922d6986f39d8a6a2a Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 27 Sep 2022 21:12:03 +1300 Subject: [PATCH] Restore pg_pread and friends. Commit cf112c12 was a little too hasty in getting rid of the pg_ prefixes where we use pread(), pwrite() and vectored variants. We dropped support for ancient Unixes that needed to use lseek() to implement replacements for those, but it turned out that Windows also changes the current position even when you pass in an offset to ReadFile() and WriteFile(), if the file handle is synchronous. Switching to asynchronous file handles would fix that, but have other complications. For now let's just put back the pg_ prefix and add some comments to highlight the non-standard side-effect. Reported-by: Bharath Rupireddy Discussion: https://postgr.es/m/20220923202439.GA1156054%40nathanxps13 --- .../pg_stat_statements/pg_stat_statements.c | 4 +- src/backend/access/heap/rewriteheap.c | 2 +- src/backend/access/transam/slru.c | 4 +- src/backend/access/transam/xlog.c | 4 +- src/backend/access/transam/xlogreader.c | 2 +- src/backend/access/transam/xlogrecovery.c | 2 +- src/backend/backup/basebackup.c | 2 +- src/backend/replication/walreceiver.c | 2 +- src/backend/storage/file/fd.c | 8 +-- src/backend/utils/init/miscinit.c | 2 +- src/bin/pg_test_fsync/pg_test_fsync.c | 50 +-- src/include/access/xlogreader.h | 4 +- src/include/port.h| 9 src/include/port/pg_iovec.h | 13 - src/include/port/win32_port.h | 4 +- src/port/preadv.c | 4 +- src/port/pwritev.c| 4 +- src/port/win32pread.c | 2 +- src/port/win32pwrite.c| 2 +- 19 files changed, 71 insertions(+), 53 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index ba868f0de9..73439c0199 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -2103,9 +2103,9 @@ qtext_store(const char *query, int query_len, if (fd < 0) goto error; - if (pwrite(fd, query, query_len, off) != query_len) + if (pg_pwrite(fd, query, query_len, off) != query_len) goto error; - if (pwrite(fd, "\0", 1, off + query_len) != 1) + if (pg_pwrite(fd, "\0", 1, off + query_len) != 1) goto error; CloseTransientFile(fd); diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 2f08fbe8d3..b01b39b008 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -1150,7 +1150,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r) /* write out tail end of mapping file (again) */ errno = 0; pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE); - if (pwrite(fd, data, len, xlrec->offset) != len) + if (pg_pwrite(fd, data, len, xlrec->offset) != len) { /* if write didn't set errno, assume problem is no disk space */ if (errno == 0) diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index c9a7b97949..b65cb49d7f 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -718,7 +718,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno) errno = 0; pgstat_report_wait_start(WAIT_EVENT_SLRU_READ); - if (pread(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ) + if (pg_pread(fd,
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 27, 2022 at 3:08 AM Thomas Munro wrote: > > I don't think so, that's an extra kernel call. I think I'll just have > to revert part of my recent change that removed the pg_ prefix from > those function names in our code, and restore the comment that warns > you about the portability hazard (I thought it went away with HP-UX > 10, where we were literally calling lseek() before every write()). > The majority of users of these functions don't intermix them with > calls to read()/write(), so they don't care about the file position, > so I think it's just something we'll have to continue to be mindful of > in the places that do. Yes, all of the existing pwrite() callers don't care about the file position, but the new callers such as the actual idea and patch proposed here in this thread cares. Is this the commit cf112c122060568aa06efe4e6e6fb9b2dd4f1090 part of which [1] you're planning to revert? If so, will we have the pg_{pwrite, pread} back? IIUC, we don't have lseek(SEEK_SET) in pg_{pwrite, pread} right? It is the callers responsibility to set the file position correctly if they wish to, isn't it? Oftentimes, developers miss the notes in the function comments and use these functions expecting them to not change file position which works well on non-Windows platforms but fails on Windows. This makes me think that we can have pwrite(), pread() introduced by cf112c122060568aa06efe4e6e6fb9b2dd4f1090 as-is and re-introduce pg_{pwrite, pread} with pwrite()/pread()+lseek(SEEK_SET) in win32pwrite.c and win32pread.c. These functions reduce the caller's efforts and reduce the duplicate code. If okay, I'm happy to lend my hands on this patch. Thoughts? > Unless, that is, I can find a way to stop it from doing that... I've > added this to my Windows to-do list. I am going to have a round of > Windows hacking quite soon. Thanks! I think it's time for me to start a new thread just for this to get more attention and opinion from other hackers and not to sidetrack the original idea and patch proposed in this thread. [1] https://github.com/postgres/postgres/blob/REL_14_STABLE/src/port/pwrite.c#L11 -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 27, 2022 at 10:37:38AM +1300, Thomas Munro wrote: > I don't think so, that's an extra kernel call. I think I'll just have > to revert part of my recent change that removed the pg_ prefix from > those function names in our code, and restore the comment that warns > you about the portability hazard (I thought it went away with HP-UX > 10, where we were literally calling lseek() before every write()). > The majority of users of these functions don't intermix them with > calls to read()/write(), so they don't care about the file position, > so I think it's just something we'll have to continue to be mindful of > in the places that do. Ah, you're right, it's probably best to avoid the extra system call for the majority of callers that don't care about the file position. I retract my previous message. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 27, 2022 at 10:27 AM Nathan Bossart wrote: > On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote: > > Irrespective of what Windows does with file pointers in WriteFile(), > > should we add lseek(SEEK_SET) in our own pwrite()'s implementation, > > something like [5]? This is rather hackish without fully knowing what > > Windows does internally in WriteFile(), but this does fix inherent > > issues that our pwrite() callers (there are quite a number of places > > that use pwrite() and presumes file pointer doesn't change on Windows) > > may have on Windows. See the regression tests passing [6] with the fix > > [5]. > > I think so. I don't see why we would rather have each caller ensure > pwrite() behaves as documented. I don't think so, that's an extra kernel call. I think I'll just have to revert part of my recent change that removed the pg_ prefix from those function names in our code, and restore the comment that warns you about the portability hazard (I thought it went away with HP-UX 10, where we were literally calling lseek() before every write()). The majority of users of these functions don't intermix them with calls to read()/write(), so they don't care about the file position, so I think it's just something we'll have to continue to be mindful of in the places that do. Unless, that is, I can find a way to stop it from doing that... I've added this to my Windows to-do list. I am going to have a round of Windows hacking quite soon.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote: > Irrespective of what Windows does with file pointers in WriteFile(), > should we add lseek(SEEK_SET) in our own pwrite()'s implementation, > something like [5]? This is rather hackish without fully knowing what > Windows does internally in WriteFile(), but this does fix inherent > issues that our pwrite() callers (there are quite a number of places > that use pwrite() and presumes file pointer doesn't change on Windows) > may have on Windows. See the regression tests passing [6] with the fix > [5]. I think so. I don't see why we would rather have each caller ensure pwrite() behaves as documented. > + /* > +* On Windows, it is found that WriteFile() changes the file > pointer and we > +* want pwrite() to not change. Hence, we explicitly reset the > file pointer > +* to beginning of the file. > +*/ > + if (lseek(fd, 0, SEEK_SET) != 0) > + { > + _dosmaperr(GetLastError()); > + return -1; > + } > + > return result; > } Why reset to the beginning of the file? Shouldn't we reset it to what it was before the call to pwrite()? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sat, Sep 24, 2022 at 9:44 AM Thomas Munro wrote: > > Although WriteFile() with a synchronous file handle and an explicit > offset doesn't use the current file position, it appears that it still > changes it. :-( > > You'd think from the documentation[1] that that isn't the case, because it > says: > > "Considerations for working with synchronous file handles: > > * If lpOverlapped is NULL, the write operation starts at the current > file position and WriteFile does not return until the operation is > complete, and the system updates the file pointer before WriteFile > returns. > > * If lpOverlapped is not NULL, the write operation starts at the > offset that is specified in the OVERLAPPED structure and WriteFile > does not return until the write operation is complete. The system > updates the OVERLAPPED Internal and InternalHigh fields before > WriteFile returns." > > So it's explicitly saying the file pointer is updated in the first > bullet point and not the second, but src/port/win32pwrite.c is doing > the second thing. The WriteFile() and pwrite() implementation in win32pwrite.c are changing the file pointer on Windows, see the following and [4] for more details. # Running: pg_basebackup --no-sync -cfast -D C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/backup2 --no-manifest --waldir C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/xlog2 pg_basebackup: offset_before 0 and offset_after 16777216 aren't the same Assertion failed: offset_before == offset_after, file ../src/bin/pg_basebackup/walmethods.c, line 254 Irrespective of what Windows does with file pointers in WriteFile(), should we add lseek(SEEK_SET) in our own pwrite()'s implementation, something like [5]? This is rather hackish without fully knowing what Windows does internally in WriteFile(), but this does fix inherent issues that our pwrite() callers (there are quite a number of places that use pwrite() and presumes file pointer doesn't change on Windows) may have on Windows. See the regression tests passing [6] with the fix [5]. > Yet the following assertion added to Bharath's code > fails[2]: That was a very quick patch though, here's another version adding offset checks and an assertion [3], see the assertion failures here [4]. I also think that we need to discuss this issue separately. Thoughts? > [1] > https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position > [2] > https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup [3] - https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v2 [4] - https://api.cirrus-ci.com/v1/artifact/task/5294264635621376/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup [5] diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c index 7f2e62e8a7..542b548279 100644 --- a/src/port/win32pwrite.c +++ b/src/port/win32pwrite.c @@ -37,5 +37,16 @@ pwrite(int fd, const void *buf, size_t size, off_t offset) return -1; } + /* +* On Windows, it is found that WriteFile() changes the file pointer and we +* want pwrite() to not change. Hence, we explicitly reset the file pointer +* to beginning of the file. +*/ + if (lseek(fd, 0, SEEK_SET) != 0) + { + _dosmaperr(GetLastError()); + return -1; + } + return result; } [6] - https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v3 -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sat, Sep 24, 2022 at 8:24 AM Nathan Bossart wrote: > +#ifdef WIN32 > +/* > + * XXX: It looks like on Windows, we need an explicit lseek() call > here > + * despite using pwrite() implementation from win32pwrite.c. > Otherwise > + * an error occurs. > + */ > > I think this comment is too vague. Can we describe the error in more > detail? Or better yet, can we fix it as a prerequisite to this patch set? Although WriteFile() with a synchronous file handle and an explicit offset doesn't use the current file position, it appears that it still changes it. :-( You'd think from the documentation[1] that that isn't the case, because it says: "Considerations for working with synchronous file handles: * If lpOverlapped is NULL, the write operation starts at the current file position and WriteFile does not return until the operation is complete, and the system updates the file pointer before WriteFile returns. * If lpOverlapped is not NULL, the write operation starts at the offset that is specified in the OVERLAPPED structure and WriteFile does not return until the write operation is complete. The system updates the OVERLAPPED Internal and InternalHigh fields before WriteFile returns." So it's explicitly saying the file pointer is updated in the first bullet point and not the second, but src/port/win32pwrite.c is doing the second thing. Yet the following assertion added to Bharath's code fails[2]: +++ b/src/bin/pg_basebackup/walmethods.c @@ -221,6 +221,10 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, if (pad_to_size && wwmethod->compression_algorithm == PG_COMPRESSION_NONE) { ssize_t rc; + off_t before_offset; + off_t after_offset; + + before_offset = lseek(fd, 0, SEEK_CUR); rc = pg_pwritev_zeros(fd, pad_to_size); @@ -231,6 +235,9 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, return NULL; } + after_offset = lseek(fd, 0, SEEK_CUR); + Assert(before_offset == after_offset); + [1] https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position [2] https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
+PGAlignedXLogBlock zbuffer; + +memset(zbuffer.data, 0, XLOG_BLCKSZ); This seems excessive for only writing a single byte. +#ifdef WIN32 +/* + * XXX: It looks like on Windows, we need an explicit lseek() call here + * despite using pwrite() implementation from win32pwrite.c. Otherwise + * an error occurs. + */ I think this comment is too vague. Can we describe the error in more detail? Or better yet, can we fix it as a prerequisite to this patch set? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Wed, Sep 21, 2022 at 4:30 AM Nathan Bossart wrote: > > 0001 looks reasonable to me. Thanks for reviewing. > +errno = 0; > +rc = pg_pwritev_zeros(fd, pad_to_size); > > Do we need to reset errno? pg_pwritev_zeros() claims to set errno > appropriately. Right, pg_pwritev_zeros(), (rather pg_pwritev_with_retry() ensures that pwritev() or pwrite()) sets the correct errno, please see Thomas's comments [1] as well. Removed it. > +/* > + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if > + * writing more bytes per pg_pwritev_with_retry() call is proven to be more > + * performant. > + */ > +#define PWRITEV_BLCKSZ XLOG_BLCKSZ > > This seems like something we should sort out now instead of leaving as > future work. Given your recent note, I think we should just use > XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance > findings with different buffer sizes. Agreed. Removed the new structure and added a comment. Another change that I had to do was to allow lseek() even after pwrite() (via pg_pwritev_zeros()) on Windows in walmethods.c. Without this, the regression tests start to fail on Windows. And I figured out that it's not an issue with this patch, it looks like an issue with pwrite() implementation in win32pwrite.c, see the failure here [2], I plan to start a separate thread to discuss this. Please review the attached v4 patch set further. [1] https://www.postgresql.org/message-id/CA%2BhUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw%40mail.gmail.com [2] https://github.com/BRupireddy/postgres/tree/use_pwrite_without_lseek_on_windiws -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From c60026519becfc513dc06ddf889caa3b4b0fade3 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Fri, 23 Sep 2022 00:47:56 + Subject: [PATCH v4] Move pg_pwritev_with_retry() to file_utils.c Move pg_pwritev_with_retry() from file/fd.c to common/file_utils.c so that non-backend code or FRONTEND cases can also use it. --- src/backend/storage/file/fd.c | 65 - src/common/file_utils.c | 65 + src/include/common/file_utils.h | 7 src/include/storage/fd.h| 6 --- 4 files changed, 72 insertions(+), 71 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 073dab2be5..75ba178dfb 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -93,7 +93,6 @@ #include "common/pg_prng.h" #include "miscadmin.h" #include "pgstat.h" -#include "port/pg_iovec.h" #include "portability/mem.h" #include "postmaster/startup.h" #include "storage/fd.h" @@ -3738,67 +3737,3 @@ data_sync_elevel(int elevel) { return data_sync_retry ? elevel : PANIC; } - -/* - * A convenience wrapper for pwritev() that retries on partial write. If an - * error is returned, it is unspecified how much has been written. - */ -ssize_t -pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) -{ - struct iovec iov_copy[PG_IOV_MAX]; - ssize_t sum = 0; - ssize_t part; - - /* We'd better have space to make a copy, in case we need to retry. */ - if (iovcnt > PG_IOV_MAX) - { - errno = EINVAL; - return -1; - } - - for (;;) - { - /* Write as much as we can. */ - part = pwritev(fd, iov, iovcnt, offset); - if (part < 0) - return -1; - -#ifdef SIMULATE_SHORT_WRITE - part = Min(part, 4096); -#endif - - /* Count our progress. */ - sum += part; - offset += part; - - /* Step over iovecs that are done. */ - while (iovcnt > 0 && iov->iov_len <= part) - { - part -= iov->iov_len; - ++iov; - --iovcnt; - } - - /* Are they all done? */ - if (iovcnt == 0) - { - /* We don't expect the kernel to write more than requested. */ - Assert(part == 0); - break; - } - - /* - * Move whatever's left to the front of our mutable copy and adjust - * the leading iovec. - */ - Assert(iovcnt > 0); - memmove(iov_copy, iov, sizeof(*iov) * iovcnt); - Assert(iov->iov_len > part); - iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part; - iov_copy[0].iov_len -= part; - iov = iov_copy; - } - - return sum; -} diff --git a/src/common/file_utils.c b/src/common/file_utils.c index df4d6d240c..4af216e56c 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -28,6 +28,7 @@ #ifdef FRONTEND #include "common/logging.h" #endif +#include "port/pg_iovec.h" #ifdef FRONTEND @@ -460,3 +461,67 @@ get_dirent_type(const char *path, return result; } + +/* + * A convenience wrapper for pwritev() that retries on partial write. If an + * error is returned, it is unspecified how much has been written. + */ +ssize_t +pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) +{ + struct iovec iov_copy[PG_IOV_MAX]; + ssize_t sum = 0; + ssize_t part; + + /* We'd better have space to make a copy, in case we need to
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote: > On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: >> I'm attaching v5 patch-set. I've addressed review comments received so >> far and fixed a compiler warning that CF bot complained about. >> >> Please review it further. > > 0001 looks reasonable to me. > > +errno = 0; > +rc = pg_pwritev_zeros(fd, pad_to_size); > > Do we need to reset errno? pg_pwritev_zeros() claims to set errno > appropriately. > > +/* > + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if > + * writing more bytes per pg_pwritev_with_retry() call is proven to be more > + * performant. > + */ > +#define PWRITEV_BLCKSZ XLOG_BLCKSZ > > This seems like something we should sort out now instead of leaving as > future work. Given your recent note, I think we should just use > XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance > findings with different buffer sizes. I also noticed that the latest patch set no longer applies, so I've marked the commitfest entry as waiting-on-author. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: > I'm attaching v5 patch-set. I've addressed review comments received so > far and fixed a compiler warning that CF bot complained about. > > Please review it further. 0001 looks reasonable to me. +errno = 0; +rc = pg_pwritev_zeros(fd, pad_to_size); Do we need to reset errno? pg_pwritev_zeros() claims to set errno appropriately. +/* + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if + * writing more bytes per pg_pwritev_with_retry() call is proven to be more + * performant. + */ +#define PWRITEV_BLCKSZ XLOG_BLCKSZ This seems like something we should sort out now instead of leaving as future work. Given your recent note, I think we should just use XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance findings with different buffer sizes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Aug 8, 2022 at 6:10 PM Bharath Rupireddy wrote: > > I played with a simple insert use-case [1] that generates ~380 WAL > files, with different block sizes. To my surprise, I have not seen any > improvement with larger block sizes. I may be doing something wrong > here, suggestions on to test and see the benefits are welcome. > > > > I think this should also handle the remainder after processing whole > > > blocks, just for completeness. If I call the code as presented with size > > > 8193, I think this code will only write 8192 bytes. > > > > Hm, I will fix it. > > Fixed. > > I'm attaching v5 patch-set. I've addressed review comments received so > far and fixed a compiler warning that CF bot complained about. > > Please review it further. I tried to vary the zero buffer size to see if there's any huge benefit for the WAL-generating queries. Unfortunately, I didn't see any benefit on my dev system (16 vcore, 512GB SSD, 32GB RAM) . The use case I've tried is at [1] and the results are at [2]. Having said that, the use of pg_pwritev_with_retry() in walmethods.c will definitely reduce number of system calls - on HEAD the dir_open_for_write() makes pad_to_size/XLOG_BLCKSZ i.e. 16MB/8KB = 2,048 write() calls and with patch it makes only 64 pg_pwritev_with_retry() calls with XLOG_BLCKSZ zero buffer size. The proposed patches will provide straight 32x reduction in system calls (for pg_receivewal and pg_basebackup) apart from the safety against partial writes. [1] /* built source code with release flags */ ./configure --with-zlib --enable-depend --prefix=$PWD/inst/ --with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2' > install.log && make -j 8 install > install.log 2>&1 & \q ./pg_ctl -D data -l logfile stop rm -rf data /* ensured that nothing exists in OS page cache */ free -m sudo su sync; echo 3 > /proc/sys/vm/drop_caches exit free -m ./initdb -D data ./pg_ctl -D data -l logfile start ./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "64GB";' ./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";' ./psql -d postgres -c 'ALTER SYSTEM SET work_mem = "16MB";' ./psql -d postgres -c 'ALTER SYSTEM SET checkpoint_timeout = "1d";' ./pg_ctl -D data -l logfile restart ./psql -d postgres -c 'create table foo(bar int);' ./psql -d postgres \timing insert into foo select * from generate_series(1, 1); /* this query generates about 385 WAL files, no checkpoint hence no recycle of old WAL files, all new WAL files */ [2] HEAD Time: 84249.535 ms (01:24.250) HEAD with wal_init_zero off Time: 75086.300 ms (01:15.086) #definePWRITEV_BLCKSZXLOG_BLCKSZ Time: 85254.302 ms (01:25.254) #definePWRITEV_BLCKSZ(4 * XLOG_BLCKSZ) Time: 83542.885 ms (01:23.543) #definePWRITEV_BLCKSZ(16 * XLOG_BLCKSZ) Time: 84035.770 ms (01:24.036) #definePWRITEV_BLCKSZ(64 * XLOG_BLCKSZ) Time: 84749.021 ms (01:24.749) #definePWRITEV_BLCKSZ(256 * XLOG_BLCKSZ) Time: 84273.466 ms (01:24.273) #definePWRITEV_BLCKSZ(512 * XLOG_BLCKSZ) Time: 84233.576 ms (01:24.234) -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Aug 7, 2022 at 9:22 PM Bharath Rupireddy wrote: > > On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro wrote: > > > > > A second thing is that pg_pwritev_with_retry_and_write_zeros() is > > > designed to work on WAL segments initialization and it uses > > > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing > > > in its name that tells us so. This makes me question whether > > > file_utils.c is a good location for this second thing. Could a new > > > file be a better location? We have a xlogutils.c in the backend, and > > > a name similar to that in src/common/ would be one possibility. > > > > Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or > > maybe it's OK to use BLCKSZ with a comment to say that it's a bit > > arbitrary, or maybe it's better to define a new zero buffer of some > > arbitrary size just in this code if that is too strange. We could > > experiment with different size buffers to see how it performs, bearing > > in mind that every time we double it you halve the number of system > > calls, but also bearing in mind that at some point it's too much for > > the stack. I can tell you that the way that code works today was not > > really written with performance in mind (unlike, say, the code > > reverted from 9.4 that tried to do this with posix_fallocate()), it > > was just finding an excuse to call pwritev(), to exercise new fallback > > code being committed for use by later AIO stuff (more patches coming > > soon). The retry support was added because it seemed plausible that > > some system out there would start to do short writes as we cranked up > > the sizes for some implementation reason other than ENOSPC, so we > > should make a reusable retry routine. > > Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ > reduces the system calls to half (right now, pg_pwritev_with_retry() > gets called 64 times per 16MB WAL file, it writes in the batches of 32 > blocks per call). > > Is there a ready-to-use tool or script or specific settings for > pgbench (pgbench command line options or GUC settings) that I can play > with to measure the performance? I played with a simple insert use-case [1] that generates ~380 WAL files, with different block sizes. To my surprise, I have not seen any improvement with larger block sizes. I may be doing something wrong here, suggestions on to test and see the benefits are welcome. > > I think this should also handle the remainder after processing whole > > blocks, just for completeness. If I call the code as presented with size > > 8193, I think this code will only write 8192 bytes. > > Hm, I will fix it. Fixed. I'm attaching v5 patch-set. I've addressed review comments received so far and fixed a compiler warning that CF bot complained about. Please review it further. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/ v3-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patch Description: Binary data v3-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch Description: Binary data
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro wrote: > > > A second thing is that pg_pwritev_with_retry_and_write_zeros() is > > designed to work on WAL segments initialization and it uses > > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing > > in its name that tells us so. This makes me question whether > > file_utils.c is a good location for this second thing. Could a new > > file be a better location? We have a xlogutils.c in the backend, and > > a name similar to that in src/common/ would be one possibility. > > Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or > maybe it's OK to use BLCKSZ with a comment to say that it's a bit > arbitrary, or maybe it's better to define a new zero buffer of some > arbitrary size just in this code if that is too strange. We could > experiment with different size buffers to see how it performs, bearing > in mind that every time we double it you halve the number of system > calls, but also bearing in mind that at some point it's too much for > the stack. I can tell you that the way that code works today was not > really written with performance in mind (unlike, say, the code > reverted from 9.4 that tried to do this with posix_fallocate()), it > was just finding an excuse to call pwritev(), to exercise new fallback > code being committed for use by later AIO stuff (more patches coming > soon). The retry support was added because it seemed plausible that > some system out there would start to do short writes as we cranked up > the sizes for some implementation reason other than ENOSPC, so we > should make a reusable retry routine. Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ reduces the system calls to half (right now, pg_pwritev_with_retry() gets called 64 times per 16MB WAL file, it writes in the batches of 32 blocks per call). Is there a ready-to-use tool or script or specific settings for pgbench (pgbench command line options or GUC settings) that I can play with to measure the performance? > I think this should also handle the remainder after processing whole > blocks, just for completeness. If I call the code as presented with size > 8193, I think this code will only write 8192 bytes. Hm, I will fix it. > I think if this ever needs to work on O_DIRECT files there would be an > alignment constraint on the buffer and size, but I don't think we have > to worry about that for now. We can add a comment about the above limitation, if required. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier wrote: > FWIW, when it comes to that we have a couple of routines that just use > '0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it > introduces in fe_utils.c a new wrapper > (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper > (pg_pwritev_with_retry) for pwrite(). What about pg_write_initial_zeros(fd, size)? How it writes zeros (ie using vector I/O, and retrying) seems to be an internal detail, no? "initial" to make it clearer that it's at offset 0, or maybe pg_pwrite_zeros(fd, size, offset). > A second thing is that pg_pwritev_with_retry_and_write_zeros() is > designed to work on WAL segments initialization and it uses > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing > in its name that tells us so. This makes me question whether > file_utils.c is a good location for this second thing. Could a new > file be a better location? We have a xlogutils.c in the backend, and > a name similar to that in src/common/ would be one possibility. Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or maybe it's OK to use BLCKSZ with a comment to say that it's a bit arbitrary, or maybe it's better to define a new zero buffer of some arbitrary size just in this code if that is too strange. We could experiment with different size buffers to see how it performs, bearing in mind that every time we double it you halve the number of system calls, but also bearing in mind that at some point it's too much for the stack. I can tell you that the way that code works today was not really written with performance in mind (unlike, say, the code reverted from 9.4 that tried to do this with posix_fallocate()), it was just finding an excuse to call pwritev(), to exercise new fallback code being committed for use by later AIO stuff (more patches coming soon). The retry support was added because it seemed plausible that some system out there would start to do short writes as we cranked up the sizes for some implementation reason other than ENOSPC, so we should make a reusable retry routine. I think this should also handle the remainder after processing whole blocks, just for completeness. If I call the code as presented with size 8193, I think this code will only write 8192 bytes. I think if this ever needs to work on O_DIRECT files there would be an alignment constraint on the buffer and size, but I don't think we have to worry about that for now.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Aug 07, 2022 at 10:41:49AM +0530, Bharath Rupireddy wrote: > Agree. I separated out the changes. + +/* + * A convenience wrapper for pwritev() that retries on partial write. If an + * error is returned, it is unspecified how much has been written. + */ +ssize_t +pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) If moving this routine, this could use a more explicit description, especially on errno, for example, that could be consumed by the caller on failure to know what's happening. >> +/* >> + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the >> given >> + * file of size total_sz in batches of size block_sz. >> + */ >> +ssize_t >> +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz) >> >> Hmm, why not give it a proper name that says it writes zeroes? > > Done. FWIW, when it comes to that we have a couple of routines that just use '0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it introduces in fe_utils.c a new wrapper (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper (pg_pwritev_with_retry) for pwrite(). A second thing is that pg_pwritev_with_retry_and_write_zeros() is designed to work on WAL segments initialization and it uses XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing in its name that tells us so. This makes me question whether file_utils.c is a good location for this second thing. Could a new file be a better location? We have a xlogutils.c in the backend, and a name similar to that in src/common/ would be one possibility. -- Michael signature.asc Description: PGP signature
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Aug 7, 2022 at 7:43 AM Thomas Munro wrote: > > On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy > wrote: > > On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier wrote: > > Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h > > so that everyone can use it. > > > > > > Thoughts? > > > > > > Makes sense to me for the WAL segment pre-padding initialization, as > > > we still want to point to the beginning of the segment after we are > > > done with the pre-padding, and the code has an extra lseek(). > > > > Thanks. I attached the v1 patch, please review it. > > Hi Bharath, > > +1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the > commit message should say that is happening. Maybe the move should > even be in a separate patch (IMHO it's nice to separate mechanical > change patches from new logic/work patches). Agree. I separated out the changes. > +/* > + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the > given > + * file of size total_sz in batches of size block_sz. > + */ > +ssize_t > +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz) > > Hmm, why not give it a proper name that says it writes zeroes? Done. > Size arguments around syscall-like facilities are usually size_t. > > +memset(zbuffer.data, 0, block_sz); > > I believe the modern fashion as of a couple of weeks ago is to tell > the compiler to zero-initialise. Since it's a union you'd need > designated initialiser syntax, something like zbuffer = { .data = {0} > }? Hm, but we have many places still using memset(). If we were to change these syntaxes, IMO, it must be done separately. > +iov[i].iov_base = zbuffer.data; > +iov[i].iov_len = block_sz; > > How can it be OK to use caller supplied block_sz, when > sizeof(zbuffer.data) is statically determined? What is the point of > this argument? Yes, removed block_sz function parameter. > -dir_data->lasterrno = errno; > +/* If errno isn't set, assume problem is no disk space */ > +dir_data->lasterrno = errno ? errno : ENOSPC; > > This coding pattern is used in places where we blame short writes on > lack of disk space without bothering to retry. But the code used in > this patch already handles short writes: it always retries, until > eventually, if you really are out of disk space, you should get an > actual ENOSPC from the operating system. So I don't think the > guess-it-must-be-ENOSPC technique applies here. Done. Thanks for reviewing. PSA v2 patch-set. Also,I added a CF entry https://commitfest.postgresql.org/39/3803/ to give the patches a chance to get tested. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/ From c862c9d079c3f78d2eb0405d5c4addd93847f102 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Sun, 7 Aug 2022 04:40:10 + Subject: [PATCH v2] Move pg_pwritev_with_retry() to file_utils.c Move pg_pwritev_with_retry() from file/fd.c to common/file_utils.c so that non-backend code or FRONTEND cases can also use it. --- src/backend/storage/file/fd.c | 65 - src/common/file_utils.c | 65 + src/include/common/file_utils.h | 7 src/include/storage/fd.h| 6 --- 4 files changed, 72 insertions(+), 71 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index efb34d4dcb..ac611b836f 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -95,7 +95,6 @@ #include "common/pg_prng.h" #include "miscadmin.h" #include "pgstat.h" -#include "port/pg_iovec.h" #include "portability/mem.h" #include "postmaster/startup.h" #include "storage/fd.h" @@ -3747,67 +3746,3 @@ data_sync_elevel(int elevel) { return data_sync_retry ? elevel : PANIC; } - -/* - * A convenience wrapper for pwritev() that retries on partial write. If an - * error is returned, it is unspecified how much has been written. - */ -ssize_t -pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) -{ - struct iovec iov_copy[PG_IOV_MAX]; - ssize_t sum = 0; - ssize_t part; - - /* We'd better have space to make a copy, in case we need to retry. */ - if (iovcnt > PG_IOV_MAX) - { - errno = EINVAL; - return -1; - } - - for (;;) - { - /* Write as much as we can. */ - part = pwritev(fd, iov, iovcnt, offset); - if (part < 0) - return -1; - -#ifdef SIMULATE_SHORT_WRITE - part = Min(part, 4096); -#endif - - /* Count our progress. */ - sum += part; - offset += part; - - /* Step over iovecs that are done. */ - while (iovcnt > 0 && iov->iov_len <= part) - { - part -= iov->iov_len; - ++iov; - --iovcnt; - } - - /* Are they all done? */ - if (iovcnt == 0) - { - /* We don't expect the kernel to write more than requested. */ - Assert(part == 0); - break; - } - - /* - * Move whatever's left to the front of our mutable copy and adjust - * the leading
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy wrote: > On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier wrote: > Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h > so that everyone can use it. > > > > Thoughts? > > > > Makes sense to me for the WAL segment pre-padding initialization, as > > we still want to point to the beginning of the segment after we are > > done with the pre-padding, and the code has an extra lseek(). > > Thanks. I attached the v1 patch, please review it. Hi Bharath, +1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the commit message should say that is happening. Maybe the move should even be in a separate patch (IMHO it's nice to separate mechanical change patches from new logic/work patches). +/* + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given + * file of size total_sz in batches of size block_sz. + */ +ssize_t +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz) Hmm, why not give it a proper name that says it writes zeroes? Size arguments around syscall-like facilities are usually size_t. +memset(zbuffer.data, 0, block_sz); I believe the modern fashion as of a couple of weeks ago is to tell the compiler to zero-initialise. Since it's a union you'd need designated initialiser syntax, something like zbuffer = { .data = {0} }? +iov[i].iov_base = zbuffer.data; +iov[i].iov_len = block_sz; How can it be OK to use caller supplied block_sz, when sizeof(zbuffer.data) is statically determined? What is the point of this argument? -dir_data->lasterrno = errno; +/* If errno isn't set, assume problem is no disk space */ +dir_data->lasterrno = errno ? errno : ENOSPC; This coding pattern is used in places where we blame short writes on lack of disk space without bothering to retry. But the code used in this patch already handles short writes: it always retries, until eventually, if you really are out of disk space, you should get an actual ENOSPC from the operating system. So I don't think the guess-it-must-be-ENOSPC technique applies here.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier wrote: > > On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote: > > I noticed that dir_open_for_write() in walmethods.c uses write() for > > WAL file initialization (note that this code is used by pg_receivewal > > and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in > > XLogFileInitInternal() to avoid partial writes. Do we need to fix > > this? > > 0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/ Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h so that everyone can use it. > > Thoughts? > > Makes sense to me for the WAL segment pre-padding initialization, as > we still want to point to the beginning of the segment after we are > done with the pre-padding, and the code has an extra lseek(). Thanks. I attached the v1 patch, please review it. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/ v1-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch Description: Binary data
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote: > I noticed that dir_open_for_write() in walmethods.c uses write() for > WAL file initialization (note that this code is used by pg_receivewal > and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in > XLogFileInitInternal() to avoid partial writes. Do we need to fix > this? 0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/ > Thoughts? Makes sense to me for the WAL segment pre-padding initialization, as we still want to point to the beginning of the segment after we are done with the pre-padding, and the code has an extra lseek(). -- Michael signature.asc Description: PGP signature