subject:"Re\: Use pg_pwritev_with_retry\(\) instead of write\(\) in dir_open_for

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-16 Thread Andres Freund

Hi,

On 2023-03-07 15:44:46 +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro  wrote:
> > Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
> > initialised that struct.  I guess it wants {{0}} instead of {0}.
> > Apparently old GCC was wrong about that warning[1], but that system
> > doesn't have the back-patched fixes?  Not sure.
> 
> Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> have old buggy compilers set to run with -Werror.

I think it's actively harmful to do so. Avoiding warnings on a > 10 year old
compiler a) is a waste of time b) unnecessarily requires making our code
uglier.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-07 Thread Alvaro Herrera

On 2023-Mar-07, Julien Rouhaud wrote:

> I registered lapwing as a 32b Debian 7 so I thought it would be expected to
> keep it as-is rather than upgrading to all newer major Debian versions,
> especially since there were newer debian animal registered (no 32b though
> AFAICS).  I'm not opposed to upgrading it but I think there's still value in
> having somewhat old packages versions being tested, especially since there
> isn't much 32b coverage of those.  I would be happy to register a newer 32b
> version, or even sid, if needed but the -m32 part on the CI makes me think
> there isn't much value doing that now.

I think a pure 32bit animal running contemporary Debian would be better
than just ditching the animal completely, as would appear to be the
alternative, precisely because we have no other 32bit machine running
x86 Linux.

Maybe you can have *two* animals on the same machine: one running the
old Debian without -Werror, and the one with new Debian and that flag
kept.

I think CI is not a replacement for the buildfarm.  It helps catche some
problems earlier, but we shouldn't think that we no longer need some
buildfarm animals because CI runs those configs.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro

On Tue, Mar 7, 2023 at 7:47 PM Julien Rouhaud  wrote:
> I registered lapwing as a 32b Debian 7 so I thought it would be expected to
> keep it as-is rather than upgrading to all newer major Debian versions,
> especially since there were newer debian animal registered (no 32b though
> AFAICS).

Animals do get upgraded: see the "w. e. f." ("with effect from") line
in https://buildfarm.postgresql.org/cgi-bin/show_members.pl which
comes from people running something like ./update_personality.pl
--os-version "11" so that it shows up on the website.

>  I'm not opposed to upgrading it but I think there's still value in
> having somewhat old packages versions being tested, especially since there
> isn't much 32b coverage of those.  I would be happy to register a newer 32b
> version, or even sid, if needed but the -m32 part on the CI makes me think
> there isn't much value doing that now.

Totally up to you as an animal zoo keeper but in my humble opinion the
interesting range of Debian releases currently is 11-13, or maybe 10
if you really want to test the LTS/old-stable release (and CI is
testing 11).

> I think this is the first time that a problem raised by -Werror on that old
> animal is actually a false positive, while there were many times it reported
> useful stuff.  Now this has been up for years before we got better CI tooling,
> especially with -m32 support, so there might not be any value to have it
> anymore.  As I mentioned at [1] I don't mind removing it or just work on
> upgrading any dependency (or removing known buggy compiler flags) to keep it
> without being annoying.  In any case I'm usually quite fast at reacting to any
> problem/complaint on that animal, so you don't have to worry about the
> buildfarm being red too long if it came to that.

Yeah, it's given us lots of useful data, thanks.  Personally I would
upgrade it so it keeps telling us useful things but I feel like I've
said enough about that so I'll shut up now :-)  Re: being red too
long... yeah that reminds me, I really need to fix seawasp ASAP...

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Julien Rouhaud

On Tue, Mar 07, 2023 at 07:14:51PM +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier  wrote:
> > On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> > > Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> > > have old buggy compilers set to run with -Werror.
> >
> > Yes, as far as I can see when investigating the issue, this is an old
> > bug of gcc when detecting where the initialization needs to be
> > applied.  And at the same time the fix is deadly simple, so the
> > current statu-quo does not sound that bad to me.  Note that lapwing is
> > one of the only animals testing 32b builds, and it has saved from
> > quite few bugs over the years.
>
> Yeah, but I'm just wondering, why not run a current release on it[1]?
> Debian is one of the few distributions still supporting 32 bit
> kernels, and it's good to test rare things, but AFAIK the primary
> reason we finish up with EOL'd OSes in the 'farm is because they have
> been forgotten (the secondary reason is because they couldn't be
> upgraded because the OS dropped the [micro]architecture).

I registered lapwing as a 32b Debian 7 so I thought it would be expected to
keep it as-is rather than upgrading to all newer major Debian versions,
especially since there were newer debian animal registered (no 32b though
AFAICS).  I'm not opposed to upgrading it but I think there's still value in
having somewhat old packages versions being tested, especially since there
isn't much 32b coverage of those.  I would be happy to register a newer 32b
version, or even sid, if needed but the -m32 part on the CI makes me think
there isn't much value doing that now.

Now about the -Werror:

> BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit
> kernel, which probably doesn't change much at the level we care about,
> so maybe this doesn't matter much... just sharing an observation that
> we're wasting time thinking about an OS release that gave up the ghost
> in 2016, because it is running with -Werror.  *shrug*

I think this is the first time that a problem raised by -Werror on that old
animal is actually a false positive, while there were many times it reported
useful stuff.  Now this has been up for years before we got better CI tooling,
especially with -m32 support, so there might not be any value to have it
anymore.  As I mentioned at [1] I don't mind removing it or just work on
upgrading any dependency (or removing known buggy compiler flags) to keep it
without being annoying.  In any case I'm usually quite fast at reacting to any
problem/complaint on that animal, so you don't have to worry about the
buildfarm being red too long if it came to that.

[1] 
https://www.postgresql.org/message-id/20220921155025.wdixzbrt2uzbi6vz%40jrouhaud

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro

On Tue, Mar 7, 2023 at 5:32 PM Michael Paquier  wrote:
> On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> > Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> > have old buggy compilers set to run with -Werror.
>
> Yes, as far as I can see when investigating the issue, this is an old
> bug of gcc when detecting where the initialization needs to be
> applied.  And at the same time the fix is deadly simple, so the
> current statu-quo does not sound that bad to me.  Note that lapwing is
> one of the only animals testing 32b builds, and it has saved from
> quite few bugs over the years.

Yeah, but I'm just wondering, why not run a current release on it[1]?
Debian is one of the few distributions still supporting 32 bit
kernels, and it's good to test rare things, but AFAIK the primary
reason we finish up with EOL'd OSes in the 'farm is because they have
been forgotten (the secondary reason is because they couldn't be
upgraded because the OS dropped the [micro]architecture).  Unlike
vintage SPARC, actual users might plausibly be running a current
release on a 32 bit Intel system, I guess (maybe on a Quark
microcontroller?)?

BTW CI also tests 32 bit with -m32 on Debian, but with a 64 bit
kernel, which probably doesn't change much at the level we care about,
so maybe this doesn't matter much... just sharing an observation that
we're wasting time thinking about an OS release that gave up the ghost
in 2016, because it is running with -Werror.  *shrug*

[1] https://wiki.debian.org/DebianReleases

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Michael Paquier

On Mon, Mar 06, 2023 at 02:29:50PM -0800, Andres Freund wrote:
> Thanks.

Sure, no problem.  If there is anything else needed for this thread,
feel free to ping me here.
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Michael Paquier

On Tue, Mar 07, 2023 at 03:44:46PM +1300, Thomas Munro wrote:
> On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro  wrote:
>> Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
>> initialised that struct.  I guess it wants {{0}} instead of {0}.
>> Apparently old GCC was wrong about that warning[1], but that system
>> doesn't have the back-patched fixes?  Not sure.

6392f2a was one such case.

> Oh, you already pushed a fix.  But now I'm wondering if it's useful to
> have old buggy compilers set to run with -Werror.

Yes, as far as I can see when investigating the issue, this is an old
bug of gcc when detecting where the initialization needs to be
applied.  And at the same time the fix is deadly simple, so the
current statu-quo does not sound that bad to me.  Note that lapwing is
one of the only animals testing 32b builds, and it has saved from
quite few bugs over the years.
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro

On Tue, Mar 7, 2023 at 3:42 PM Thomas Munro  wrote:
> Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
> initialised that struct.  I guess it wants {{0}} instead of {0}.
> Apparently old GCC was wrong about that warning[1], but that system
> doesn't have the back-patched fixes?  Not sure.

Oh, you already pushed a fix.  But now I'm wondering if it's useful to
have old buggy compilers set to run with -Werror.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Thomas Munro

On Mon, Mar 6, 2023 at 5:30 PM Michael Paquier  wrote:
> On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> > I ran some tests on my dev system [1] and I don't see much difference
> > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> > Andres to keep the code simple and elegant.
>
> This thread has stalled for a couple of weeks, so I have gone back to
> it.  Testing on a tmpfs I am not seeing a difference if performance
> for any of the approaches discussed.  At the end, as I am the one
> behind the introduction of pg_pwrite_zeros(), I have applied v3 after
> switches the size and offset parameters to be the same way as in v4.

Apparently ye olde GCC 4.7 on "lapwing" doesn't like the way you
initialised that struct.  I guess it wants {{0}} instead of {0}.
Apparently old GCC was wrong about that warning[1], but that system
doesn't have the back-patched fixes?  Not sure.

[1] 
https://stackoverflow.com/questions/63355760/how-standard-is-the-0-initializer-in-c89

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-06 Thread Andres Freund

Hi,

On 2023-03-06 13:29:50 +0900, Michael Paquier wrote:
> On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> > I ran some tests on my dev system [1] and I don't see much difference
> > between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> > Andres to keep the code simple and elegant.
>
> This thread has stalled for a couple of weeks, so I have gone back to
> it.  Testing on a tmpfs I am not seeing a difference if performance
> for any of the approaches discussed.  At the end, as I am the one
> behind the introduction of pg_pwrite_zeros(), I have applied v3 after
> switches the size and offset parameters to be the same way as in v4.

Thanks.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-03-05 Thread Michael Paquier

On Mon, Feb 20, 2023 at 01:54:00PM +0530, Bharath Rupireddy wrote:
> I ran some tests on my dev system [1] and I don't see much difference
> between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
> Andres to keep the code simple and elegant.

This thread has stalled for a couple of weeks, so I have gone back to
it.  Testing on a tmpfs I am not seeing a difference if performance
for any of the approaches discussed.  At the end, as I am the one
behind the introduction of pg_pwrite_zeros(), I have applied v3 after 
switches the size and offset parameters to be the same way as in v4.
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-20 Thread Bharath Rupireddy

On Mon, Feb 20, 2023 at 11:03 AM Michael Paquier  wrote:
>
> > I'm inclined to go with my version, with the argument order swapped to
> > Bharath's order.
>
> Okay.  That's fine by me.

I ran some tests on my dev system [1] and I don't see much difference
between v3 and v4. So, +1 for v3 patch (+ argument order swap) from
Andres to keep the code simple and elegant.

[1]
HEAD: 16MB (12.231 ms), 8190 Bytes (0.199 ms), 8192 Bytes (0.176 ms),
1GB (603.668 ms), 10GB (21184.936 ms (00:21.185))
v3 patch: 16MB (12.632 ms), 8190 Bytes (0.183 ms), 8192 Bytes (0.166
ms), 1GB (610.428 ms), 10GB (22647.308 ms (00:22.647))
v4 patch: 16MB (12.044 ms), 8190 Bytes (0.167 ms), 8192 Bytes (0.139
ms), 1GB (603.848 ms), 10GB (21225.331 ms (00:21.225))

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-19 Thread Michael Paquier

On Fri, Feb 17, 2023 at 09:31:14AM -0800, Andres Freund wrote:
> On 2023-02-17 16:19:46 +0900, Michael Paquier wrote:
>> But it looks like I misunderstood what this quote meant compared to
>> what v3 does.  It is true that v3 sets iov_len and iov_base more than
>> needed when writing sizes larger than BLCKSZ.
> 
> I don't think it does for writes larger than BLCKSZ, it just does more for
> writes larger than PG_IKOV_MAX * BLCKSZ. But in those cases CPU time is going
> to be spent elsewhere.

Yep.

>> Seems like you think that it is not really going to matter much to track
>> which iovecs have been already initialized during the first loop on
>> pg_pwritev_with_retry() to keep the code shorter?
> 
> Yes. I'd bet that, in the unlikely case you're going to see any difference at
> all, unconditionally initializing is going to win.
> 
> Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every
> call. Even if we could do some further optimizations of what I did in the
> patch, you can initialize needed IOVs repeatedly a *lot* of times, before it
> shows up...
> 
> I'm inclined to go with my version, with the argument order swapped to
> Bharath's order.

Okay.  That's fine by me.
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-17 Thread Andres Freund

Hi,

On 2023-02-17 16:19:46 +0900, Michael Paquier wrote:
> But it looks like I misunderstood what this quote meant compared to
> what v3 does.  It is true that v3 sets iov_len and iov_base more than
> needed when writing sizes larger than BLCKSZ.

I don't think it does for writes larger than BLCKSZ, it just does more for
writes larger than PG_IKOV_MAX * BLCKSZ. But in those cases CPU time is going
to be spent elsewhere.

> Seems like you think that it is not really going to matter much to track
> which iovecs have been already initialized during the first loop on
> pg_pwritev_with_retry() to keep the code shorter?

Yes. I'd bet that, in the unlikely case you're going to see any difference at
all, unconditionally initializing is going to win.

Right now we memset() 8KB, and iterate over 32 IOVs, unconditionally, on every
call. Even if we could do some further optimizations of what I did in the
patch, you can initialize needed IOVs repeatedly a *lot* of times, before it
shows up...

I'm inclined to go with my version, with the argument order swapped to
Bharath's order.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-16 Thread Michael Paquier

On Thu, Feb 16, 2023 at 11:00:20AM -0800, Andres Freund wrote:
> I don't really understand this bit?

As of this message, I saw this quote:
https://www.postgresql.org/message-id/fcalj2acxebwy_bm3kmzekypcxsm+ygitpyhi4fdt6msk6yrt...@mail.gmail.com
"However, it increases the number of iovec initialization with zerobuf
for the cases when pg_pwrite_zeros is called for sizes far greater
than BLCKSZ (for instance, WAL file initialization)."

But it looks like I misunderstood what this quote meant compared to
what v3 does.  It is true that v3 sets iov_len and iov_base more than
needed when writing sizes larger than BLCKSZ.  Seems like you think
that it is not really going to matter much to track which iovecs have
been already initialized during the first loop on
pg_pwritev_with_retry() to keep the code shorter?
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-16 Thread Andres Freund

Hi,

On 2023-02-16 16:58:23 +0900, Michael Paquier wrote:
> On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote:
> > The v3 patch reduces initialization of iovec array elements which is a
> > clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ
> > many times (I assume this is what is needed for the relation extension
> > lock improvements feature). However, it increases the number of iovec
> > initialization with zerobuf for the cases when pg_pwrite_zeros is
> > called for sizes far greater than BLCKSZ (for instance, WAL file
> > initialization).

In those cases the cost of initializing the IOV doesn't matter, relative to
the other costs. The important point is to not initialize a lot of elements if
they're not even needed. Because we need to overwrite the trailing iov
element, it doesn't seem worth to try to "pre-initialize" iov.

Referencing a static variable is more expensive than accessing an on-stack
variable. Having a first-call check is more expensive than not having it.

Thus making the iov and zbuf_sz static isn't helpful. Particularly the latter
seems like a bad idea, because it's a compiler constant.

> It seems to me that v3 would do extra initializations only if
> pg_pwritev_with_retry() does *not* retry its writes, but that's not
> the case as it retries on a partial write as per its name.  The number
> of iov buffers is stricly capped by remaining_size.

I don't really understand this bit?

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-15 Thread Michael Paquier

On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote:
> The v3 patch reduces initialization of iovec array elements which is a
> clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ
> many times (I assume this is what is needed for the relation extension
> lock improvements feature). However, it increases the number of iovec
> initialization with zerobuf for the cases when pg_pwrite_zeros is
> called for sizes far greater than BLCKSZ (for instance, WAL file
> initialization).

It seems to me that v3 would do extra initializations only if
pg_pwritev_with_retry() does *not* retry its writes, but that's not
the case as it retries on a partial write as per its name.  The number
of iov buffers is stricly capped by remaining_size.  FWIW, I find v3
proposed more elegant.

> FWIW, I attached v4 patch, a simplified version of the v2  - it
> initializes all the iovec array elements if the total blocks to be
> written crosses lengthof(iovec array), otherwise it initializes only
> the needed blocks.

+   static size_t   zbuf_sz = BLCKSZ;
In v4, what's the advantage of marking that as static?  It could
actually be dangerous if this is carelessly updated.  Well, that's not
the case, still..
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Bharath Rupireddy

On Wed, Feb 15, 2023 at 6:25 AM Andres Freund  wrote:
>
> > Done, PSA v2 patch.
>
> This feels way too complicated to me.  How about something more like the
> attached?

Thanks. I kind of did cost analysis of v2 and v3:

Input: zero-fill a file of size 256*8K.
v2 patch:
iovec initialization with zerobuf for loop - 1 time
zero-fill 32 blocks at once - 8 times
stack memory - sizeof(PGAlignedBlock) + sizeof(struct iovec) * PG_IOV_MAX

v3 patch:
iovec initialization with zerobuf for loop - 8 times (7 times more
than v2 patch)
zero-fill 32 blocks at once - 8 times (no change from v2 patch)
stack memory - sizeof(PGAlignedBlock) + sizeof(struct iovec) *
PG_IOV_MAX (no change from v2 patch)

The v3 patch reduces initialization of iovec array elements which is a
clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ
many times (I assume this is what is needed for the relation extension
lock improvements feature). However, it increases the number of iovec
initialization with zerobuf for the cases when pg_pwrite_zeros is
called for sizes far greater than BLCKSZ (for instance, WAL file
initialization).

FWIW, I attached v4 patch, a simplified version of the v2  - it
initializes all the iovec array elements if the total blocks to be
written crosses lengthof(iovec array), otherwise it initializes only
the needed blocks.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From a941ea3de79cbe890196fa16f60d1e0797a9b119 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 15 Feb 2023 05:17:22 +
Subject: [PATCH v4] Refactor pg_pwrite_zeros()

This commit refactors pg_pwrite_zeros() as follows:

- add offset parameter
- avoid memset() of zerobuf on every call
- don't initialize the whole IOV array unnecessarily
- don't handle the smaller trailing write in a separate write call

Suggested-by: Andres Freund
Author: Bharath Rupireddy
Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.d
---
 src/backend/access/transam/xlog.c  |   2 +-
 src/bin/pg_basebackup/walmethods.c |   2 +-
 src/common/file_utils.c| 102 +++--
 src/include/common/file_utils.h|   2 +-
 4 files changed, 55 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..786b26054c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2981,7 +2981,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
-		rc = pg_pwrite_zeros(fd, wal_segment_size);
+		rc = pg_pwrite_zeros(fd, wal_segment_size, 0);
 
 		if (rc < 0)
 			save_errno = errno;
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 54014e2b84..6d14b988cb 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -222,7 +222,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	{
 		ssize_t		rc;
 
-		rc = pg_pwrite_zeros(fd, pad_to_size);
+		rc = pg_pwrite_zeros(fd, pad_to_size, 0);
 
 		if (rc < 0)
 		{
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..303bb07543 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -531,72 +531,74 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 /*
  * pg_pwrite_zeros
  *
- * Writes zeros to file worth "size" bytes, using vectored I/O.
+ * Writes zeros to file worth "size" bytes from starting file offset 'offset',
+ * using vectored I/O.
  *
  * Returns the total amount of data written.  On failure, a negative value
  * is returned with errno set.
+ *
+ * Note that pg_pwrite(), which gets called from this function, may change the
+ * file position (see win32pwrite.c). If needed, callers must deal with it.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size)
+pg_pwrite_zeros(int fd, size_t size, off_t offset)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
-	size_t		zbuffer_sz;
-	struct iovec iov[PG_IOV_MAX];
-	int			blocks;
-	size_t		remaining_size = 0;
-	int			i;
-	ssize_t		written;
-	ssize_t		total_written = 0;
-
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
-
-	/* Prepare to write out a lot of copies of our zero buffer at once. */
-	for (i = 0; i < lengthof(iov); ++i)
+	/*
+	 * Zero-filled buffer worth BLCKSZ. We use static variable to avoid zeroing
+	 * the buffer on every call. Also, with the const qualifier, it is ensured
+	 * that the buffer is placed in read-only section of the process's memory.
+	 */
+	static const PGAlignedBlock zbuf = {0};
+	static size_t	zbuf_sz = BLCKSZ;
+	struct iovec	iov[PG_IOV_MAX];
+	int	blocks;
+	int	init_upto;
+	int	i;
+	ssize_t	ntotalwritten = 0;
+	size_t

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Andres Freund

Hi

On 2023-02-15 10:28:37 +0900, Kyotaro Horiguchi wrote:
> At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund  wrote 
> in 
> > Hi,
> > 
> > On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> > > Done, PSA v2 patch.
> > 
> > This feels way too complicated to me.  How about something more like the
> > attached?
> 
> I like this one, but the parameters offset and size are in a different
> order from pwrite(fd, buf, count, offset).  I perfer the arrangement
> suggested by Bharath.

Yes, it probably is better. Not sure why I went with that order.


> And isn't it better to use Min(remaining_size, BLCKSZ) instead of a bare if
> statement?

I really can't make myself care about which version is better :)

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Kyotaro Horiguchi

At Tue, 14 Feb 2023 16:55:25 -0800, Andres Freund  wrote in 
> Hi,
> 
> On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> > Done, PSA v2 patch.
> 
> This feels way too complicated to me.  How about something more like the
> attached?

I like this one, but the parameters offset and size are in a different
order from pwrite(fd, buf, count, offset).  I perfer the arrangement
suggested by Bharath. And isn't it better to use Min(remaining_size,
BLCKSZ) instead of a bare if statement?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Michael Paquier

On Tue, Feb 14, 2023 at 04:46:07PM -0800, Andres Freund wrote:
> Then it really shouldn't have been named pg_pwrite_zeros(). The point of the
> p{write,read}{,v} family of functions is to be able to specify the offset to
> read/write at. I assume the p is for position, but I'm not sure.

'p' could stand for POSIX, though both read() and pread() are in it.
Anyway, it looks that your guess may be right:
https://stackoverflow.com/questions/17877556/what-does-p-stand-for-in-function-names-pwrite-and-pread

Even there, people don't seem completely sure.
--
Michael

signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Andres Freund

Hi,

On 2023-02-14 18:00:00 +0530, Bharath Rupireddy wrote:
> On Mon, Feb 13, 2023 at 11:09 PM Andres Freund  wrote:
> >
> > > Later code assigns iov[0].iov_len thus we need to provide a separate
> > > iov non-const variable, or can we use pwrite instead there?  (I didn't
> > > find pg_pwrite_with_retry(), though)
> >
> > Given that we need to do that, and given that we already need to loop to
> > handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not
> > worth avoiding iov initialization.
> >
> > But I think it's worth limiting the initialization to blocks.
>
> We can still optimize away the for loop by using a single iovec for
> remaining size, like the attached v2 patch.
>
> > I'd also try to combine the first pg_writev_* with the second one.
>
> Done, PSA v2 patch.

This feels way too complicated to me.  How about something more like the
attached?


> 2) A small test module passing in a file with the size to write isn't
> multiple of block size, meaning, the code we have in the function to
> write last remaining bytes (less than BLCKSZ) gets covered which isn't
> covered right now -

FWIW, I tested this locally by just specifying a smaller size than BLCKSZ for
the write size.

Greetings,

Andres Freund
>From ba0a8697c639870b76e92a285644a87d36cf4496 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 13 Feb 2023 17:11:29 -0800
Subject: [PATCH v3 01/12] Revise pg_pwrite_zeros()

- add offset parameter
- avoid memset() of zerobuf on every call
- don't initialize the whole IOV array unnecessarily
- don't handle the smaller trailing write in a separate write call
---
 src/include/common/file_utils.h|  2 +-
 src/common/file_utils.c| 67 ++
 src/backend/access/transam/xlog.c  |  2 +-
 src/bin/pg_basebackup/walmethods.c |  2 +-
 4 files changed, 26 insertions(+), 47 deletions(-)

diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index bda6d3a5413..95b5cc8b3a1 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -44,6 +44,6 @@ extern ssize_t pg_pwritev_with_retry(int fd,
 	 int iovcnt,
 	 off_t offset);
 
-extern ssize_t pg_pwrite_zeros(int fd, size_t size);
+extern ssize_t pg_pwrite_zeros(int fd, off_t offset, size_t size);
 
 #endif			/* FILE_UTILS_H */
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152f..a161b1c7bbe 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -537,62 +537,41 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
  * is returned with errno set.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size)
+pg_pwrite_zeros(int fd, off_t offset, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
-	size_t		zbuffer_sz;
+	const static PGAlignedBlock zbuffer = {0};		/* worth BLCKSZ */
+	void *zerobuf_addr = unconstify(PGAlignedBlock *, )->data;
 	struct iovec iov[PG_IOV_MAX];
-	int			blocks;
-	size_t		remaining_size = 0;
-	int			i;
-	ssize_t		written;
+	size_t		remaining_size = size;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
-
-	/* Prepare to write out a lot of copies of our zero buffer at once. */
-	for (i = 0; i < lengthof(iov); ++i)
-	{
-		iov[i].iov_base = zbuffer.data;
-		iov[i].iov_len = zbuffer_sz;
-	}
-
 	/* Loop, writing as many blocks as we can for each system call. */
-	blocks = size / zbuffer_sz;
-	remaining_size = size % zbuffer_sz;
-	for (i = 0; i < blocks;)
+	while (remaining_size > 0)
 	{
-		int			iovcnt = Min(blocks - i, lengthof(iov));
-		off_t		offset = i * zbuffer_sz;
-
-		written = pg_pwritev_with_retry(fd, iov, iovcnt, offset);
-
-		if (written < 0)
-			return written;
-
-		i += iovcnt;
-		total_written += written;
-	}
-
-	/* Now, write the remaining size, if any, of the file with zeros. */
-	if (remaining_size > 0)
-	{
-		/* We'll never write more than one block here */
-		int			iovcnt = 1;
-
-		/* Jump on to the end of previously written blocks */
-		off_t		offset = i * zbuffer_sz;
-
-		iov[0].iov_len = remaining_size;
+		int			iovcnt = 0;
+		ssize_t		written;
+
+		for (; iovcnt < PG_IOV_MAX && remaining_size > 0; iovcnt++)
+		{
+			size_t this_iov_size;
+
+			iov[iovcnt].iov_base = zerobuf_addr;
+
+			if (remaining_size < BLCKSZ)
+this_iov_size = remaining_size;
+			else
+this_iov_size = BLCKSZ;
+
+			iov[iovcnt].iov_len = this_iov_size;
+			remaining_size -= this_iov_size;
+		}
 
 		written = pg_pwritev_with_retry(fd, iov, iovcnt, offset);
 
 		if (written < 0)
 			return written;
 
+		offset += written;
 		total_written += written;
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d1..cffe72e2386 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2981,7 +2981,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Andres Freund

Hi,

On 2023-02-14 16:06:24 +0900, Michael Paquier wrote:
> On Mon, Feb 13, 2023 at 05:10:56PM -0800, Andres Freund wrote:
> > I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have 
> > an
> > offset parameter.  Huh, what lead to the function being so constrained?
> 
> Its current set of uses cases, where we only use it now to initialize
> with zeros with WAL segments.  If you have a case that plans to use
> that stuff with an offset, no problem with me.

Then it really shouldn't have been named pg_pwrite_zeros(). The point of the
p{write,read}{,v} family of functions is to be able to specify the offset to
read/write at. I assume the p is for position, but I'm not sure.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-14 Thread Bharath Rupireddy

On Mon, Feb 13, 2023 at 11:09 PM Andres Freund  wrote:
>
> > Later code assigns iov[0].iov_len thus we need to provide a separate
> > iov non-const variable, or can we use pwrite instead there?  (I didn't
> > find pg_pwrite_with_retry(), though)
>
> Given that we need to do that, and given that we already need to loop to
> handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not
> worth avoiding iov initialization.
>
> But I think it's worth limiting the initialization to blocks.

We can still optimize away the for loop by using a single iovec for
remaining size, like the attached v2 patch.

> I'd also try to combine the first pg_writev_* with the second one.

Done, PSA v2 patch.

On Tue, Feb 14, 2023 at 6:40 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-02-12 09:31:36 -0800, Andres Freund wrote:
> > Another thing: I think we should either avoid iterating over all the IOVs if
> > we don't need them, or, even better, initialize the array as a constant, 
> > once.
>
> I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
> offset parameter.  Huh, what lead to the function being so constrained?

Done, PSA v2 patch.

We could do few more things, but honestly I feel they're unnecessary:
1) An assert-only code that checks if the asked file contents are
zeroed at the end of pg_pwrite_zeros (to be more defensive, but
reading 16MB files and checking if it's zero-filled will surely
slowdown the Assert builds).
2) A small test module passing in a file with the size to write isn't
multiple of block size, meaning, the code we have in the function to
write last remaining bytes (less than BLCKSZ) gets covered which isn't
covered right now -
https://coverage.postgresql.org/src/common/file_utils.c.gcov.html.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 4fd58060fa3fc301dff2b5fb38c105ad91bc8be4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 14 Feb 2023 10:12:34 +
Subject: [PATCH v2] Refactor pg_pwrite_zeros()

This commit changes pg_pwrite_zeros() as follows:

1. Optimize zero buffer handling in pg_pwrite_zeros() by using static
const zero-filled buffer to avoid memset-ing on every call.
2. Optimize zero buffer handling in pg_pwrite_zeros() by using static
iovec and point the zero-filled buffer for the first time through to
avoid for loop initializing iovec on every call.
3. Add offset parameter to pg_pwrite_zeros() to let callers specify the
file position from where to zero-fill.
4. Combine the logic writing the last remaining bytes (less than one
full block) into main for loop to avoid code duplication.

Suggested-by: Andres Freund
Author: Bharath Rupireddy
Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.d
---
 src/backend/access/transam/xlog.c  |   2 +-
 src/bin/pg_basebackup/walmethods.c |   2 +-
 src/common/file_utils.c| 125 ++---
 src/include/common/file_utils.h|   2 +-
 4 files changed, 82 insertions(+), 49 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..786b26054c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2981,7 +2981,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
-		rc = pg_pwrite_zeros(fd, wal_segment_size);
+		rc = pg_pwrite_zeros(fd, wal_segment_size, 0);
 
 		if (rc < 0)
 			save_errno = errno;
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 54014e2b84..6d14b988cb 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -222,7 +222,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	{
 		ssize_t		rc;
 
-		rc = pg_pwrite_zeros(fd, pad_to_size);
+		rc = pg_pwrite_zeros(fd, pad_to_size, 0);
 
 		if (rc < 0)
 		{
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..7faa7a72f6 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -531,72 +531,105 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 /*
  * pg_pwrite_zeros
  *
- * Writes zeros to file worth "size" bytes, using vectored I/O.
+ * Writes zeros to file worth "size" bytes from starting file offset 'offset',
+ * using vectored I/O.
  *
  * Returns the total amount of data written.  On failure, a negative value
  * is returned with errno set.
+ *
+ * Note that pg_pwrite(), which gets called from this function, may change the
+ * file position (see win32pwrite.c). If needed, callers must deal with it.
  */
 ssize_t
-pg_pwrite_zeros(int fd, size_t size)
+pg_pwrite_zeros(int fd, size_t size, off_t offset)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
-	size_t		zbuffer_sz;
-	struct

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-13 Thread Michael Paquier

On Mon, Feb 13, 2023 at 05:10:56PM -0800, Andres Freund wrote:
> I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
> offset parameter.  Huh, what lead to the function being so constrained?

Its current set of uses cases, where we only use it now to initialize
with zeros with WAL segments.  If you have a case that plans to use
that stuff with an offset, no problem with me. 
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-13 Thread Andres Freund

Hi,

On 2023-02-12 09:31:36 -0800, Andres Freund wrote:
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

I just tried to use pg_pwrite_zeros - and couldn't because it doesn't have an
offset parameter.  Huh, what lead to the function being so constrained?

- Andres

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-13 Thread Andres Freund

On 2023-02-13 18:33:34 +0900, Kyotaro Horiguchi wrote:
> At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy 
>  wrote in 
> > On Sun, Feb 12, 2023 at 11:01 PM Andres Freund  wrote:
> > >
> > > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > > >   /* Prepare to write out a lot of copies of our zero buffer at 
> > > > once. */
> > > >   for (i = 0; i < lengthof(iov); ++i)
> > > >   {
> > > > - iov[i].iov_base = zbuffer.data;
> > > > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, 
> > > > )->data);
> > > >   iov[i].iov_len = zbuffer_sz;
> > > >   }
> > >
> > > Another thing: I think we should either avoid iterating over all the IOVs 
> > > if
> > > we don't need them, or, even better, initialize the array as a constant, 
> > > once.
> 
> FWIW, I tried to use the "{[start .. end] = {}}" trick (GNU extension?
> [*1]) for constant array initialization, but individual members don't
> accept assigning a const value, thus I did deconstify as the follows.
> 
> > static const struct iovec   iov[PG_IOV_MAX] =
> > {[0 ... PG_IOV_MAX - 1] =
> >  {
> >  .iov_base = (void *),
> >  .iov_len = BLCKSZ
> >  }
> > };
> 
> I didn't checked the actual mapping, but if I tried an assignment
> "iov[0].iov_base = NULL", it failed as "assignment of member
> ‘iov_base’ in read-only object", so is it successfully placed in a
> read-only segment?
> 
> Later code assigns iov[0].iov_len thus we need to provide a separate
> iov non-const variable, or can we use pwrite instead there?  (I didn't
> find pg_pwrite_with_retry(), though)

Given that we need to do that, and given that we already need to loop to
handle writes that are longer than PG_IOV_MAX * BLCKSZ, it's probably not
worth avoiding iov initialization.

But I think it's worth limiting the initialization to blocks.

I'd also try to combine the first pg_writev_* with the second one.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-13 Thread Kyotaro Horiguchi

At Mon, 13 Feb 2023 10:15:03 +0530, Bharath Rupireddy 
 wrote in 
> On Sun, Feb 12, 2023 at 11:01 PM Andres Freund  wrote:
> >
> > On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > >   /* Prepare to write out a lot of copies of our zero buffer at once. 
> > > */
> > >   for (i = 0; i < lengthof(iov); ++i)
> > >   {
> > > - iov[i].iov_base = zbuffer.data;
> > > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, 
> > > )->data);
> > >   iov[i].iov_len = zbuffer_sz;
> > >   }
> >
> > Another thing: I think we should either avoid iterating over all the IOVs if
> > we don't need them, or, even better, initialize the array as a constant, 
> > once.

FWIW, I tried to use the "{[start .. end] = {}}" trick (GNU extension?
[*1]) for constant array initialization, but individual members don't
accept assigning a const value, thus I did deconstify as the follows.

>   static const struct iovec   iov[PG_IOV_MAX] =
>   {[0 ... PG_IOV_MAX - 1] =
>{
>.iov_base = (void *),
>.iov_len = BLCKSZ
>}
>   };

I didn't checked the actual mapping, but if I tried an assignment
"iov[0].iov_base = NULL", it failed as "assignment of member
‘iov_base’ in read-only object", so is it successfully placed in a
read-only segment?

Later code assigns iov[0].iov_len thus we need to provide a separate
iov non-const variable, or can we use pwrite instead there?  (I didn't
find pg_pwrite_with_retry(), though)

> How about like the attached patch? It makes the iovec static variable
> and points the zero buffer only once/for the first time to iovec. This
> avoids for-loop on every call.

As the patch itself, it seems forgetting to reset iov[0].iov_len after
writing a partial block.

retards.

*1: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Designated-Inits.html

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Bharath Rupireddy

On Sun, Feb 12, 2023 at 11:01 PM Andres Freund  wrote:
>
> On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> > Thanks for looking at it. We know that we don't change the zbuffer in
> > the function, so can we avoid static const and have just a static
> > variable, like the attached
> > v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> > you see any problem with it?
>
> Making it static const is better, because it allows the memory for the
> variable to be put in a readonly section.

Okay.

> >   /* Prepare to write out a lot of copies of our zero buffer at once. */
> >   for (i = 0; i < lengthof(iov); ++i)
> >   {
> > - iov[i].iov_base = zbuffer.data;
> > + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, 
> > )->data);
> >   iov[i].iov_len = zbuffer_sz;
> >   }
>
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

How about like the attached patch? It makes the iovec static variable
and points the zero buffer only once/for the first time to iovec. This
avoids for-loop on every call.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 550b606affb614980a4b814d9e44e73d5489 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 13 Feb 2023 04:25:34 +
Subject: [PATCH v1] Optimize zero buffer handling in pg_pwrite_zeros()

1. Use static const zero-filled buffer to avoid memset-ing on
every call.
2. Use static iovec and point the zero-filled buffer for the
first time through to avoid for loop initializing iovec on every
call.

Reported-by: Andres Freund
Author: Bharath Rupireddy
Discussion: https://www.postgresql.org/message-id/20230211224424.r25uw4rsv6taukxk%40awork3.anarazel.de
---
 src/common/file_utils.c | 34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..08688eb1b1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,25 +539,35 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
-	size_t		zbuffer_sz;
-	struct iovec iov[PG_IOV_MAX];
+	/*
+	 * Zero-filled buffer worth BLCKSZ. We use static variable to avoid zeroing
+	 * the buffer on every call. Also, with the const qualifier, it is ensured
+	 * that the buffer is placed in read-only section of the process's memory.
+	 */
+	static const PGAlignedBlock zbuffer = {0};
+	static size_t	zbuffer_sz = BLCKSZ;
+	static bool	first_time = true;
+	static struct iovec	iov[PG_IOV_MAX] = {0};
 	int			blocks;
 	size_t		remaining_size = 0;
 	int			i;
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
-
-	/* Prepare to write out a lot of copies of our zero buffer at once. */
-	for (i = 0; i < lengthof(iov); ++i)
+	/*
+	 * If first time through, point the zero buffer to iovec structure. This
+	 * avoids for loop on every call.
+	 */
+	if (first_time)
 	{
-		iov[i].iov_base = zbuffer.data;
-		iov[i].iov_len = zbuffer_sz;
+		/* Prepare to write out a lot of copies of our zero buffer at once. */
+		for (i = 0; i < lengthof(iov); ++i)
+		{
+			iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, )->data);
+			iov[i].iov_len = zbuffer_sz;
+		}
+
+		first_time = false;
 	}
 
 	/* Loop, writing as many blocks as we can for each system call. */
-- 
2.34.1

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Michael Paquier

On Sun, Feb 12, 2023 at 09:31:36AM -0800, Andres Freund wrote:
> Another thing: I think we should either avoid iterating over all the IOVs if
> we don't need them, or, even better, initialize the array as a constant, once.

Where you imply that we'd still use memset() once on iov[PG_IOV_MAX],
right?
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Andres Freund

Hi,

On 2023-02-12 19:59:00 +0530, Bharath Rupireddy wrote:
> Thanks for looking at it. We know that we don't change the zbuffer in
> the function, so can we avoid static const and have just a static
> variable, like the attached
> v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
> you see any problem with it?

Making it static const is better, because it allows the memory for the
variable to be put in a readonly section.


>   /* Prepare to write out a lot of copies of our zero buffer at once. */
>   for (i = 0; i < lengthof(iov); ++i)
>   {
> - iov[i].iov_base = zbuffer.data;
> + iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, 
> )->data);
>   iov[i].iov_len = zbuffer_sz;
>   }

Another thing: I think we should either avoid iterating over all the IOVs if
we don't need them, or, even better, initialize the array as a constant, once.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-12 Thread Bharath Rupireddy

On Sun, Feb 12, 2023 at 4:14 AM Andres Freund  wrote:
>
> > +ssize_t
> > +pg_pwrite_zeros(int fd, size_t size)
> > +{
> > + PGAlignedBlock  zbuffer;
> > + size_t  zbuffer_sz;
> > + struct ioveciov[PG_IOV_MAX];
> > + int blocks;
> > + size_t  remaining_size = 0;
> > + int i;
> > + ssize_t written;
> > + ssize_t total_written = 0;
> > +
> > + zbuffer_sz = sizeof(zbuffer.data);
> > +
> > + /* Zero-fill the buffer. */
> > + memset(zbuffer.data, 0, zbuffer_sz);
>
> I previously commented on this - why are we memseting a buffer on every call
> to this?  That's not at all free.
>
> Something like
> static const PGAlignedBlock zerobuf = {0};
> would do the trick.  You do need to cast the const away, to assign to
> iov_base, but that's not too ugly.

Thanks for looking at it. We know that we don't change the zbuffer in
the function, so can we avoid static const and have just a static
variable, like the attached
v1-0001-Use-static-variable-to-avoid-memset-calls-in-pg_p.patch? Do
you see any problem with it?

FWIW, it comes out like the attached
v1-0001-Use-static-const-variable-to-avoid-memset-calls-i.patch with
static const.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 2527c6083bbdf006900be1a1dbf15ed815184d2b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 12 Feb 2023 10:58:33 +
Subject: [PATCH v1] Use static variable to avoid memset() calls in
 pg_pwrite_zeros()

---
 src/common/file_utils.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..0d59488f60 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	/* Zero-filled buffer worth BLCKSZ. */
+	static PGAlignedBlock zbuffer = {0};
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
@@ -548,10 +549,7 @@ pg_pwrite_zeros(int fd, size_t size)
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
+	zbuffer_sz = BLCKSZ;
 
 	/* Prepare to write out a lot of copies of our zero buffer at once. */
 	for (i = 0; i < lengthof(iov); ++i)
-- 
2.34.1

From adc5ea83fb6726a5d8a1f3b87d6b3b03550c7fcb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 12 Feb 2023 10:55:09 +
Subject: [PATCH v1] Use static const variable to avoid memset calls in
 pg_pwrite_zeros

---
 src/common/file_utils.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152..cd15cc2fdb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -539,7 +539,8 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 ssize_t
 pg_pwrite_zeros(int fd, size_t size)
 {
-	PGAlignedBlock zbuffer;		/* worth BLCKSZ */
+	/* Zero-filled buffer worth BLCKSZ. */
+	static const PGAlignedBlock zbuffer = {0};
 	size_t		zbuffer_sz;
 	struct iovec iov[PG_IOV_MAX];
 	int			blocks;
@@ -548,15 +549,12 @@ pg_pwrite_zeros(int fd, size_t size)
 	ssize_t		written;
 	ssize_t		total_written = 0;
 
-	zbuffer_sz = sizeof(zbuffer.data);
-
-	/* Zero-fill the buffer. */
-	memset(zbuffer.data, 0, zbuffer_sz);
+	zbuffer_sz = BLCKSZ;
 
 	/* Prepare to write out a lot of copies of our zero buffer at once. */
 	for (i = 0; i < lengthof(iov); ++i)
 	{
-		iov[i].iov_base = zbuffer.data;
+		iov[i].iov_base = (void *) (unconstify(PGAlignedBlock *, )->data);
 		iov[i].iov_len = zbuffer_sz;
 	}
 
-- 
2.34.1

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-11 Thread Andres Freund

Hi,

On 2022-11-01 08:32:48 +0530, Bharath Rupireddy wrote:
> +/*
> + * pg_pwrite_zeros
> + *
> + * Writes zeros to a given file. Input parameters are "fd" (file descriptor 
> of
> + * the file), "size" (size of the file in bytes).
> + *
> + * On failure, a negative value is returned and errno is set appropriately so
> + * that the caller can use it accordingly.
> + */
> +ssize_t
> +pg_pwrite_zeros(int fd, size_t size)
> +{
> + PGAlignedBlock  zbuffer;
> + size_t  zbuffer_sz;
> + struct ioveciov[PG_IOV_MAX];
> + int blocks;
> + size_t  remaining_size = 0;
> + int i;
> + ssize_t written;
> + ssize_t total_written = 0;
> +
> + zbuffer_sz = sizeof(zbuffer.data);
> +
> + /* Zero-fill the buffer. */
> + memset(zbuffer.data, 0, zbuffer_sz);

I previously commented on this - why are we memseting a buffer on every call
to this?  That's not at all free.

Something like
static const PGAlignedBlock zerobuf = {0};
would do the trick.  You do need to cast the const away, to assign to
iov_base, but that's not too ugly.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-11 Thread John Naylor

On Fri, Nov 11, 2022 at 2:12 PM Michael Paquier  wrote:
>
> On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote:
> > On Fri, Nov 11, 2022 at 11:14 AM John Naylor <
john.nay...@enterprisedb.com> wrote:
> >> Was there supposed to be an attachment here?
> >
> > Nope. The patches have already been committed -
> > 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
> > 28cc2976a9cf0ed661dbc55f49f669192cce1c89.
>
> The committed patches are pretty much the same as the last version
> sent on this thread, except that the changes have been split across
> the files they locally impact, with a few simplifications tweaks to
> the comments.  Hencem I did not see any need to send a new version for
> this case.

Ah, I missed that -- sorry for the noise.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-10 Thread Michael Paquier

On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote:
> On Fri, Nov 11, 2022 at 11:14 AM John Naylor  
> wrote:
>> Was there supposed to be an attachment here?
> 
> Nope. The patches have already been committed -
> 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
> 28cc2976a9cf0ed661dbc55f49f669192cce1c89.

The committed patches are pretty much the same as the last version
sent on this thread, except that the changes have been split across
the files they locally impact, with a few simplifications tweaks to
the comments.  Hencem I did not see any need to send a new version for
this case.
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-10 Thread Bharath Rupireddy

On Fri, Nov 11, 2022 at 11:14 AM John Naylor
 wrote:
>
> On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier  wrote:
> >
> > So, I have looked at that, and at the end concluded that Andres'
> > suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
> > in the long run.  Thomas has mentioned upthread that some of the
> > comments don't need to be that long, so I have tweaked these to be
> > minimal, and updated a few more areas.  Note that this has been split
> > into two commits: one to introduce the new routine in file_utils.c and
> > a second for the switch in walmethods.c.
>
> Was there supposed to be an attachment here?

Nope. The patches have already been committed -
3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
28cc2976a9cf0ed661dbc55f49f669192cce1c89.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-10 Thread John Naylor

On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier  wrote:
>
> So, I have looked at that, and at the end concluded that Andres'
> suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
> in the long run.  Thomas has mentioned upthread that some of the
> comments don't need to be that long, so I have tweaked these to be
> minimal, and updated a few more areas.  Note that this has been split
> into two commits: one to introduce the new routine in file_utils.c and
> a second for the switch in walmethods.c.

Was there supposed to be an attachment here?

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-07 Thread Michael Paquier

On Tue, Nov 01, 2022 at 08:32:48AM +0530, Bharath Rupireddy wrote:
> I'm attaching the v9 patch from upthread here again for further review
> and to make CF bot happy.

So, I have looked at that, and at the end concluded that Andres'
suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
in the long run.  Thomas has mentioned upthread that some of the
comments don't need to be that long, so I have tweaked these to be
minimal, and updated a few more areas.  Note that this has been split
into two commits: one to introduce the new routine in file_utils.c and
a second for the switch in walmethods.c.
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-31 Thread Bharath Rupireddy

On Mon, Oct 31, 2022 at 11:50 AM Bharath Rupireddy
 wrote:
>
> On Mon, Oct 31, 2022 at 5:01 AM Michael Paquier  wrote:
> >
> > On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote:
> > > So I'm kinda proposing that we only do the forward struct initialization
> > > dance when it really saves on things -- in particular, when it helps
> > > avoid or reduce massive indirect header inclusion.
> >
> > Sure.
>
> I don't think including pg_iovec.h in file_utils.h is a good idea. I
> agree that pg_iovec.h is fairly a small header file but file_utils.h
> is included in 21 .c files, as of today and the file_utils.h footprint
> might increase in future. Therefore, I'd vote for forward struct
> initialization as it is on HEAD today.

I'm attaching the v9 patch from upthread here again for further review
and to make CF bot happy.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v9-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch
Description: Binary data

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-31 Thread Bharath Rupireddy

On Mon, Oct 31, 2022 at 5:01 AM Michael Paquier  wrote:
>
> On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote:
> > So I'm kinda proposing that we only do the forward struct initialization
> > dance when it really saves on things -- in particular, when it helps
> > avoid or reduce massive indirect header inclusion.
>
> Sure.

I don't think including pg_iovec.h in file_utils.h is a good idea. I
agree that pg_iovec.h is fairly a small header file but file_utils.h
is included in 21 .c files, as of today and the file_utils.h footprint
might increase in future. Therefore, I'd vote for forward struct
initialization as it is on HEAD today.

> >  extern ssize_t pg_pwritev_with_retry(int fd,
> > - const struct iovec *iov,
> > + const iovec *iov,
> >   int iovcnt,
> >   off_t offset);
>
> However this still needs to be defined as a struct, no?

Yes, we need a struct there because we haven't typedef'ed struct iovec.

Also, the patch forgets to remove "port/pg_iovec.h" from file_utils.c

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-30 Thread Michael Paquier

On Sun, Oct 30, 2022 at 03:44:32PM +0100, Alvaro Herrera wrote:
> So I'm kinda proposing that we only do the forward struct initialization
> dance when it really saves on things -- in particular, when it helps
> avoid or reduce massive indirect header inclusion.

Sure.

>  extern ssize_t pg_pwritev_with_retry(int fd,
> - const struct iovec *iov,
> + const iovec *iov,
>   int iovcnt,
>   off_t offset);

However this still needs to be defined as a struct, no?
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-30 Thread Alvaro Herrera

On 2022-Oct-27, Michael Paquier wrote:

> On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> > Looks reasonable to me.
> 
> 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
> applied.

Maybe something a bit useless, but while perusing the commits I noticed
a forward struct declaration was moved from one file to another; this is
claimed to avoid including port/pg_iovec.h in common/file_utils.h.  We
do that kind of thing in a few places, but in this particular case it
seems a bit of a pointless exercise, since pg_iovec.h doesn't include
anything else and it's a quite simple header.

So I'm kinda proposing that we only do the forward struct initialization
dance when it really saves on things -- in particular, when it helps
avoid or reduce massive indirect header inclusion.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From c170367e5ac7dca2dcd11b5cbb6d351b1b205842 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Sun, 30 Oct 2022 15:22:24 +0100
Subject: [PATCH] No need to avoid including pg_iovec.h

---
 src/include/common/file_utils.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 2c5dbcb0b1..20fe5806fb 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -15,6 +15,8 @@

 #include 

+#include "port/pg_iovec.h"
+
 typedef enum PGFileType
 {
 	PGFILETYPE_ERROR,
@@ -24,7 +26,6 @@ typedef enum PGFileType
 	PGFILETYPE_LNK
 } PGFileType;

-struct iovec;	/* avoid including port/pg_iovec.h here */

 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
@@ -40,7 +41,7 @@ extern PGFileType get_dirent_type(const char *path,
   int elevel);

 extern ssize_t pg_pwritev_with_retry(int fd,
-	 const struct iovec *iov,
+	 const iovec *iov,
 	 int iovcnt,
 	 off_t offset);

-- 
2.30.2

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-29 Thread Bharath Rupireddy

On Sat, Oct 29, 2022 at 3:37 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-28 11:09:38 +0900, Michael Paquier wrote:
> > On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > > The block sizes don't need to match, do they? As long as the block is 
> > > properly
> > > aligned, we can change the iov_len of the final iov to match whatever the 
> > > size
> > > is being passed in, no?
> >
> > Hmm.  Based on what Bharath has written upthread, it does not seem to
> > matter if the size of the aligned block changes, either:
> > https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com
> >
> > I am honestly not sure whether it is a good idea to make file_utils.c
> > depend on one of the compile-time page sizes in this routine, be it
> > the page size of the WAL page size, as pg_write_zeros() would be used
> > for some rather low-level operations.  But we could as well just use a
> > locally-defined structure with a buffer at 4kB or 8kB and call it a
> > day?
>
> Shrug. I don't think we gain much by having yet another PGAlignedXYZBlock.

Hm. I tend to agree with Andres here, i.e. using PGAlignedBlock is
sufficient. It seems like we are using PGAlignedBlock for heap, index,
history file, fsm, visibility map file pages as well.

Please see the attached v9 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 16ab6ae0b00ebcd6af017299a70bb687c83b2885 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 29 Oct 2022 06:15:36 +
Subject: [PATCH v9] Use pg_pwritev_with_retry() instead of write() in
 walmethods.c

Use pg_pwritev_with_retry() while prepadding a WAL segment
instead of write() in walmethods.c dir_open_for_write() to avoid
partial writes. As the pg_pwritev_with_retry() function uses
pwritev, we can avoid explicit lseek() on non-Windows platforms
to seek to the beginning of the WAL segment. It looks like on
Windows, we need an explicit lseek() call here despite using
pwrite() implementation from win32pwrite.c. Otherwise
an error occurs.

These changes are inline with how core postgres initializes the
WAL segment in XLogFileInitInternal().

Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Michael Paquier
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c  | 33 ++--
 src/bin/pg_basebackup/walmethods.c | 26 +-
 src/common/file_utils.c| 80 ++
 src/include/common/file_utils.h|  3 ++
 4 files changed, 102 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..15dd5ce834 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 bool *added, char *path)
 {
 	char		tmppath[MAXPGPATH];
-	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
@@ -2965,14 +2964,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 (errcode_for_file_access(),
  errmsg("could not create file \"%s\": %m", tmppath)));
 
-	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
 	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
+		ssize_t rc;
 
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
@@ -2983,29 +2979,10 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
+		rc = pg_pwrite_zeros(fd, wal_segment_size);
 
-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
-		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int			iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
-			{
-save_errno = errno;
-break;
-			}
-
-			i += iovcnt;
-		}
+		if (rc < 0)
+			save_errno = errno;
 	}
 	else
 	{
@@ -3014,7 +2991,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * enough.
 		 */
 		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+		if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1)
 		{
 			/* if write didn't set errno, assume no disk space */
 			save_errno = errno ? errno : ENOSPC;
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-28 Thread Andres Freund

Hi,

On 2022-10-28 11:09:38 +0900, Michael Paquier wrote:
> On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > The block sizes don't need to match, do they? As long as the block is 
> > properly
> > aligned, we can change the iov_len of the final iov to match whatever the 
> > size
> > is being passed in, no?
> 
> Hmm.  Based on what Bharath has written upthread, it does not seem to
> matter if the size of the aligned block changes, either:
> https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com
> 
> I am honestly not sure whether it is a good idea to make file_utils.c
> depend on one of the compile-time page sizes in this routine, be it
> the page size of the WAL page size, as pg_write_zeros() would be used
> for some rather low-level operations.  But we could as well just use a
> locally-defined structure with a buffer at 4kB or 8kB and call it a
> day?

Shrug. I don't think we gain much by having yet another PGAlignedXYZBlock.

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-28 Thread Michael Paquier

On Fri, Oct 28, 2022 at 11:38:51AM +0530, Bharath Rupireddy wrote:
> +1. Please see the attached v8 patch.

+   chardata[PG_WRITE_BLCKSZ];
+   double  force_align_d;
+   int64   force_align_i64;
+} PGAlignedWriteBlock;
I have not checked in details, but that should do the job.

Andres, Thomas, Nathan, perhaps you have a different view on the
matter?
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-28 Thread Bharath Rupireddy

On Fri, Oct 28, 2022 at 7:39 AM Michael Paquier  wrote:
>
> On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > The block sizes don't need to match, do they? As long as the block is
properly
> > aligned, we can change the iov_len of the final iov to match whatever
the size
> > is being passed in, no?
>
> Hmm.  Based on what Bharath has written upthread, it does not seem to
> matter if the size of the aligned block changes, either:
>
https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com
>
> I am honestly not sure whether it is a good idea to make file_utils.c
> depend on one of the compile-time page sizes in this routine, be it
> the page size of the WAL page size, as pg_write_zeros() would be used
> for some rather low-level operations.  But we could as well just use a
> locally-defined structure with a buffer at 4kB or 8kB and call it a
> day?

+1. Please see the attached v8 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 88d67c6e3e92a62ac4c783a137759b2f3795cb21 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 28 Oct 2022 04:44:41 +
Subject: [PATCH v8] Use pg_pwritev_with_retry() instead of write() in
 walmethods.c

Use pg_pwritev_with_retry() while prepadding a WAL segment
instead of write() in walmethods.c dir_open_for_write() to avoid
partial writes. As the pg_pwritev_with_retry() function uses
pwritev, we can avoid explicit lseek() on non-Windows platforms
to seek to the beginning of the WAL segment. It looks like on
Windows, we need an explicit lseek() call here despite using
pwrite() implementation from win32pwrite.c. Otherwise
an error occurs.

These changes are inline with how core postgres initializes the
WAL segment in XLogFileInitInternal().

Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Michael Paquier
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c  | 33 ++-
 src/bin/pg_basebackup/walmethods.c | 26 +
 src/common/file_utils.c| 92 ++
 src/include/common/file_utils.h|  3 +
 4 files changed, 114 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..15dd5ce834 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 bool *added, char *path)
 {
 	char		tmppath[MAXPGPATH];
-	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
@@ -2965,14 +2964,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 (errcode_for_file_access(),
  errmsg("could not create file \"%s\": %m", tmppath)));
 
-	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
 	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
+		ssize_t rc;
 
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
@@ -2983,29 +2979,10 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
+		rc = pg_pwrite_zeros(fd, wal_segment_size);
 
-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
-		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int			iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
-			{
-save_errno = errno;
-break;
-			}
-
-			i += iovcnt;
-		}
+		if (rc < 0)
+			save_errno = errno;
 	}
 	else
 	{
@@ -3014,7 +2991,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * enough.
 		 */
 		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+		if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1)
 		{
 			/* if write didn't set errno, assume no disk space */
 			save_errno = errno ? errno : ENOSPC;
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index bc2e83d02b..18bad52c96 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -220,22 +220,24 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	/* Do pre-padding on non-compressed files */
 	if (pad_to_size && wwmethod->compression_algorithm == PG_COMPRESSION_NONE)
 	{
-		PGAlignedXLogBlock zerobuf;
-		int			bytes;
+		ssize_t rc;

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-27 Thread Michael Paquier

On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> The block sizes don't need to match, do they? As long as the block is properly
> aligned, we can change the iov_len of the final iov to match whatever the size
> is being passed in, no?

Hmm.  Based on what Bharath has written upthread, it does not seem to
matter if the size of the aligned block changes, either:
https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com

I am honestly not sure whether it is a good idea to make file_utils.c
depend on one of the compile-time page sizes in this routine, be it
the page size of the WAL page size, as pg_write_zeros() would be used
for some rather low-level operations.  But we could as well just use a
locally-defined structure with a buffer at 4kB or 8kB and call it a
day?
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-27 Thread Andres Freund

Hi,

Interestingly, I also needed something like pg_pwrite_zeros() today. Exposed
via smgr, for more efficient relation extensions.

On 2022-10-27 14:54:00 +0900, Michael Paquier wrote:
> Regarding 0002, using pg_pwrite_zeros() as a routine name, as
> suggested by Thomas, sounds good to me.  However, I am not really a
> fan of its dependency with PGAlignedXLogBlock, because it should be
> able to work with any buffers of any sizes, as long as the input
> buffer is aligned, shouldn't it?  For example, what about
> PGAlignedBlock?  So, should we make this more extensible?  My guess
> would be the addition of the block size and the block pointer to the
> arguments of pg_pwrite_zeros(), in combination with a check to make
> sure that the input buffer is MAXALIGN()'d (with an Assert() rather
> than just an elog/pg_log_error?).

I don't like passing in the buffer. That leads to code like in Bharat's latest
version, where we now zero that buffer on every invocation of
pg_pwrite_zeros() - not at all cheap. And every caller has to have provisions
to provide that buffer.

The block sizes don't need to match, do they? As long as the block is properly
aligned, we can change the iov_len of the final iov to match whatever the size
is being passed in, no?

Why don't we define a

static PGAlignedBlock source_of_zeroes;

in file_utils.c, and use that in pg_pwrite_zeros(), being careful to set the
iov_len arguments correctly?

Greetings,

Andres Freund

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-27 Thread Bharath Rupireddy

On Thu, Oct 27, 2022 at 11:24 AM Michael Paquier  wrote:
>
> On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> > Looks reasonable to me.
>
> 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
> applied.

Thanks.

> Regarding 0002, using pg_pwrite_zeros() as a routine name, as
> suggested by Thomas, sounds good to me.

Changed.

> However, I am not really a
> fan of its dependency with PGAlignedXLogBlock, because it should be
> able to work with any buffers of any sizes, as long as the input
> buffer is aligned, shouldn't it?  For example, what about
> PGAlignedBlock?  So, should we make this more extensible? My guess
> would be the addition of the block size and the block pointer to the
> arguments of pg_pwrite_zeros(), in combination with a check to make
> sure that the input buffer is MAXALIGN()'d (with an Assert() rather
> than just an elog/pg_log_error?).

+1 to pass in the aligned buffer, its size and an assertion on the buffer size.

Please see the attached v7 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 1d81929de5f5b56d39fc7f1273cdb4c1e36337bb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 27 Oct 2022 08:51:07 +
Subject: [PATCH v7] Use pg_pwritev_with_retry() instead of write() in
 walmethods.c

Use pg_pwritev_with_retry() while prepadding a WAL segment
instead of write() in walmethods.c dir_open_for_write() to avoid
partial writes. As the pg_pwritev_with_retry() function uses
pwritev, we can avoid explicit lseek() on non-Windows platforms
to seek to the beginning of the WAL segment. It looks like on
Windows, we need an explicit lseek() call here despite using
pwrite() implementation from win32pwrite.c. Otherwise
an error occurs.

These changes are inline with how core postgres initializes the
WAL segment in XLogFileInitInternal().

Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Michael Paquier
Reviewed-by: Thomas Munro
Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c  | 35 +++---
 src/bin/pg_basebackup/walmethods.c | 28 ++-
 src/common/file_utils.c| 77 ++
 src/include/common/file_utils.h|  5 ++
 4 files changed, 105 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..638d6d448b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 bool *added, char *path)
 {
 	char		tmppath[MAXPGPATH];
-	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
@@ -2965,14 +2964,12 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 (errcode_for_file_access(),
  errmsg("could not create file \"%s\": %m", tmppath)));
 
-	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
 	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
+		PGAlignedXLogBlock zbuffer;
+		ssize_t rc;
 
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
@@ -2983,29 +2980,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
+		rc = pg_pwrite_zeros(fd, wal_segment_size, zbuffer.data,
+			 sizeof(zbuffer.data));
 
-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
-		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int			iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
-			{
-save_errno = errno;
-break;
-			}
-
-			i += iovcnt;
-		}
+		if (rc < 0)
+			save_errno = errno;
 	}
 	else
 	{
@@ -3014,7 +2993,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * enough.
 		 */
 		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+		if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1)
 		{
 			/* if write didn't set errno, assume no disk space */
 			save_errno = errno ? errno : ENOSPC;
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index bc2e83d02b..f7af73c01b 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -220,22 +220,26 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	/* Do pre-padding on non-compressed files */
 	if (pad_to_size

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-10-26 Thread Michael Paquier

On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote:
> Looks reasonable to me.

0001, to move pg_pwritev_with_retry() to a new home, seems fine, so
applied.

Regarding 0002, using pg_pwrite_zeros() as a routine name, as
suggested by Thomas, sounds good to me.  However, I am not really a
fan of its dependency with PGAlignedXLogBlock, because it should be
able to work with any buffers of any sizes, as long as the input
buffer is aligned, shouldn't it?  For example, what about
PGAlignedBlock?  So, should we make this more extensible?  My guess
would be the addition of the block size and the block pointer to the 
arguments of pg_pwrite_zeros(), in combination with a check to make
sure that the input buffer is MAXALIGN()'d (with an Assert() rather
than just an elog/pg_log_error?).
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Nathan Bossart

On Fri, Sep 30, 2022 at 08:09:04AM +0530, Bharath Rupireddy wrote:
> I'm attaching the v6 patch set, please review it further.

Looks reasonable to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Bharath Rupireddy

On Fri, Sep 30, 2022 at 6:46 AM Thomas Munro  wrote:
>
> +1 for just doing it always, with a one-liner comment like
> "pg_pwritev*() might move the file position".  No reason to spam the
> source tree with more explanations of the exact reason.

+1 for resetting the file position in a platform-independent manner.
But, a description there won't hurt IMO and it saves time for the
hackers who spend time there and think why it's that way.

> If someone
> ever comes up with another case where p- and non-p- I/O functions are
> intermixed and it's really worth saving a system call (don't get me
> wrong, we call lseek() an obscene amount elsewhere and I'd like to fix
> that, but this case isn't hot?) then I like your idea of a macro to
> tell you whether you need to.

I don't think we go that route as the code isn't a hot path and an
extra system call wouldn't hurt performance much, a comment there
should work.

> Earlier I wondered why we'd want to include "pg_pwritev" in the name
> of this zero-filling function (pwritev being an internal
> implementation detail), but now it seems like maybe a good idea
> because it highlights the file position portability problem by being a
> member of that family of similarly-named functions.

Hm.

On Thu, Sep 29, 2022 at 10:57 PM Nathan Bossart
 wrote:
>
> +iov[0].iov_base = zbuffer.data;
>
> This seems superfluous, but I don't think it's hurting anything.

Yes, I removed it. Adding a comment, something like [1], would make it
more verbose, hence I've not added.

I'm attaching the v6 patch set, please review it further.

[1]
/*
 * Use the first vector buffer to write the remaining size. Note that
 * zero buffer was already pointed to it above, hence just specifying
 * the size is enough here.
 */

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 7ae3f77c0444e6c5383e13a502457813b850fd08 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 30 Sep 2022 01:26:03 +
Subject: [PATCH v6] Move pg_pwritev_with_retry() to file_utils.c

Move pg_pwritev_with_retry() from file/fd.c to common/file_utils.c
so that non-backend code or FRONTEND cases can also use it.

Author: Bharath Rupireddy
Reviewed-by: Thomas Munro
Reviewed-by: Nathan Bossart
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw%40mail.gmail.com
---
 src/backend/storage/file/fd.c   | 65 -
 src/common/file_utils.c | 65 +
 src/include/common/file_utils.h |  7 
 src/include/storage/fd.h|  6 ---
 4 files changed, 72 insertions(+), 71 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e4d954578c..4151cafec5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -93,7 +93,6 @@
 #include "common/pg_prng.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "port/pg_iovec.h"
 #include "portability/mem.h"
 #include "postmaster/startup.h"
 #include "storage/fd.h"
@@ -3738,67 +3737,3 @@ data_sync_elevel(int elevel)
 {
 	return data_sync_retry ? elevel : PANIC;
 }
-
-/*
- * A convenience wrapper for pg_pwritev() that retries on partial write.  If an
- * error is returned, it is unspecified how much has been written.
- */
-ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
-{
-	struct iovec iov_copy[PG_IOV_MAX];
-	ssize_t		sum = 0;
-	ssize_t		part;
-
-	/* We'd better have space to make a copy, in case we need to retry. */
-	if (iovcnt > PG_IOV_MAX)
-	{
-		errno = EINVAL;
-		return -1;
-	}
-
-	for (;;)
-	{
-		/* Write as much as we can. */
-		part = pg_pwritev(fd, iov, iovcnt, offset);
-		if (part < 0)
-			return -1;
-
-#ifdef SIMULATE_SHORT_WRITE
-		part = Min(part, 4096);
-#endif
-
-		/* Count our progress. */
-		sum += part;
-		offset += part;
-
-		/* Step over iovecs that are done. */
-		while (iovcnt > 0 && iov->iov_len <= part)
-		{
-			part -= iov->iov_len;
-			++iov;
-			--iovcnt;
-		}
-
-		/* Are they all done? */
-		if (iovcnt == 0)
-		{
-			/* We don't expect the kernel to write more than requested. */
-			Assert(part == 0);
-			break;
-		}
-
-		/*
-		 * Move whatever's left to the front of our mutable copy and adjust
-		 * the leading iovec.
-		 */
-		Assert(iovcnt > 0);
-		memmove(iov_copy, iov, sizeof(*iov) * iovcnt);
-		Assert(iov->iov_len > part);
-		iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part;
-		iov_copy[0].iov_len -= part;
-		iov = iov_copy;
-	}
-
-	return sum;
-}
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index df4d6d240c..4a4bdc31c4 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -28,6 +28,7 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "port/pg_iovec.h"
 
 #ifdef FRONTEND
 
@@ -460,3 +461,67 @@ get_dirent_type(const char *path,
 
 	return result;
 }
+
+/*
+ * A

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Thomas Munro

On Fri, Sep 30, 2022 at 6:27 AM Nathan Bossart  wrote:
> On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote:
> > On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart  
> > wrote:
> >>
> >> +PGAlignedXLogBlock zbuffer;
> >> +
> >> +memset(zbuffer.data, 0, XLOG_BLCKSZ);
> >>
> >> This seems excessive for only writing a single byte.
> >
> > Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
> > wal_segment_size - 1).
>
> I don't think removing the use of PGAlignedXLogBlock here introduces any
> sort of alignment risk, so this should be alright.
>
> +#ifdef WIN32
> +/*
> + * WriteFile() on Windows changes the current file position, hence we
> + * need an explicit lseek() here. See pg_pwrite() implementation in
> + * win32pwrite.c for more details.
> + */
>
> Should we really surround this with a WIN32 check, or should we just
> unconditionally lseek() here?  I understand that this helps avoid an extra
> system call on many platforms, but in theory another platform introduced in
> the future could have the same problem, and this seems like something that
> could easily be missed.  Presumably we could do something fancier to
> indicate pg_pwrite()'s behavior in this regard, but I don't know if that
> sort of complexity is really worth it in order to save an lseek().

+1 for just doing it always, with a one-liner comment like
"pg_pwritev*() might move the file position".  No reason to spam the
source tree with more explanations of the exact reason.  If someone
ever comes up with another case where p- and non-p- I/O functions are
intermixed and it's really worth saving a system call (don't get me
wrong, we call lseek() an obscene amount elsewhere and I'd like to fix
that, but this case isn't hot?) then I like your idea of a macro to
tell you whether you need to.

Earlier I wondered why we'd want to include "pg_pwritev" in the name
of this zero-filling function (pwritev being an internal
implementation detail), but now it seems like maybe a good idea
because it highlights the file position portability problem by being a
member of that family of similarly-named functions.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Nathan Bossart

On Thu, Sep 29, 2022 at 11:32:32AM +0530, Bharath Rupireddy wrote:
> On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart  
> wrote:
>>
>> +PGAlignedXLogBlock zbuffer;
>> +
>> +memset(zbuffer.data, 0, XLOG_BLCKSZ);
>>
>> This seems excessive for only writing a single byte.
> 
> Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
> wal_segment_size - 1).

I don't think removing the use of PGAlignedXLogBlock here introduces any
sort of alignment risk, so this should be alright.

+#ifdef WIN32
+/*
+ * WriteFile() on Windows changes the current file position, hence we
+ * need an explicit lseek() here. See pg_pwrite() implementation in
+ * win32pwrite.c for more details.
+ */

Should we really surround this with a WIN32 check, or should we just
unconditionally lseek() here?  I understand that this helps avoid an extra
system call on many platforms, but in theory another platform introduced in
the future could have the same problem, and this seems like something that
could easily be missed.  Presumably we could do something fancier to
indicate pg_pwrite()'s behavior in this regard, but I don't know if that
sort of complexity is really worth it in order to save an lseek().

+iov[0].iov_base = zbuffer.data;

This seems superfluous, but I don't think it's hurting anything.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-29 Thread Bharath Rupireddy

On Sat, Sep 24, 2022 at 1:54 AM Nathan Bossart  wrote:
>
> +PGAlignedXLogBlock zbuffer;
> +
> +memset(zbuffer.data, 0, XLOG_BLCKSZ);
>
> This seems excessive for only writing a single byte.

Yes, I removed it now, instead doing pg_pwrite(fd, "\0", 1,
wal_segment_size - 1).

> +#ifdef WIN32
> +/*
> + * XXX: It looks like on Windows, we need an explicit lseek() call 
> here
> + * despite using pwrite() implementation from win32pwrite.c. 
> Otherwise
> + * an error occurs.
> + */
>
> I think this comment is too vague.  Can we describe the error in more
> detail?  Or better yet, can we fix it as a prerequisite to this patch set?

The commit b6d8a60aba322678585ebe11dab072a37ac32905 brings back
pg_pwrite() and its friends. This puts the responsibility of doing
lseek(SEEK_SET) on the callers if they wish to.

Please see the v5 patch set and review it further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v5-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patch
Description: Binary data


v5-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch
Description: Binary data

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-28 Thread Thomas Munro

On Wed, Sep 28, 2022 at 5:43 PM Bharath Rupireddy
 wrote:
> On Wed, Sep 28, 2022 at 12:32 AM Thomas Munro  wrote:
> > FWIW I doubt the OS itself will change released
> > behaviour like that, but I did complain about the straight-up
> > misleading documentation and they agreed and fixed it[1].
> >
> > [1] https://github.com/MicrosoftDocs/sdk-api/pull/1309
>
> Great! Is there any plan to request for change in WriteFile() to not
> alter file position?

Not from me.  I stick to open source problems.  Reporting bugs in
documentation is legitimate self defence, though.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-27 Thread Bharath Rupireddy

On Wed, Sep 28, 2022 at 12:32 AM Thomas Munro  wrote:
>
> On Wed, Sep 28, 2022 at 1:03 AM Bharath Rupireddy
>  wrote:>
> > On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro  wrote:
> > > Something like the attached.
> >
> > Isn't it also better to add notes in win32pread.c and win32pwrite.c
> > about the callers doing lseek(SEEK_SET) if they wish to and Windows
> > implementations changing file position? We can also add a TODO item
> > about replacing pg_ versions with pread and friends someday when
> > Windows fixes the issue? Having it in the commit and include/port.h is
> > good, but the comments in win32pread.c and win32pwrite.c make life
> > easier IMO. Otherwise, the patch LGTM.
>
> Thanks, will do.

I'm looking forward to getting it in.

> FWIW I doubt the OS itself will change released
> behaviour like that, but I did complain about the straight-up
> misleading documentation and they agreed and fixed it[1].
>
> [1] https://github.com/MicrosoftDocs/sdk-api/pull/1309

Great! Is there any plan to request for change in WriteFile() to not
alter file position?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-27 Thread Thomas Munro

On Wed, Sep 28, 2022 at 1:03 AM Bharath Rupireddy
 wrote:>
> On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro  wrote:
> > Something like the attached.
>
> Isn't it also better to add notes in win32pread.c and win32pwrite.c
> about the callers doing lseek(SEEK_SET) if they wish to and Windows
> implementations changing file position? We can also add a TODO item
> about replacing pg_ versions with pread and friends someday when
> Windows fixes the issue? Having it in the commit and include/port.h is
> good, but the comments in win32pread.c and win32pwrite.c make life
> easier IMO. Otherwise, the patch LGTM.

Thanks, will do.  FWIW I doubt the OS itself will change released
behaviour like that, but I did complain about the straight-up
misleading documentation and they agreed and fixed it[1].

After some looking around, the only way I could find to avoid this
behaviour is to switch to async handles, which do behave as documented
in this respect, but then you can't use plain read/write at all unless
you write replacement wrappers for those too, and AFAICT that adds
more system calls (start write, wait for write to finish) and
complexity/weirdness without any real payoff so it seems like the
wrong direction, at least without more research that I'm not pursuing
currently.  (FWIW in AIO porting experiments a while back we used
async handles to get IOs running concurrently with CPU work and
possibly other IOs so it was actually worth doing, but that was only
for smgr and the main wal writing code where there's no intermixed
plain read/write calls as you have here, so the problem doesn't even
come up.)

[1] https://github.com/MicrosoftDocs/sdk-api/pull/1309

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-27 Thread Bharath Rupireddy

On Tue, Sep 27, 2022 at 3:01 PM Thomas Munro  wrote:
>
> Something like the attached.

Isn't it also better to add notes in win32pread.c and win32pwrite.c
about the callers doing lseek(SEEK_SET) if they wish to and Windows
implementations changing file position? We can also add a TODO item
about replacing pg_ versions with pread and friends someday when
Windows fixes the issue? Having it in the commit and include/port.h is
good, but the comments in win32pread.c and win32pwrite.c make life
easier IMO. Otherwise, the patch LGTM.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-27 Thread Thomas Munro

On Tue, Sep 27, 2022 at 6:43 PM Bharath Rupireddy
 wrote:
> On Tue, Sep 27, 2022 at 3:08 AM Thomas Munro  wrote:
> >
> > I don't think so, that's an extra kernel call.  I think I'll just have
> > to revert part of my recent change that removed the pg_ prefix from
> > those function names in our code, and restore the comment that warns
> > you about the portability hazard (I thought it went away with HP-UX
> > 10, where we were literally calling lseek() before every write()).
> > The majority of users of these functions don't intermix them with
> > calls to read()/write(), so they don't care about the file position,
> > so I think it's just something we'll have to continue to be mindful of
> > in the places that do.
>
> Yes, all of the existing pwrite() callers don't care about the file
> position, but the new callers such as the actual idea and patch
> proposed here in this thread cares.
>
> Is this the commit cf112c122060568aa06efe4e6e6fb9b2dd4f1090 part of
> which [1] you're planning to revert?

Yeah, just the renaming parts of that.  The lseek()-based emulation is
definitely not coming back.  Something like the attached.
From fc72be32668f0c37b899fd922d6986f39d8a6a2a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 27 Sep 2022 21:12:03 +1300
Subject: [PATCH] Restore pg_pread and friends.

Commit cf112c12 was a little too hasty in getting rid of the pg_
prefixes where we use pread(), pwrite() and vectored variants.  We
dropped support for ancient Unixes that needed to use lseek() to
implement replacements for those, but it turned out that Windows also
changes the current position even when you pass in an offset to
ReadFile() and WriteFile(), if the file handle is synchronous.

Switching to asynchronous file handles would fix that, but have other
complications.  For now let's just put back the pg_ prefix and add some
comments to highlight the non-standard side-effect.

Reported-by: Bharath Rupireddy 
Discussion: https://postgr.es/m/20220923202439.GA1156054%40nathanxps13
---
 .../pg_stat_statements/pg_stat_statements.c   |  4 +-
 src/backend/access/heap/rewriteheap.c |  2 +-
 src/backend/access/transam/slru.c |  4 +-
 src/backend/access/transam/xlog.c |  4 +-
 src/backend/access/transam/xlogreader.c   |  2 +-
 src/backend/access/transam/xlogrecovery.c |  2 +-
 src/backend/backup/basebackup.c   |  2 +-
 src/backend/replication/walreceiver.c |  2 +-
 src/backend/storage/file/fd.c |  8 +--
 src/backend/utils/init/miscinit.c |  2 +-
 src/bin/pg_test_fsync/pg_test_fsync.c | 50 +--
 src/include/access/xlogreader.h   |  4 +-
 src/include/port.h|  9 
 src/include/port/pg_iovec.h   | 13 -
 src/include/port/win32_port.h |  4 +-
 src/port/preadv.c |  4 +-
 src/port/pwritev.c|  4 +-
 src/port/win32pread.c |  2 +-
 src/port/win32pwrite.c|  2 +-
 19 files changed, 71 insertions(+), 53 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
b/contrib/pg_stat_statements/pg_stat_statements.c
index ba868f0de9..73439c0199 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2103,9 +2103,9 @@ qtext_store(const char *query, int query_len,
if (fd < 0)
goto error;
 
-   if (pwrite(fd, query, query_len, off) != query_len)
+   if (pg_pwrite(fd, query, query_len, off) != query_len)
goto error;
-   if (pwrite(fd, "\0", 1, off + query_len) != 1)
+   if (pg_pwrite(fd, "\0", 1, off + query_len) != 1)
goto error;
 
CloseTransientFile(fd);
diff --git a/src/backend/access/heap/rewriteheap.c 
b/src/backend/access/heap/rewriteheap.c
index 2f08fbe8d3..b01b39b008 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1150,7 +1150,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
/* write out tail end of mapping file (again) */
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE);
-   if (pwrite(fd, data, len, xlrec->offset) != len)
+   if (pg_pwrite(fd, data, len, xlrec->offset) != len)
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
diff --git a/src/backend/access/transam/slru.c 
b/src/backend/access/transam/slru.c
index c9a7b97949..b65cb49d7f 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -718,7 +718,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
-   if (pread(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ)
+   if (pg_pread(fd,

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-26 Thread Bharath Rupireddy

On Tue, Sep 27, 2022 at 3:08 AM Thomas Munro  wrote:
>
> I don't think so, that's an extra kernel call.  I think I'll just have
> to revert part of my recent change that removed the pg_ prefix from
> those function names in our code, and restore the comment that warns
> you about the portability hazard (I thought it went away with HP-UX
> 10, where we were literally calling lseek() before every write()).
> The majority of users of these functions don't intermix them with
> calls to read()/write(), so they don't care about the file position,
> so I think it's just something we'll have to continue to be mindful of
> in the places that do.

Yes, all of the existing pwrite() callers don't care about the file
position, but the new callers such as the actual idea and patch
proposed here in this thread cares.

Is this the commit cf112c122060568aa06efe4e6e6fb9b2dd4f1090 part of
which [1] you're planning to revert? If so, will we have the
pg_{pwrite, pread} back? IIUC, we don't have lseek(SEEK_SET) in
pg_{pwrite, pread} right? It is the callers responsibility to set the
file position correctly if they wish to, isn't it? Oftentimes,
developers miss the notes in the function comments and use these
functions expecting them to not change file position which works well
on non-Windows platforms but fails on Windows.

This makes me think that we can have pwrite(), pread() introduced by
cf112c122060568aa06efe4e6e6fb9b2dd4f1090 as-is and re-introduce
pg_{pwrite, pread} with pwrite()/pread()+lseek(SEEK_SET) in
win32pwrite.c and win32pread.c. These functions reduce the caller's
efforts and reduce the duplicate code. If okay, I'm happy to lend my
hands on this patch.

Thoughts?

> Unless, that is, I can find a way to stop it from doing that...  I've
> added this to my Windows to-do list.  I am going to have a round of
> Windows hacking quite soon.

Thanks! I think it's time for me to start a new thread just for this
to get more attention and opinion from other hackers and not to
sidetrack the original idea and patch proposed in this thread.

[1] 
https://github.com/postgres/postgres/blob/REL_14_STABLE/src/port/pwrite.c#L11

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-26 Thread Nathan Bossart

On Tue, Sep 27, 2022 at 10:37:38AM +1300, Thomas Munro wrote:
> I don't think so, that's an extra kernel call.  I think I'll just have
> to revert part of my recent change that removed the pg_ prefix from
> those function names in our code, and restore the comment that warns
> you about the portability hazard (I thought it went away with HP-UX
> 10, where we were literally calling lseek() before every write()).
> The majority of users of these functions don't intermix them with
> calls to read()/write(), so they don't care about the file position,
> so I think it's just something we'll have to continue to be mindful of
> in the places that do.

Ah, you're right, it's probably best to avoid the extra system call for the
majority of callers that don't care about the file position.  I retract my
previous message.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-26 Thread Thomas Munro

On Tue, Sep 27, 2022 at 10:27 AM Nathan Bossart
 wrote:
> On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote:
> > Irrespective of what Windows does with file pointers in WriteFile(),
> > should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
> > something like [5]? This is rather hackish without fully knowing what
> > Windows does internally in WriteFile(), but this does fix inherent
> > issues that our pwrite() callers (there are quite a number of places
> > that use pwrite() and presumes file pointer doesn't change on Windows)
> > may have on Windows. See the regression tests passing [6] with the fix
> > [5].
>
> I think so.  I don't see why we would rather have each caller ensure
> pwrite() behaves as documented.

I don't think so, that's an extra kernel call.  I think I'll just have
to revert part of my recent change that removed the pg_ prefix from
those function names in our code, and restore the comment that warns
you about the portability hazard (I thought it went away with HP-UX
10, where we were literally calling lseek() before every write()).
The majority of users of these functions don't intermix them with
calls to read()/write(), so they don't care about the file position,
so I think it's just something we'll have to continue to be mindful of
in the places that do.

Unless, that is, I can find a way to stop it from doing that...  I've
added this to my Windows to-do list.  I am going to have a round of
Windows hacking quite soon.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-26 Thread Nathan Bossart

On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote:
> Irrespective of what Windows does with file pointers in WriteFile(),
> should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
> something like [5]? This is rather hackish without fully knowing what
> Windows does internally in WriteFile(), but this does fix inherent
> issues that our pwrite() callers (there are quite a number of places
> that use pwrite() and presumes file pointer doesn't change on Windows)
> may have on Windows. See the regression tests passing [6] with the fix
> [5].

I think so.  I don't see why we would rather have each caller ensure
pwrite() behaves as documented.

> +   /*
> +* On Windows, it is found that WriteFile() changes the file
> pointer and we
> +* want pwrite() to not change. Hence, we explicitly reset the
> file pointer
> +* to beginning of the file.
> +*/
> +   if (lseek(fd, 0, SEEK_SET) != 0)
> +   {
> +   _dosmaperr(GetLastError());
> +   return -1;
> +   }
> +
> return result;
>  }

Why reset to the beginning of the file?  Shouldn't we reset it to what it
was before the call to pwrite()?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-26 Thread Bharath Rupireddy

On Sat, Sep 24, 2022 at 9:44 AM Thomas Munro  wrote:
>
> Although WriteFile() with a synchronous file handle and an explicit
> offset doesn't use the current file position, it appears that it still
> changes it.  :-(
>
> You'd think from the documentation[1] that that isn't the case, because it 
> says:
>
> "Considerations for working with synchronous file handles:
>
>  * If lpOverlapped is NULL, the write operation starts at the current
> file position and WriteFile does not return until the operation is
> complete, and the system updates the file pointer before WriteFile
> returns.
>
>  * If lpOverlapped is not NULL, the write operation starts at the
> offset that is specified in the OVERLAPPED structure and WriteFile
> does not return until the write operation is complete. The system
> updates the OVERLAPPED Internal and InternalHigh fields before
> WriteFile returns."
>
> So it's explicitly saying the file pointer is updated in the first
> bullet point and not the second, but src/port/win32pwrite.c is doing
> the second thing.

The WriteFile() and pwrite() implementation in win32pwrite.c are
changing the file pointer on Windows, see the following and [4] for
more details.

# Running: pg_basebackup --no-sync -cfast -D
C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/backup2
--no-manifest --waldir
C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/xlog2
pg_basebackup: offset_before 0 and offset_after 16777216 aren't the same
Assertion failed: offset_before == offset_after, file
../src/bin/pg_basebackup/walmethods.c, line 254

Irrespective of what Windows does with file pointers in WriteFile(),
should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
something like [5]? This is rather hackish without fully knowing what
Windows does internally in WriteFile(), but this does fix inherent
issues that our pwrite() callers (there are quite a number of places
that use pwrite() and presumes file pointer doesn't change on Windows)
may have on Windows. See the regression tests passing [6] with the fix
[5].

> Yet the following assertion added to Bharath's code
> fails[2]:

That was a very quick patch though, here's another version adding
offset checks and an assertion [3], see the assertion failures here
[4].

I also think that we need to discuss this issue separately.

Thoughts?

> [1] 
> https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position
> [2] 
> https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup
[3] - 
https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v2
[4] - 
https://api.cirrus-ci.com/v1/artifact/task/5294264635621376/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup
[5]
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 7f2e62e8a7..542b548279 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -37,5 +37,16 @@ pwrite(int fd, const void *buf, size_t size, off_t offset)
return -1;
}

+   /*
+* On Windows, it is found that WriteFile() changes the file
pointer and we
+* want pwrite() to not change. Hence, we explicitly reset the
file pointer
+* to beginning of the file.
+*/
+   if (lseek(fd, 0, SEEK_SET) != 0)
+   {
+   _dosmaperr(GetLastError());
+   return -1;
+   }
+
return result;
 }
[6] - 
https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v3

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-23 Thread Thomas Munro

On Sat, Sep 24, 2022 at 8:24 AM Nathan Bossart  wrote:
> +#ifdef WIN32
> +/*
> + * XXX: It looks like on Windows, we need an explicit lseek() call 
> here
> + * despite using pwrite() implementation from win32pwrite.c. 
> Otherwise
> + * an error occurs.
> + */
>
> I think this comment is too vague.  Can we describe the error in more
> detail?  Or better yet, can we fix it as a prerequisite to this patch set?

Although WriteFile() with a synchronous file handle and an explicit
offset doesn't use the current file position, it appears that it still
changes it.  :-(

You'd think from the documentation[1] that that isn't the case, because it says:

"Considerations for working with synchronous file handles:

 * If lpOverlapped is NULL, the write operation starts at the current
file position and WriteFile does not return until the operation is
complete, and the system updates the file pointer before WriteFile
returns.

 * If lpOverlapped is not NULL, the write operation starts at the
offset that is specified in the OVERLAPPED structure and WriteFile
does not return until the write operation is complete. The system
updates the OVERLAPPED Internal and InternalHigh fields before
WriteFile returns."

So it's explicitly saying the file pointer is updated in the first
bullet point and not the second, but src/port/win32pwrite.c is doing
the second thing.  Yet the following assertion added to Bharath's code
fails[2]:

+++ b/src/bin/pg_basebackup/walmethods.c
@@ -221,6 +221,10 @@ dir_open_for_write(WalWriteMethod *wwmethod,
const char *pathname,
if (pad_to_size && wwmethod->compression_algorithm ==
PG_COMPRESSION_NONE)
{
ssize_t rc;
+   off_t before_offset;
+   off_t after_offset;
+
+   before_offset = lseek(fd, 0, SEEK_CUR);

rc = pg_pwritev_zeros(fd, pad_to_size);

@@ -231,6 +235,9 @@ dir_open_for_write(WalWriteMethod *wwmethod, const
char *pathname,
return NULL;
}

+   after_offset = lseek(fd, 0, SEEK_CUR);
+   Assert(before_offset == after_offset);
+

[1] 
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position
[2] 
https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-23 Thread Nathan Bossart

+PGAlignedXLogBlock zbuffer;
+
+memset(zbuffer.data, 0, XLOG_BLCKSZ);

This seems excessive for only writing a single byte.

+#ifdef WIN32
+/*
+ * XXX: It looks like on Windows, we need an explicit lseek() call here
+ * despite using pwrite() implementation from win32pwrite.c. Otherwise
+ * an error occurs.
+ */

I think this comment is too vague.  Can we describe the error in more
detail?  Or better yet, can we fix it as a prerequisite to this patch set?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-23 Thread Bharath Rupireddy

On Wed, Sep 21, 2022 at 4:30 AM Nathan Bossart  wrote:
>
> 0001 looks reasonable to me.

Thanks for reviewing.

> +errno = 0;
> +rc = pg_pwritev_zeros(fd, pad_to_size);
>
> Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
> appropriately.

Right, pg_pwritev_zeros(), (rather pg_pwritev_with_retry() ensures
that pwritev() or pwrite()) sets the correct errno, please see
Thomas's comments [1] as well. Removed it.

> +/*
> + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
> + * writing more bytes per pg_pwritev_with_retry() call is proven to be more
> + * performant.
> + */
> +#define PWRITEV_BLCKSZ  XLOG_BLCKSZ
>
> This seems like something we should sort out now instead of leaving as
> future work.  Given your recent note, I think we should just use
> XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
> findings with different buffer sizes.

Agreed. Removed the new structure and added a comment.

Another change that I had to do was to allow lseek() even after
pwrite() (via pg_pwritev_zeros()) on Windows in walmethods.c. Without
this, the regression tests start to fail on Windows. And I figured out
that it's not an issue with this patch, it looks like an issue with
pwrite() implementation in win32pwrite.c, see the failure here [2], I
plan to start a separate thread to discuss this.

Please review the attached v4 patch set further.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw%40mail.gmail.com
[2] 
https://github.com/BRupireddy/postgres/tree/use_pwrite_without_lseek_on_windiws

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From c60026519becfc513dc06ddf889caa3b4b0fade3 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 23 Sep 2022 00:47:56 +
Subject: [PATCH v4] Move pg_pwritev_with_retry() to file_utils.c

Move pg_pwritev_with_retry() from file/fd.c to common/file_utils.c
so that non-backend code or FRONTEND cases can also use it.
---
 src/backend/storage/file/fd.c   | 65 -
 src/common/file_utils.c | 65 +
 src/include/common/file_utils.h |  7 
 src/include/storage/fd.h|  6 ---
 4 files changed, 72 insertions(+), 71 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 073dab2be5..75ba178dfb 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -93,7 +93,6 @@
 #include "common/pg_prng.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "port/pg_iovec.h"
 #include "portability/mem.h"
 #include "postmaster/startup.h"
 #include "storage/fd.h"
@@ -3738,67 +3737,3 @@ data_sync_elevel(int elevel)
 {
 	return data_sync_retry ? elevel : PANIC;
 }
-
-/*
- * A convenience wrapper for pwritev() that retries on partial write.  If an
- * error is returned, it is unspecified how much has been written.
- */
-ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
-{
-	struct iovec iov_copy[PG_IOV_MAX];
-	ssize_t		sum = 0;
-	ssize_t		part;
-
-	/* We'd better have space to make a copy, in case we need to retry. */
-	if (iovcnt > PG_IOV_MAX)
-	{
-		errno = EINVAL;
-		return -1;
-	}
-
-	for (;;)
-	{
-		/* Write as much as we can. */
-		part = pwritev(fd, iov, iovcnt, offset);
-		if (part < 0)
-			return -1;
-
-#ifdef SIMULATE_SHORT_WRITE
-		part = Min(part, 4096);
-#endif
-
-		/* Count our progress. */
-		sum += part;
-		offset += part;
-
-		/* Step over iovecs that are done. */
-		while (iovcnt > 0 && iov->iov_len <= part)
-		{
-			part -= iov->iov_len;
-			++iov;
-			--iovcnt;
-		}
-
-		/* Are they all done? */
-		if (iovcnt == 0)
-		{
-			/* We don't expect the kernel to write more than requested. */
-			Assert(part == 0);
-			break;
-		}
-
-		/*
-		 * Move whatever's left to the front of our mutable copy and adjust
-		 * the leading iovec.
-		 */
-		Assert(iovcnt > 0);
-		memmove(iov_copy, iov, sizeof(*iov) * iovcnt);
-		Assert(iov->iov_len > part);
-		iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part;
-		iov_copy[0].iov_len -= part;
-		iov = iov_copy;
-	}
-
-	return sum;
-}
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index df4d6d240c..4af216e56c 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -28,6 +28,7 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "port/pg_iovec.h"
 
 #ifdef FRONTEND
 
@@ -460,3 +461,67 @@ get_dirent_type(const char *path,
 
 	return result;
 }
+
+/*
+ * A convenience wrapper for pwritev() that retries on partial write.  If an
+ * error is returned, it is unspecified how much has been written.
+ */
+ssize_t
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+	struct iovec iov_copy[PG_IOV_MAX];
+	ssize_t		sum = 0;
+	ssize_t		part;
+
+	/* We'd better have space to make a copy, in case we need to

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-20 Thread Nathan Bossart

On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote:
> On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:
>> I'm attaching v5 patch-set. I've addressed review comments received so
>> far and fixed a compiler warning that CF bot complained about.
>> 
>> Please review it further.
> 
> 0001 looks reasonable to me.
> 
> +errno = 0;
> +rc = pg_pwritev_zeros(fd, pad_to_size);
> 
> Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
> appropriately.
> 
> +/*
> + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
> + * writing more bytes per pg_pwritev_with_retry() call is proven to be more
> + * performant.
> + */
> +#define PWRITEV_BLCKSZ  XLOG_BLCKSZ
> 
> This seems like something we should sort out now instead of leaving as
> future work.  Given your recent note, I think we should just use
> XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
> findings with different buffer sizes.

I also noticed that the latest patch set no longer applies, so I've marked
the commitfest entry as waiting-on-author.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-09-20 Thread Nathan Bossart

On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:
> I'm attaching v5 patch-set. I've addressed review comments received so
> far and fixed a compiler warning that CF bot complained about.
> 
> Please review it further.

0001 looks reasonable to me.

+errno = 0;
+rc = pg_pwritev_zeros(fd, pad_to_size);

Do we need to reset errno?  pg_pwritev_zeros() claims to set errno
appropriately.

+/*
+ * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
+ * writing more bytes per pg_pwritev_with_retry() call is proven to be more
+ * performant.
+ */
+#define PWRITEV_BLCKSZ  XLOG_BLCKSZ

This seems like something we should sort out now instead of leaving as
future work.  Given your recent note, I think we should just use
XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
findings with different buffer sizes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-09 Thread Bharath Rupireddy

On Mon, Aug 8, 2022 at 6:10 PM Bharath Rupireddy
 wrote:
>
> I played with a simple insert use-case [1] that generates ~380 WAL
> files, with different block sizes. To my surprise, I have not seen any
> improvement with larger block sizes. I may be doing something wrong
> here, suggestions on to test and see the benefits are welcome.
>
> > > I think this should also handle the remainder after processing whole
> > > blocks, just for completeness.  If I call the code as presented with size
> > > 8193, I think this code will only write 8192 bytes.
> >
> > Hm, I will fix it.
>
> Fixed.
>
> I'm attaching v5 patch-set. I've addressed review comments received so
> far and fixed a compiler warning that CF bot complained about.
>
> Please review it further.

I tried to vary the zero buffer size to see if there's any huge
benefit for the WAL-generating queries. Unfortunately, I didn't see
any benefit on my dev system (16 vcore, 512GB SSD, 32GB RAM) . The use
case I've tried is at [1] and the results are at [2].

Having said that, the use of pg_pwritev_with_retry() in walmethods.c
will definitely reduce number of system calls - on HEAD the
dir_open_for_write() makes pad_to_size/XLOG_BLCKSZ i.e. 16MB/8KB =
2,048 write() calls and with patch it makes only 64
pg_pwritev_with_retry() calls with XLOG_BLCKSZ zero buffer size. The
proposed patches will provide straight 32x reduction in system calls
(for pg_receivewal and pg_basebackup) apart from the safety against
partial writes.

[1]
/* built source code with release flags */
./configure --with-zlib --enable-depend --prefix=$PWD/inst/
--with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2'
> install.log && make -j 8 install > install.log 2>&1 &

\q
./pg_ctl -D data -l logfile stop
rm -rf data

/* ensured that nothing exists in OS page cache */
free -m
sudo su
sync; echo 3 > /proc/sys/vm/drop_caches
exit
free -m

./initdb -D data
./pg_ctl -D data -l logfile start
./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "64GB";'
./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";'
./psql -d postgres -c 'ALTER SYSTEM SET work_mem = "16MB";'
./psql -d postgres -c 'ALTER SYSTEM SET checkpoint_timeout = "1d";'
./pg_ctl -D data -l logfile restart
./psql -d postgres -c 'create table foo(bar int);'
./psql -d postgres
\timing
insert into foo select * from generate_series(1, 1);   /* this
query generates about 385 WAL files, no checkpoint hence no recycle of
old WAL files, all new WAL files */

[2]
HEAD
Time: 84249.535 ms (01:24.250)

HEAD with wal_init_zero off
Time: 75086.300 ms (01:15.086)

#definePWRITEV_BLCKSZXLOG_BLCKSZ
Time: 85254.302 ms (01:25.254)

#definePWRITEV_BLCKSZ(4 * XLOG_BLCKSZ)
Time: 83542.885 ms (01:23.543)

#definePWRITEV_BLCKSZ(16 * XLOG_BLCKSZ)
Time: 84035.770 ms (01:24.036)

#definePWRITEV_BLCKSZ(64 * XLOG_BLCKSZ)
Time: 84749.021 ms (01:24.749)

#definePWRITEV_BLCKSZ(256 * XLOG_BLCKSZ)
Time: 84273.466 ms (01:24.273)

#definePWRITEV_BLCKSZ(512 * XLOG_BLCKSZ)
Time: 84233.576 ms (01:24.234)

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-08 Thread Bharath Rupireddy

On Sun, Aug 7, 2022 at 9:22 PM Bharath Rupireddy
 wrote:
>
> On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro  wrote:
> >
> > > A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> > > designed to work on WAL segments initialization and it uses
> > > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> > > in its name that tells us so.  This makes me question whether
> > > file_utils.c is a good location for this second thing.  Could a new
> > > file be a better location?  We have a xlogutils.c in the backend, and
> > > a name similar to that in src/common/ would be one possibility.
> >
> > Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
> > maybe it's OK to use BLCKSZ with a comment to say that it's a bit
> > arbitrary, or maybe it's better to define a new zero buffer of some
> > arbitrary size just in this code if that is too strange.  We could
> > experiment with different size buffers to see how it performs, bearing
> > in mind that every time we double it you halve the number of system
> > calls, but also bearing in mind that at some point it's too much for
> > the stack.  I can tell you that the way that code works today was not
> > really written with performance in mind (unlike, say, the code
> > reverted from 9.4 that tried to do this with posix_fallocate()), it
> > was just finding an excuse to call pwritev(), to exercise new fallback
> > code being committed for use by later AIO stuff (more patches coming
> > soon).  The retry support was added because it seemed plausible that
> > some system out there would start to do short writes as we cranked up
> > the sizes for some implementation reason other than ENOSPC, so we
> > should make a reusable retry routine.
>
> Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ
> reduces the system calls to half (right now, pg_pwritev_with_retry()
> gets called 64 times per 16MB WAL file, it writes in the batches of 32
> blocks per call).
>
> Is there a ready-to-use tool or script or specific settings for
> pgbench (pgbench command line options or GUC settings) that I can play
> with to measure the performance?

I played with a simple insert use-case [1] that generates ~380 WAL
files, with different block sizes. To my surprise, I have not seen any
improvement with larger block sizes. I may be doing something wrong
here, suggestions on to test and see the benefits are welcome.

> > I think this should also handle the remainder after processing whole
> > blocks, just for completeness.  If I call the code as presented with size
> > 8193, I think this code will only write 8192 bytes.
>
> Hm, I will fix it.

Fixed.

I'm attaching v5 patch-set. I've addressed review comments received so
far and fixed a compiler warning that CF bot complained about.

Please review it further.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


v3-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patch
Description: Binary data


v3-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch
Description: Binary data

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-07 Thread Bharath Rupireddy

On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro  wrote:
>
> > A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> > designed to work on WAL segments initialization and it uses
> > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> > in its name that tells us so.  This makes me question whether
> > file_utils.c is a good location for this second thing.  Could a new
> > file be a better location?  We have a xlogutils.c in the backend, and
> > a name similar to that in src/common/ would be one possibility.
>
> Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
> maybe it's OK to use BLCKSZ with a comment to say that it's a bit
> arbitrary, or maybe it's better to define a new zero buffer of some
> arbitrary size just in this code if that is too strange.  We could
> experiment with different size buffers to see how it performs, bearing
> in mind that every time we double it you halve the number of system
> calls, but also bearing in mind that at some point it's too much for
> the stack.  I can tell you that the way that code works today was not
> really written with performance in mind (unlike, say, the code
> reverted from 9.4 that tried to do this with posix_fallocate()), it
> was just finding an excuse to call pwritev(), to exercise new fallback
> code being committed for use by later AIO stuff (more patches coming
> soon).  The retry support was added because it seemed plausible that
> some system out there would start to do short writes as we cranked up
> the sizes for some implementation reason other than ENOSPC, so we
> should make a reusable retry routine.

Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ
reduces the system calls to half (right now, pg_pwritev_with_retry()
gets called 64 times per 16MB WAL file, it writes in the batches of 32
blocks per call).

Is there a ready-to-use tool or script or specific settings for
pgbench (pgbench command line options or GUC settings) that I can play
with to measure the performance?

> I think this should also handle the remainder after processing whole
> blocks, just for completeness.  If I call the code as presented with size
> 8193, I think this code will only write 8192 bytes.

Hm, I will fix it.

> I think if this ever needs to work on O_DIRECT files there would be an
> alignment constraint on the buffer and size, but I don't think we have
> to worry about that for now.

We can add a comment about the above limitation, if required.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-07 Thread Thomas Munro

On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier  wrote:
> FWIW, when it comes to that we have a couple of routines that just use
> '0' to mean such a thing, aka palloc0().  I find 0002 confusing, as it
> introduces in fe_utils.c a new wrapper
> (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
> (pg_pwritev_with_retry) for pwrite().

What about pg_write_initial_zeros(fd, size)?  How it writes zeros (ie
using vector I/O, and retrying) seems to be an internal detail, no?
"initial" to make it clearer that it's at offset 0, or maybe
pg_pwrite_zeros(fd, size, offset).

> A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> designed to work on WAL segments initialization and it uses
> XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> in its name that tells us so.  This makes me question whether
> file_utils.c is a good location for this second thing.  Could a new
> file be a better location?  We have a xlogutils.c in the backend, and
> a name similar to that in src/common/ would be one possibility.

Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
maybe it's OK to use BLCKSZ with a comment to say that it's a bit
arbitrary, or maybe it's better to define a new zero buffer of some
arbitrary size just in this code if that is too strange.  We could
experiment with different size buffers to see how it performs, bearing
in mind that every time we double it you halve the number of system
calls, but also bearing in mind that at some point it's too much for
the stack.  I can tell you that the way that code works today was not
really written with performance in mind (unlike, say, the code
reverted from 9.4 that tried to do this with posix_fallocate()), it
was just finding an excuse to call pwritev(), to exercise new fallback
code being committed for use by later AIO stuff (more patches coming
soon).  The retry support was added because it seemed plausible that
some system out there would start to do short writes as we cranked up
the sizes for some implementation reason other than ENOSPC, so we
should make a reusable retry routine.

I think this should also handle the remainder after processing whole
blocks, just for completeness.  If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

I think if this ever needs to work on O_DIRECT files there would be an
alignment constraint on the buffer and size, but I don't think we have
to worry about that for now.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-07 Thread Michael Paquier

On Sun, Aug 07, 2022 at 10:41:49AM +0530, Bharath Rupireddy wrote:
> Agree. I separated out the changes.

+
+/*
+ * A convenience wrapper for pwritev() that retries on partial write.  If an
+ * error is returned, it is unspecified how much has been written.
+ */
+ssize_t
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t 
offset)

If moving this routine, this could use a more explicit description,
especially on errno, for example, that could be consumed by the caller
on failure to know what's happening. 

>> +/*
>> + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the 
>> given
>> + * file of size total_sz in batches of size block_sz.
>> + */
>> +ssize_t
>> +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)
>>
>> Hmm, why not give it a proper name that says it writes zeroes?
> 
> Done.

FWIW, when it comes to that we have a couple of routines that just use
'0' to mean such a thing, aka palloc0().  I find 0002 confusing, as it
introduces in fe_utils.c a new wrapper
(pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
(pg_pwritev_with_retry) for pwrite().

A second thing is that pg_pwritev_with_retry_and_write_zeros() is
designed to work on WAL segments initialization and it uses
XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
in its name that tells us so.  This makes me question whether
file_utils.c is a good location for this second thing.  Could a new
file be a better location?  We have a xlogutils.c in the backend, and
a name similar to that in src/common/ would be one possibility.
--
Michael


signature.asc
Description: PGP signature

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-06 Thread Bharath Rupireddy

On Sun, Aug 7, 2022 at 7:43 AM Thomas Munro  wrote:
>
> On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy
>  wrote:
> > On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier  wrote:
> > Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
> > so that everyone can use it.
> >
> > > > Thoughts?
> > >
> > > Makes sense to me for the WAL segment pre-padding initialization, as
> > > we still want to point to the beginning of the segment after we are
> > > done with the pre-padding, and the code has an extra lseek().
> >
> > Thanks. I attached the v1 patch, please review it.
>
> Hi Bharath,
>
> +1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the
> commit message should say that is happening.  Maybe the move should
> even be in a separate patch (IMHO it's nice to separate mechanical
> change patches from new logic/work patches).

Agree. I separated out the changes.

> +/*
> + * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the 
> given
> + * file of size total_sz in batches of size block_sz.
> + */
> +ssize_t
> +pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)
>
> Hmm, why not give it a proper name that says it writes zeroes?

Done.

> Size arguments around syscall-like facilities are usually size_t.
>
> +memset(zbuffer.data, 0, block_sz);
>
> I believe the modern fashion as of a couple of weeks ago is to tell
> the compiler to zero-initialise.  Since it's a union you'd need
> designated initialiser syntax, something like zbuffer = { .data = {0}
> }?

Hm, but we have many places still using memset(). If we were to change
these syntaxes, IMO, it must be done separately.

> +iov[i].iov_base = zbuffer.data;
> +iov[i].iov_len = block_sz;
>
> How can it be OK to use caller supplied block_sz, when
> sizeof(zbuffer.data) is statically determined?  What is the point of
> this argument?

Yes, removed block_sz function parameter.

> -dir_data->lasterrno = errno;
> +/* If errno isn't set, assume problem is no disk space */
> +dir_data->lasterrno = errno ? errno : ENOSPC;
>
> This coding pattern is used in places where we blame short writes on
> lack of disk space without bothering to retry.  But the code used in
> this patch already handles short writes: it always retries, until
> eventually, if you really are out of disk space, you should get an
> actual ENOSPC from the operating system.  So I don't think the
> guess-it-must-be-ENOSPC technique applies here.

Done.

Thanks for reviewing. PSA v2 patch-set. Also,I added a CF entry
https://commitfest.postgresql.org/39/3803/ to give the patches a
chance to get tested.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
From c862c9d079c3f78d2eb0405d5c4addd93847f102 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 7 Aug 2022 04:40:10 +
Subject: [PATCH v2] Move pg_pwritev_with_retry() to file_utils.c

Move pg_pwritev_with_retry() from file/fd.c to common/file_utils.c
so that non-backend code or FRONTEND cases can also use it.
---
 src/backend/storage/file/fd.c   | 65 -
 src/common/file_utils.c | 65 +
 src/include/common/file_utils.h |  7 
 src/include/storage/fd.h|  6 ---
 4 files changed, 72 insertions(+), 71 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index efb34d4dcb..ac611b836f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -95,7 +95,6 @@
 #include "common/pg_prng.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "port/pg_iovec.h"
 #include "portability/mem.h"
 #include "postmaster/startup.h"
 #include "storage/fd.h"
@@ -3747,67 +3746,3 @@ data_sync_elevel(int elevel)
 {
 	return data_sync_retry ? elevel : PANIC;
 }
-
-/*
- * A convenience wrapper for pwritev() that retries on partial write.  If an
- * error is returned, it is unspecified how much has been written.
- */
-ssize_t
-pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
-{
-	struct iovec iov_copy[PG_IOV_MAX];
-	ssize_t		sum = 0;
-	ssize_t		part;
-
-	/* We'd better have space to make a copy, in case we need to retry. */
-	if (iovcnt > PG_IOV_MAX)
-	{
-		errno = EINVAL;
-		return -1;
-	}
-
-	for (;;)
-	{
-		/* Write as much as we can. */
-		part = pwritev(fd, iov, iovcnt, offset);
-		if (part < 0)
-			return -1;
-
-#ifdef SIMULATE_SHORT_WRITE
-		part = Min(part, 4096);
-#endif
-
-		/* Count our progress. */
-		sum += part;
-		offset += part;
-
-		/* Step over iovecs that are done. */
-		while (iovcnt > 0 && iov->iov_len <= part)
-		{
-			part -= iov->iov_len;
-			++iov;
-			--iovcnt;
-		}
-
-		/* Are they all done? */
-		if (iovcnt == 0)
-		{
-			/* We don't expect the kernel to write more than requested. */
-			Assert(part == 0);
-			break;
-		}
-
-		/*
-		 * Move whatever's left to the front of our mutable copy and adjust
-		 * the leading

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-06 Thread Thomas Munro

On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy
 wrote:
> On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier  wrote:
> Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
> so that everyone can use it.
>
> > > Thoughts?
> >
> > Makes sense to me for the WAL segment pre-padding initialization, as
> > we still want to point to the beginning of the segment after we are
> > done with the pre-padding, and the code has an extra lseek().
>
> Thanks. I attached the v1 patch, please review it.

Hi Bharath,

+1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the
commit message should say that is happening.  Maybe the move should
even be in a separate patch (IMHO it's nice to separate mechanical
change patches from new logic/work patches).

+/*
+ * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given
+ * file of size total_sz in batches of size block_sz.
+ */
+ssize_t
+pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)

Hmm, why not give it a proper name that says it writes zeroes?

Size arguments around syscall-like facilities are usually size_t.

+memset(zbuffer.data, 0, block_sz);

I believe the modern fashion as of a couple of weeks ago is to tell
the compiler to zero-initialise.  Since it's a union you'd need
designated initialiser syntax, something like zbuffer = { .data = {0}
}?

+iov[i].iov_base = zbuffer.data;
+iov[i].iov_len = block_sz;

How can it be OK to use caller supplied block_sz, when
sizeof(zbuffer.data) is statically determined?  What is the point of
this argument?

-dir_data->lasterrno = errno;
+/* If errno isn't set, assume problem is no disk space */
+dir_data->lasterrno = errno ? errno : ENOSPC;

This coding pattern is used in places where we blame short writes on
lack of disk space without bothering to retry.  But the code used in
this patch already handles short writes: it always retries, until
eventually, if you really are out of disk space, you should get an
actual ENOSPC from the operating system.  So I don't think the
guess-it-must-be-ENOSPC technique applies here.

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-06 Thread Bharath Rupireddy

On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier  wrote:
>
> On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote:
> > I noticed that dir_open_for_write() in walmethods.c uses write() for
> > WAL file initialization (note that this code is used by pg_receivewal
> > and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
> > XLogFileInitInternal() to avoid partial writes. Do we need to fix
> > this?
>
> 0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/

Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
so that everyone can use it.

> > Thoughts?
>
> Makes sense to me for the WAL segment pre-padding initialization, as
> we still want to point to the beginning of the segment after we are
> done with the pre-padding, and the code has an extra lseek().

Thanks. I attached the v1 patch, please review it.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


v1-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patch
Description: Binary data

Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-06 Thread Michael Paquier

On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote:
> I noticed that dir_open_for_write() in walmethods.c uses write() for
> WAL file initialization (note that this code is used by pg_receivewal
> and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
> XLogFileInitInternal() to avoid partial writes. Do we need to fix
> this?

0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/

> Thoughts?

Makes sense to me for the WAL segment pre-padding initialization, as
we still want to point to the beginning of the segment after we are
done with the pre-padding, and the code has an extra lseek().
--
Michael


signature.asc
Description: PGP signature

79 matches

Mail list logo