Re: O_DIRECT on macOS

2021-07-19 Thread Thomas Munro
On Tue, Jul 20, 2021 at 12:26 PM Tom Lane  wrote:
> > I can try that on the gcc farm in a bit.

Thanks!

> Hmm, it compiles cleanly, but something seems drastically wrong,
> because performance is just awful.  On the other hand, I don't
> know what sort of storage is underlying this instance, so maybe
> that's to be expected?

Ouch.  I assume this was without wal_method=minimal (or it'd have
reached the new code and failed completely, based on the pg_test_fsync
result).

> open_datasync  n/a*

I'm waiting for access, but I see from man pages that closed source
ZFS doesn't accept DIRECTIO_ON, so it may not be possible to see it
work on an all-ZFS system that you can't mount a new FS on.  Hmm.
Well, many OSes have file systems that can't do it (ext4 journal=data,
etc).  One problem is that we don't treat all OSes the same when
selecting wal_sync_method, even though O_DIRECT is complicated on many
OSes.  It would also be nice if the choice to use direct I/O were
independently controlled, and ... [trails off].  Alright, I'll leave
this on ice for now.




Re: O_DIRECT on macOS

2021-07-19 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> While I was here again, I couldn't resist trying to extend this to
>> Solaris, since it looked so easy.  I don't have access, but I tested
>> on Illumos by undefining O_DIRECT.  Thoughts?

> I can try that on the gcc farm in a bit.

Hmm, it compiles cleanly, but something seems drastically wrong,
because performance is just awful.  On the other hand, I don't
know what sort of storage is underlying this instance, so maybe
that's to be expected?  If I set fsync = off, the speed seems
comparable to what wrasse reports, but with fsync on it's like

test tablespace   ... ok87990 ms
parallel group (20 tests, in groups of 1):  boolean char name varchar text int2 
int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn 
regproc
 boolean  ... ok 3229 ms
 char ... ok 2758 ms
 name ... ok 2229 ms
 varchar  ... ok 7373 ms
 text ... ok  722 ms
 int2 ... ok  342 ms
 int4 ... ok 1303 ms
 int8 ... ok 1095 ms
 oid  ... ok 1086 ms
 float4   ... ok 6360 ms
 float8   ... ok 5224 ms
 bit  ... ok 6254 ms
 numeric  ... ok44304 ms
 txid ... ok  377 ms
 uuid ... ok 3946 ms
 enum ... ok33189 ms
 money... ok  622 ms
 rangetypes   ... ok17301 ms
 pg_lsn   ... ok  798 ms
 regproc  ... ok  145 ms

(I stopped running it at that point...)

Also, the results of pg_test_fsync seem wrong; it refuses to run
tests for the cases we're interested in:

$ pg_test_fsync 
5 seconds per test
DIRECTIO_ON supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  n/a*
fdatasync 8.324 ops/sec  120139 usecs/op
fsync 0.906 ops/sec  1103936 usecs/op
fsync_writethrough  n/a
open_sync  n/a*
* This file system and its mount options do not support direct
  I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  n/a*
fdatasync 7.329 ops/sec  136449 usecs/op
fsync 0.788 ops/sec  1269258 usecs/op
fsync_writethrough  n/a
open_sync  n/a*
* This file system and its mount options do not support direct
  I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
 1 * 16kB open_sync write  n/a*
 2 *  8kB open_sync writes n/a*
 4 *  4kB open_sync writes n/a*
 8 *  2kB open_sync writes n/a*
16 *  1kB open_sync writes n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close  16.388 ops/sec   61020 usecs/op
write, close, fsync   9.084 ops/sec  110082 usecs/op

Non-sync'ed 8kB writes:
write 39855.686 ops/sec  25 usecs/op



regards, tom lane




Re: O_DIRECT on macOS

2021-07-19 Thread Tom Lane
Thomas Munro  writes:
> While I was here again, I couldn't resist trying to extend this to
> Solaris, since it looked so easy.  I don't have access, but I tested
> on Illumos by undefining O_DIRECT.  Thoughts?

I can try that on the gcc farm in a bit.

regards, tom lane




Re: O_DIRECT on macOS

2021-07-19 Thread Thomas Munro
On Tue, Jul 20, 2021 at 2:13 AM Tom Lane  wrote:
> Hmm ... we used to have to avoid putting #if constructs in the arguments
> of macros (such as StaticAssertStmt).  Maybe that's not a thing anymore
> with C99, and in any case this whole stanza is fairly platform-specific
> so we may not run into a compiler that complains.  But my hindbrain wants
> to see this done with separate statements, eg
>
> #if defined(O_CLOEXEC)
> StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
>  "PG_O_DIRECT collides with O_CLOEXEC");
> #endif

Ok, done.

While I was here again, I couldn't resist trying to extend this to
Solaris, since it looked so easy.  I don't have access, but I tested
on Illumos by undefining O_DIRECT.  Thoughts?
From 3016ede1dfd972badac65954d6e908f77e3c134b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 19 Jul 2021 21:47:16 +
Subject: [PATCH 1/2] Support direct I/O on Solaris.

Extend the change done for macOS by commit 2dbe8905 to cover Solaris.
Note that this doesn't affect Illumos (it gained O_DIRECT).

Discussion: https://postgr.es/m/CA%2BhUKG%2BADiyyHe0cun2wfT%2BSVnFVqNYPxoO6J9zcZkVO7%2BNGig%40mail.gmail.com
---
 src/backend/storage/file/fd.c | 19 +++
 src/bin/pg_test_fsync/pg_test_fsync.c | 27 +--
 src/include/storage/fd.h  |  9 ++---
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5d5e8ae94e..78ac2caa8f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1057,11 +1057,13 @@ BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
 	int			fd;
 
 tryAgain:
-#ifdef PG_O_DIRECT_USE_F_NOCACHE
+#if defined(PG_O_DIRECT_USE_F_NOCACHE) || \
+	defined(PG_O_DIRECT_USE_DIRECTIO_ON)
 
 	/*
 	 * The value we defined to stand in for O_DIRECT when simulating it with
-	 * F_NOCACHE had better not collide with any of the standard flags.
+	 * an extra system call had better not collide with any of the standard
+	 * flags.
 	 */
 	StaticAssertStmt((PG_O_DIRECT &
 	  (O_APPEND |
@@ -1089,10 +1091,19 @@ tryAgain:
 
 	if (fd >= 0)
 	{
-#ifdef PG_O_DIRECT_USE_F_NOCACHE
+#if defined(PG_O_DIRECT_USE_F_NOCACHE) || \
+	defined(PG_O_DIRECT_USE_DIRECTIO_ON)
 		if (fileFlags & PG_O_DIRECT)
 		{
-			if (fcntl(fd, F_NOCACHE, 1) < 0)
+			int			rc;
+
+#if defined(PG_O_DIRECT_USE_F_NOCACHE)
+			rc = fcntl(fd, F_NOCACHE, 1);
+#endif
+#if defined(PG_O_DIRECT_USE_DIRECTIO_ON)
+			rc = directio(fd, DIRECTIO_ON);
+#endif
+			if (rc < 0)
 			{
 int			save_errno = errno;
 
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index fef31844fa..f5e3868c10 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -221,6 +221,8 @@ handle_args(int argc, char *argv[])
 	printf(_("O_DIRECT supported on this platform for open_datasync and open_sync.\n"));
 #elif defined(F_NOCACHE)
 	printf(_("F_NOCACHE supported on this platform for open_datasync and open_sync.\n"));
+#elif defined(DIRECTIO_ON)
+	printf(_("DIRECTIO_ON supported on this platform for open_datasync and open_sync.\n"));
 #else
 	printf(_("Direct I/O is not supported on this platform.\n"));
 #endif
@@ -271,14 +273,27 @@ open_direct(const char *path, int flags, mode_t mode)
 
 	fd = open(path, flags, mode);
 
-#if !defined(O_DIRECT) && defined(F_NOCACHE)
-	if (fd >= 0 && fcntl(fd, F_NOCACHE, 1) < 0)
+#if !defined(O_DIRECT) && \
+	(defined(F_NOCACHE) || \
+	 defined(DIRECTIO_ON))
+	if (fd >= 0)
 	{
-		int			save_errno = errno;
+		int			rc;
 
-		close(fd);
-		errno = save_errno;
-		return -1;
+#if defined(F_NOCACHE)
+		rc = fcntl(fd, F_NOCACHE, 1);
+#endif
+#if defined(DIRECTIO_ON)
+		rc = directio(fd, DIRECTIO_ON);
+#endif
+		if (rc < 0)
+		{
+			int			save_errno = errno;
+
+			close(fd);
+			errno = save_errno;
+			return -1;
+		}
 	}
 #endif
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2d843eb992..b04988f818 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,15 +82,18 @@ extern int	max_safe_fds;
 /*
  * O_DIRECT is not standard, but almost every Unix has it.  We translate it
  * to the appropriate Windows flag in src/port/open.c.  We simulate it with
- * fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper.  We use the name
- * PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good
- * idea on a Unix).
+ * extra calls on macOS and Solaris inside fd.c's open() wrapper.  We use the
+ * name PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a
+ * good idea on a Unix).
  */
 #if defined(O_DIRECT)
 #define		PG_O_DIRECT O_DIRECT
 #elif defined(F_NOCACHE)
 #define		PG_O_DIRECT 0x8000
 #define		PG_O_DIRECT_USE_F_NOCACHE
+#elif defined(DIRECTIO_ON)
+#define		PG_O_DIRECT 0x8000
+#define		PG_O_DIRECT_USE_DIRECTIO_ON
 #else
 #define		PG_O_DIRECT 0
 #endif
-- 
2.30.2

From 7c99892354500f0

Re: O_DIRECT on macOS

2021-07-19 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane  wrote:
>> prairiedog thinks that Assert is too optimistic about whether all
>> those flags exist.

> Fixed.

Hmm ... we used to have to avoid putting #if constructs in the arguments
of macros (such as StaticAssertStmt).  Maybe that's not a thing anymore
with C99, and in any case this whole stanza is fairly platform-specific
so we may not run into a compiler that complains.  But my hindbrain wants
to see this done with separate statements, eg

#if defined(O_CLOEXEC)
StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
 "PG_O_DIRECT collides with O_CLOEXEC");
#endif

regards, tom lane




Re: O_DIRECT on macOS

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 12:55 AM Thomas Munro 
wrote:
>
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane  wrote:
> > prairiedog thinks that Assert is too optimistic about whether all
> > those flags exist.
>
> Fixed.
>
> (Huh, I received no -committers email for 2dbe8905.)

It didn't show up in the archives, either. Neither did your follow-up
04cad8f7bc.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: O_DIRECT on macOS

2021-07-18 Thread Thomas Munro
On Mon, Jul 19, 2021 at 4:42 PM Tom Lane  wrote:
> prairiedog thinks that Assert is too optimistic about whether all
> those flags exist.

Fixed.

(Huh, I received no -committers email for 2dbe8905.)




Re: O_DIRECT on macOS

2021-07-18 Thread Tom Lane
Thomas Munro  writes:
> Agreed.  Pushed!

prairiedog thinks that Assert is too optimistic about whether all
those flags exist.

regards, tom lane




Re: O_DIRECT on macOS

2021-07-18 Thread Thomas Munro
On Tue, Jul 13, 2021 at 1:56 PM Andres Freund  wrote:
> On 2021-07-13 13:25:50 +1200, Thomas Munro wrote:
> > I'm planning to go with that idea (#1), if there are no objections.
>
> The only other viable approach I see is to completely separate our
> internal flag representation from the OS representation and do the whole
> mapping inside fd.c - but that seems like a too big hammer right now.

Agreed.  Pushed!

For the record, Solaris has directio() that could be handled the same
way.  I'm not planning to look into that myself, but patches welcome.
Illumos (née OpenSolaris) got with the programme and added O_DIRECT.
Of our 10-or-so target systems I guess that'd leave just HPUX (judging
by an old man page found on the web) and OpenBSD with no direct I/O
support.




Re: O_DIRECT on macOS

2021-07-12 Thread Andres Freund
Hi,

On 2021-07-13 13:25:50 +1200, Thomas Munro wrote:
> On Mon, May 31, 2021 at 10:29 AM Thomas Munro  wrote:
> > For comparison, here is my sketch of idea #1.  I pick an arbitrary
> > value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear
> > of breaking other code that might see it and try to pass it into
> > open()... for all I know, it might happen to match OS-internal value
> > O_NASAL_DEMONS), and statically assert that it doesn't collide with
> > standard flags we're using, and I strip it out of the flags I pass in
> > to open().  As I said, a bit icky, but it's a tiny and localised
> > patch, which is nice.
> 
> I'm planning to go with that idea (#1), if there are no objections.

The only other viable approach I see is to completely separate our
internal flag representation from the OS representation and do the whole
mapping inside fd.c - but that seems like a too big hammer right now.

Greetings,

Andres Freund




Re: O_DIRECT on macOS

2021-07-12 Thread Thomas Munro
On Mon, May 31, 2021 at 10:29 AM Thomas Munro  wrote:
> For comparison, here is my sketch of idea #1.  I pick an arbitrary
> value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear
> of breaking other code that might see it and try to pass it into
> open()... for all I know, it might happen to match OS-internal value
> O_NASAL_DEMONS), and statically assert that it doesn't collide with
> standard flags we're using, and I strip it out of the flags I pass in
> to open().  As I said, a bit icky, but it's a tiny and localised
> patch, which is nice.

I'm planning to go with that idea (#1), if there are no objections.




Re: O_DIRECT on macOS

2021-05-30 Thread Thomas Munro
On Mon, May 31, 2021 at 4:19 AM Justin Pryzby  wrote:
> Should there be an "else" to warn/error in the case that "direct" is requested
> but not supported?

The way we use O_DIRECT currently is extremely minimal, it's just "if
you've got it, we'll use it, but otherwise not complain", and I wasn't
trying to change that yet, but you're right that if we add explicit
GUCs to turn on direct I/O for WAL and data files we should definitely
not let you turn them on if we can't do it.




Re: O_DIRECT on macOS

2021-05-30 Thread Thomas Munro
On Mon, May 31, 2021 at 8:12 AM Andres Freund  wrote:
> On 2021-05-30 16:39:48 +1200, Thomas Munro wrote:
> > I thought about a few different ways to encapsulate this API
> > difference in PostgreSQL, and toyed with two:
> >
> > 1.  We could define our own fake O_DIRECT flag, and translate that to
> > the right thing inside BasicOpenFilePerm().  That seems a bit icky.
> > We'd have to be careful not to collide with system defined flags and
> > worry about changes.  We do that sort of thing for Windows, though
> > that's a bit different, there we translate *all* the flags from
> > POSIXesque to Windowsian.
> >
> > 2.  We could make an extended BasicOpenFilePerm() variant that takes a
> > separate boolean parameter for direct, so that we don't have to hijack
> > any flag space, but now we need new interfaces just to tolerate a
> > rather niche system.
>
> I don't think 2) really covers the problem on its own. It's fine for
> things that directly use BasicOpenFilePerm(), but what about "virtual
> file descriptors" (PathNameOpenFile())? I.e. what md.c et al use? There
> we need to store the fact that we want non-buffered IO as part of the
> vfd, otherwise we'll loose that information when re-opening the file
> later.

Right, a bit more API perturbation is required to traffic the separate
flags around for VFDs, which is all a bit unpleasant for a feature
that most people don't care about.

For comparison, here is my sketch of idea #1.  I pick an arbitrary
value to use as PG_O_DIRECT (I don't want to define O_DIRECT for fear
of breaking other code that might see it and try to pass it into
open()... for all I know, it might happen to match OS-internal value
O_NASAL_DEMONS), and statically assert that it doesn't collide with
standard flags we're using, and I strip it out of the flags I pass in
to open().  As I said, a bit icky, but it's a tiny and localised
patch, which is nice.

I also realised that it probably wasn't right to raise an ERROR, so in
this version I return -1 when fcntl() fails.


0001-Support-direct-I-O-on-macOS--idea1.patch
Description: Binary data


Re: O_DIRECT on macOS

2021-05-30 Thread Andres Freund
Hi,

Thanks for starting the discussion on this!

On 2021-05-30 16:39:48 +1200, Thomas Munro wrote:
> I thought about a few different ways to encapsulate this API
> difference in PostgreSQL, and toyed with two:
> 
> 1.  We could define our own fake O_DIRECT flag, and translate that to
> the right thing inside BasicOpenFilePerm().  That seems a bit icky.
> We'd have to be careful not to collide with system defined flags and
> worry about changes.  We do that sort of thing for Windows, though
> that's a bit different, there we translate *all* the flags from
> POSIXesque to Windowsian.
> 
> 2.  We could make an extended BasicOpenFilePerm() variant that takes a
> separate boolean parameter for direct, so that we don't have to hijack
> any flag space, but now we need new interfaces just to tolerate a
> rather niche system.

I don't think 2) really covers the problem on its own. It's fine for
things that directly use BasicOpenFilePerm(), but what about "virtual
file descriptors" (PathNameOpenFile())? I.e. what md.c et al use? There
we need to store the fact that we want non-buffered IO as part of the
vfd, otherwise we'll loose that information when re-opening the file
later.

Greetings,

Andres Freund




Re: O_DIRECT on macOS

2021-05-30 Thread Justin Pryzby
On Sun, May 30, 2021 at 04:39:48PM +1200, Thomas Munro wrote:

> +BasicOpenFilePermDirect(const char *fileName, int fileFlags, mode_t fileMode,
> +   bool direct)
> ...
> +#if !defined(O_DIRECT) && defined(F_NOCACHE)
> +   /* macOS requires an extra step. */
> +   if (direct && fcntl(fd, F_NOCACHE, 1) < 0)
> +   {
> +   int save_errno = errno;
> +
> +   close(fd);
> +   errno = save_errno;
> +   ereport(ERROR,
> +   (errcode_for_file_access(),
> +errmsg("could not disable kernel file 
> caching for file \"%s\": %m",
> +   fileName)));
> +   }
> +#endif

Should there be an "else" to warn/error in the case that "direct" is requested
but not supported?

-- 
Justin




O_DIRECT on macOS

2021-05-29 Thread Thomas Munro
Hi,

IRIX gave the world O_DIRECT, and then every Unix I've used followed
their lead except Apple's, which gave the world fcntl(fd, F_NOCACHE,
1).  From what I could find in public discussion, this API difference
may stem from the caching policy being controlled at the per-file
(vnode) level in older macOS (and perhaps ancestors), but since 10.4
it's per file descriptor, so approximately like O_DIRECT on other
systems.  The precise effects and constraints of O_DIRECT/F_NOCACHE
are different across operating systems and file systems in some subtle
and not-so-subtle ways, but the general concept is the same: try to
avoid buffering.

I thought about a few different ways to encapsulate this API
difference in PostgreSQL, and toyed with two:

1.  We could define our own fake O_DIRECT flag, and translate that to
the right thing inside BasicOpenFilePerm().  That seems a bit icky.
We'd have to be careful not to collide with system defined flags and
worry about changes.  We do that sort of thing for Windows, though
that's a bit different, there we translate *all* the flags from
POSIXesque to Windowsian.

2.  We could make an extended BasicOpenFilePerm() variant that takes a
separate boolean parameter for direct, so that we don't have to hijack
any flag space, but now we need new interfaces just to tolerate a
rather niche system.

Here's a draft patch like #2, just for discussion.  Better ideas?

The reason I want to get direct I/O working on this "client" OS is
because the AIO project will propose to use direct I/O for the buffer
pool as an option, and I would like Macs to be able to do that
primarily for the sake of developers trying out the patch set.  Based
on memories from the good old days of attending conferences, a decent
percentage of PostgreSQL developers are on Macs.

As it stands, the patch only actually has any effect if you set
wal_level=minimal and max_wal_senders=0, which is a configuration that
I guess almost no-one uses.  Otherwise xlog.c assumes that the
filesystem is going to be used for data exchange with replication
processes (something we should replace with WAL buffers in shmem some
time soon) so for now it's better to keep the data in page cache since
it'll be accessed again soon.

Unfortunately, this change makes pg_test_fsync show a very slightly
lower number for open_data_sync on my ancient Intel Mac, but
pg_test_fsync isn't really representative anymore since minimal
logging is by now unusual (I guess pg_test_fsync would ideally do the
test with and without direct to make that clearer).  Whether this is a
good option for the WAL is separate from whether it's a good option
for relation data (ie a way to avoid large scale double buffering, but
have new, different problems), and later patches will propose new
separate GUCs to control that.


0001-Support-direct-I-O-on-macOS.patch
Description: Binary data