Hi,

As a card-carrying Unix hacker, I think it'd be great to remove the
differences between platforms if possible using newer Windows
facilities, so everything just works the way we expect.  Two things
that stopped progress on that front in the past were (1) uncertainty
about OS versioning, fixed in v16 with an EOL purge, and (2)
uncertainty about what the new interfaces really do, due to lack of
good ways to test, which I'd like to address here.

My goals in this thread:

 * introduce a pattern/idiom for TAP-testing of low level C code
without a database server
 * demonstrate the behaviour of our filesystem porting code with full coverage
 * fix reported bugs in my recent symlink changes with coverage
 * understand the new "POSIX semantics" changes in Windows 10
 * figure out what our policy should be on "POSIX semantics"

For context, we have a bunch of stuff under src/port to provide
POSIX-like implementations of:

  open()*
  fstat(), stat()*, lstat()*
  link(), unlink()*, rename()*
  symlink(), readlink()
  opendir(), readdir(), closedir()
  pg_pwrite(), pg_pread()

These call equivalent Windows system interfaces so we can mostly just
write code that assumes the whole world is a POSIX.  Some of them also
deal with three special aspects of Windows file handles, which
occasionally cause trouble:

1.  errno == EACCES && GetLastError() == ERROR_SHARING_VIOLATION:
This happens if you try to access a file that has been opened by
without FILE_SHARE_ flags to allow concurrent access.  While our own
open() wrapper uses those flags, other programs might not.  The
wrapper functions marked with an asterix above deal with this
condition by sleeping or retrying for 10 or 30 seconds, in the hope
that the external program goes away.  AFAIK this problem will always
be with us.

2.  errno == EACCES && GetLastNtStatus() == STATUS_DELETE_PENDING:
This happens if you try to access a directory entry that is scheduled
for asynchronous unlink, but is still present until all handles to the
file are closed.  The wrapper functions above deal with this in
various different ways:

 open() without O_CREAT: -> ENOENT, so we can pretend that unlink()
calls are synchronous
 open() with O_CREAT: -> EEXIST, the zombie dirent wins
 stat(), lstat(): -> ENOENT
 unlink(), rename(): retry, same as we do for ERROR_SHARING_VIOLATION,
until timeout or asynchonous unlink completes (this may have been
unintentional due to same errno?)

3.  errno == EACCESS && <not sure>:  You can't MoveFileEx() on top of
a file that someone has open.

In Windows 10, a new "POSIX semantics" mode was added.  Yippee!
Victor Spirin proposed[1] that we use it several commitfests ago.
Interestingly, on some systems it is already partially activated
without any change on our part.  That is, on some systems, unlink()
works synchronously (when the call returns, the dirent is gone, even
if someone else has the file open, just like Unix).  Sounds great, but
in testing different Windows systems I have access to using the
attached test suite I found three different sets of behaviour:

 A) Using Windows unlink() and MoveFileEx() on Server 2019 (CI) I get
traditional STATUS_DELETE_PENDING problems
 B) Using Windows unlink()/MoveFileEx() on Windows 10 Home (local VM)
I get mostly POSIX behaviour, except problem (3) above, which you can
see in my test suite
 C) Using Windows new SetFileInformationByHandle() calls with explicit
request for POSIX semantics (this syscall is something like fcntl(), a
kitchen sink kernel interface, and is what unlink() and MoveFileEx()
and others things are built on, but if you do it yourself you can
reach more flags) I get full POSIX behaviour according to my test
suite, i.e. agreement with FreeBSD and Linux for the dirent-related
cases I've though about so far, on both of those Windows systems

It sounds like we want option C, as Victor proposed, but I'm not sure
what happens if you try to use it on a non-NTFS filesystem (does it
quietly fall back to non-POSIX semantics, or fail, or do all file
systems now support this?).  I'm also not sure if we really support
running on a non-NTFS filesystem, not being a Windows user myself.

So the questions I have are:

 * any thoughts on this C TAP testing system?
 * has anyone got a non-EOL'd OS version where this test suite fails?
 * has anyone got a relevant filesystem where this fails?  which way
do ReFS and SMB go?  do the new calls in 0010 just fail, and if so
with which code (ie could we add our own fallback path)?
 * which filesystems do we even claim to support?
 * if we switched to explicitly using POSIX-semantics like in the 0010
patch, I assume there would be nothing left in the build farm or CI
that tests the non-POSIX code paths (either in these tests or in the
real server code), and the non-POSIX support would decay into
non-working form pretty quickly
 * if there are any filesystems that don't support POSIX-semantics,
would we want to either (1) get such a thing into the build farm so
it's tested or (2) de-support non-POSIX-semantics filesystems by
edict, and drop a lot of code and problems that everyone hates?

Thanks to Andres for the 0002 meson support.  I have not yet written
autoconf support; I guess I'd have to do that.

You can run this with eg "meson test --suite=port -v" on any OS.  The
first test result tells you whether it detected POSIX semantics or
not, which affects later testing.  Unix systems are always detected as
POSIX, Windows 10+ systems are always POSIX if you apply the final
patch, but could be either depending on your Windows version if you
don't, except they still have the quirk about problem (3) above for
some reason, which is why the relevant test changes in the final
patch.

(Note: The 0010 patch fails on the CI CompilerCheck cross build, which
I think has to do with wanting _WIN32_WINNT >= 0xA02 to see some
definitions, not looked into yet, and I haven't thought much about
Cygwin, but I expect they turn on all the POSIX things under the
covers too.)

[1] https://commitfest.postgresql.org/40/3347/
From 463434b4162003fd3676b2cdd99b257075ea63a4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 11 Oct 2022 17:26:02 +1300
Subject: [PATCH 01/10] Add suite of macros for writing TAP tests in C.

Historically we had to load test modules into a PostgreSQL backend or
write an ad-hoc standalone program to test C code.  Let's provide a
convenient way to write stand-alone tests for low-level C code.

This commit defines a small set of macros that spit out results in TAP
format.  These can be improved and extended as required.
---
 src/include/lib/pg_tap.h | 138 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)
 create mode 100644 src/include/lib/pg_tap.h

diff --git a/src/include/lib/pg_tap.h b/src/include/lib/pg_tap.h
new file mode 100644
index 0000000000..3da46ac5d5
--- /dev/null
+++ b/src/include/lib/pg_tap.h
@@ -0,0 +1,138 @@
+/*
+ * Simple macros for writing tests in C that print results in TAP format,
+ * as consumed by "prove".
+ *
+ * Specification for the output format: https://testanything.org/
+ */
+
+#ifndef PG_TAP_H
+#define PG_TAP_H
+
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+/* Counters are global, so we can break our tests into multiple functions. */
+static int	pg_test_count;
+static int	pg_fail_count;
+static int	pg_todo_count;
+
+/*
+ * Require an expression to be true.  Used for set-up steps that are not
+ * reported as a test.
+ */
+#define PG_REQUIRE(expr) \
+if (!(expr)) { \
+	printf("Bail out! requirement (" #expr ") failed at %s:%d\n", \
+		__FILE__, __LINE__); \
+	exit(EXIT_FAILURE); \
+}
+
+/*
+ * Like PG_REQUIRE, but log strerror(errno) before bailing.
+ */
+#define PG_REQUIRE_SYS(expr) \
+if (!(expr)) { \
+	printf("Bail out! requirement (" #expr ") failed at %s:%d, error: %s\n", \
+		__FILE__, __LINE__, strerror(errno)); \
+	exit(EXIT_FAILURE); \
+}
+
+/*
+ * Test that an expression is true.  An optional message can be provided,
+ * defaulting to the expression itself if not provided.
+ */
+#define PG_EXPECT(expr, ...) \
+do { \
+	const char *messages[] = {#expr, __VA_ARGS__}; \
+	const char *message = messages[lengthof(messages) - 1]; \
+	pg_test_count++; \
+	if (expr) { \
+		printf("ok %d - %s\n", pg_test_count, message); \
+	} else { \
+		if (pg_todo_count == 0) \
+			pg_fail_count++; \
+		printf("not ok %d - %s (at %s:%d)%s\n", pg_test_count, \
+			message, __FILE__, __LINE__, \
+			pg_todo_count > 0 ? " # TODO" : ""); \
+	} \
+} while (0)
+
+/*
+ * Like PG_EXPECT(), but also log strerror(errno) on failure.
+ */
+#define PG_EXPECT_SYS(expr, ...) \
+do { \
+	const char *messages[] = {#expr, __VA_ARGS__}; \
+	const char *message = messages[lengthof(messages) - 1]; \
+	pg_test_count++; \
+	if (expr) { \
+		printf("ok %d - %s\n", pg_test_count, message); \
+	} else { \
+		if (pg_todo_count == 0) \
+			pg_fail_count++; \
+		printf("not ok %d - %s (at %s:%d), error: %s%s\n", pg_test_count, \
+			message, __FILE__, __LINE__, strerror(errno), \
+			pg_todo_count > 0 ? " # TODO" : ""); \
+	} \
+} while (0)
+
+
+/*
+ * Test that one int expression is equal to another, logging the values if not.
+ */
+#define PG_EXPECT_EQ(expr1, expr2, ...) \
+do { \
+	const char *messages[] = {#expr1 " == " #expr2, __VA_ARGS__}; \
+	const char *message = messages[lengthof(messages) - 1]; \
+	int expr1_val = (expr1); \
+	int expr2_val = (expr2); \
+	pg_test_count++; \
+	if (expr1_val == expr2_val) { \
+		printf("ok %d - %s\n", pg_test_count, message); \
+	} else { \
+		if (pg_todo_count == 0) \
+			pg_fail_count++; \
+		printf("not ok %d - %s: %d != %d (at %s:%d)%s\n", pg_test_count, \
+			message, expr1_val, expr2_val, __FILE__, __LINE__, \
+			pg_todo_count > 0 ? " # TODO" : ""); \
+	} \
+} while (0)
+
+/*
+ * Test that one C string expression is equal to another, logging the values if
+ * not.
+ */
+#define PG_EXPECT_EQ_STR(expr1, expr2, ...) \
+do { \
+	const char *messages[] = {#expr1 " matches " #expr2, __VA_ARGS__}; \
+	const char *message = messages[lengthof(messages) - 1]; \
+	const char *expr1_val = (expr1); \
+	const char *expr2_val = (expr2); \
+	pg_test_count++; \
+	if (strcmp(expr1_val, expr2_val) == 0) { \
+		printf("ok %d - %s\n", pg_test_count, message); \
+	} else { \
+		if (pg_todo_count == 0) \
+			pg_fail_count++; \
+		printf("not ok %d - %s: \"%s\" vs \"%s\" (at %s:%d) %s\n", \
+			pg_test_count, \
+			message, expr1_val, expr2_val, __FILE__, __LINE__, \
+			pg_todo_count > 0 ? " # TODO" : ""); \
+	} \
+} while (0)
+
+/*
+ * The main function of a test program should begin and end with these
+ * functions.
+ */
+#define PG_BEGIN_TESTS() \
+	setbuf(stdout, NULL);		/* disable buffering for Windows */
+
+#define PG_END_TESTS() printf("1..%d\n", pg_test_count); \
+	return EXIT_SUCCESS
+
+#define PG_BEGIN_TODO() pg_todo_count++
+#define PG_END_TODO() pg_todo_count--
+
+#endif
-- 
2.37.3

From a6227695f0e8c2effb5b62590e02726e2f8cae19 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Tue, 11 Oct 2022 10:49:15 -0700
Subject: [PATCH 02/10] meson: Add infrastructure for TAP tests written in C.

Allow t/XXX.c files to be used as t/XXX.pl files normally are.

Author: Andres Freund <and...@anarazel.de>
---
 meson.build                           | 43 +++++++++++++++++++++++----
 src/interfaces/libpq/test/meson.build |  8 ++---
 src/tools/testwrap                    | 11 +++++--
 3 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/meson.build b/meson.build
index 2d225f706d..ada774da84 100644
--- a/meson.build
+++ b/meson.build
@@ -2560,6 +2560,10 @@ default_bin_args = default_target_args + {
   'install_rpath': ':'.join(bin_install_rpaths),
 }
 
+test_bin_args = default_target_args + {
+  'install': false,
+}
+
 
 
 # Helper for exporting a limited number of symbols
@@ -2919,7 +2923,14 @@ foreach test_dir : tests
     endif
 
     t = test_dir[kind]
+    env = test_env
 
+    # Add environment vars from the test specification
+    foreach name, value : t.get('env', {})
+      env.set(name, value)
+    endforeach
+
+    # pg_regress style tests
     if kind in ['regress', 'isolation', 'ecpg']
       if kind == 'regress'
         runner = pg_regress
@@ -2953,7 +2964,6 @@ foreach test_dir : tests
         test_command += t.get('sql', [])
       endif
 
-      env = test_env
       env.prepend('PATH', temp_install_bindir, test_dir['bd'])
 
       test_kwargs = {
@@ -2974,6 +2984,8 @@ foreach test_dir : tests
       )
 
       testport += 1
+
+    # perl tap tests
     elif kind == 'tap'
       if not tap_tests_enabled
         continue
@@ -2987,13 +2999,8 @@ foreach test_dir : tests
 
       # Add temporary install, the build directory for non-installed binaries and
       # also test/ for non-installed test binaries built separately.
-      env = test_env
       env.prepend('PATH', temp_install_bindir, test_dir['bd'], test_dir['bd'] / 'test')
 
-      foreach name, value : t.get('env', {})
-        env.set(name, value)
-      endforeach
-
       test_kwargs = {
         'protocol': 'tap',
         'suite': [test_dir['name']],
@@ -3022,6 +3029,30 @@ foreach test_dir : tests
           ],
         )
       endforeach
+
+    # tap tests using C executables
+    elif kind == 'exec_tap'
+      # Don't have a dependency on perl tap test infrastructure, so we can
+      # test even when not tap_test_enabled.
+      test_kwargs = {
+        'protocol': 'tap',
+        'suite': [test_dir['name']],
+        'timeout': 1000,
+        'env': env,
+      } + t.get('test_kwargs', {})
+
+      foreach onetap : t['tests']
+        test_name = onetap.name()
+        test(test_dir['name'] / test_name,
+          python,
+          kwargs: test_kwargs,
+          depends: test_deps + t.get('deps', []) + [onetap],
+          args: testwrap_base + [
+            '--testname', test_name,
+            '--', onetap.full_path(),
+          ],
+        )
+      endforeach
     else
       error('unknown kind @0@ of test in @1@'.format(kind, test_dir['sd']))
     endif
diff --git a/src/interfaces/libpq/test/meson.build b/src/interfaces/libpq/test/meson.build
index 017f729d43..06e026e400 100644
--- a/src/interfaces/libpq/test/meson.build
+++ b/src/interfaces/libpq/test/meson.build
@@ -11,9 +11,7 @@ endif
 executable('libpq_uri_regress',
   libpq_uri_regress_sources,
   dependencies: [frontend_code, libpq],
-  kwargs: default_bin_args + {
-    'install': false,
-  }
+  kwargs: test_bin_args,
 )
 
 
@@ -30,7 +28,5 @@ endif
 executable('libpq_testclient',
   libpq_testclient_sources,
   dependencies: [frontend_code, libpq],
-  kwargs: default_bin_args + {
-    'install': false,
-  }
+  kwargs: default_bin_args,
 )
diff --git a/src/tools/testwrap b/src/tools/testwrap
index 7a64fe76a2..5ca49b1fac 100755
--- a/src/tools/testwrap
+++ b/src/tools/testwrap
@@ -32,9 +32,16 @@ os.chdir(args.srcdir)
 # mark test as having started
 open(os.path.join(testdir, 'test.start'), 'x')
 
+testdatadir = os.path.join(testdir, 'data')
+testlogdir = os.path.join(testdir, 'log')
+
+# pre-create dirs so that the test's infrastructure doesn't have to
+os.makedirs(testdatadir)
+os.makedirs(testlogdir)
+
 env_dict = {**os.environ,
-            'TESTDATADIR': os.path.join(testdir, 'data'),
-            'TESTLOGDIR': os.path.join(testdir, 'log')}
+            'TESTDATADIR': testdatadir,
+            'TESTLOGDIR': testlogdir}
 
 sp = subprocess.run(args.test_command, env=env_dict)
 
-- 
2.37.3

From 309577d0c40a12dfe2c3a19cca322f69a81bce45 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 12 Oct 2022 16:34:09 +1300
Subject: [PATCH 03/10] Fix symlink() errno in Windows replacement code.

Ancient bug noticed while adding tests for these functions.
---
 src/port/dirmod.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index ae6301dd6c..51c9bded8f 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -197,7 +197,10 @@ pgsymlink(const char *oldpath, const char *newpath)
 						   FILE_FLAG_OPEN_REPARSE_POINT | FILE_FLAG_BACKUP_SEMANTICS, 0);
 
 	if (dirhandle == INVALID_HANDLE_VALUE)
+	{
+		_dosmaperr(GetLastError());
 		return -1;
+	}
 
 	/* make sure we have an unparsed native win32 path */
 	if (memcmp("\\??\\", oldpath, 4) != 0)
@@ -230,8 +233,11 @@ pgsymlink(const char *oldpath, const char *newpath)
 						 0, 0, &len, 0))
 	{
 		LPSTR		msg;
+		int			save_errno;
+
+		_dosmaperr(GetLastError());
+		save_errno = errno;
 
-		errno = 0;
 		FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER |
 					  FORMAT_MESSAGE_IGNORE_INSERTS |
 					  FORMAT_MESSAGE_FROM_SYSTEM,
@@ -251,6 +257,9 @@ pgsymlink(const char *oldpath, const char *newpath)
 
 		CloseHandle(dirhandle);
 		RemoveDirectory(newpath);
+
+		errno = save_errno;
+
 		return -1;
 	}
 
-- 
2.37.3

From eda707444682c138f7b182c94198ae04bb8b84bd Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 12 Oct 2022 17:22:53 +1300
Subject: [PATCH 04/10] Fix readlink() return value on Windows.

It accidentally returned a value that counted the trailing nul
terminator.

This is an ancient bug, but it's benign.
---
 src/port/dirmod.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 51c9bded8f..5881024929 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -359,6 +359,9 @@ pgreadlink(const char *path, char *buf, size_t size)
 		return -1;
 	}
 
+	/* r includes the nul terminator */
+	r -= 1;
+
 	/*
 	 * If the path starts with "\??\", which it will do in most (all?) cases,
 	 * strip those out.
-- 
2.37.3

From ae795596e02fe4dd8a52dfc11bd026f4b1a5e343 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 11 Oct 2022 17:26:02 +1300
Subject: [PATCH 05/10] Add tests for Windows filesystem code in src/port.

The wrappers we use for open(), unlink() etc had no direct testing.
These tests demonstrate the behavior of our Windows porting layer, with
two variants of underlying Windows behavior that are known to exist in
current versions of the operating system.

Since some of the functions contain internal sleep/retry loops, we need
a way to guarantee deterministic tests and perform asynchronous tasks,
so we also make the loop counts adjustable at runtime (only for use by
tests) and make pg_usleep() interruptable by Windows APC.
---
 meson.build                 |   1 +
 src/include/port.h          |   2 +
 src/port/dirmod.c           |   7 +-
 src/port/open.c             |   5 +-
 src/port/pgsleep.c          |   2 +-
 src/port/t/001_filesystem.c | 828 ++++++++++++++++++++++++++++++++++++
 src/port/t/meson.build      |  15 +
 7 files changed, 856 insertions(+), 4 deletions(-)
 create mode 100644 src/port/t/001_filesystem.c
 create mode 100644 src/port/t/meson.build

diff --git a/meson.build b/meson.build
index ada774da84..ca9b3ac43c 100644
--- a/meson.build
+++ b/meson.build
@@ -2777,6 +2777,7 @@ subdir('src')
 subdir('contrib')
 
 subdir('src/test')
+subdir('src/port/t')
 subdir('src/interfaces/libpq/test')
 subdir('src/interfaces/ecpg/test')
 
diff --git a/src/include/port.h b/src/include/port.h
index 69d8818d61..f81f8de2a3 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -269,6 +269,7 @@ extern int	pclose_check(FILE *stream);
 /*
  *	Win32 doesn't have reliable rename/unlink during concurrent access.
  */
+extern PGDLLEXPORT int pgwin32_dirmod_loops;
 extern int	pgrename(const char *from, const char *to);
 extern int	pgunlink(const char *path);
 
@@ -307,6 +308,7 @@ extern bool rmtree(const char *path, bool rmtopdir);
  * passing of other special options.
  */
 #define		O_DIRECT	0x80000000
+extern PGDLLEXPORT int pgwin32_open_handle_loops;
 extern HANDLE pgwin32_open_handle(const char *, int, bool);
 extern int	pgwin32_open(const char *, int,...);
 extern FILE *pgwin32_fopen(const char *, const char *);
diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 5881024929..3e1c78fae6 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -41,6 +41,9 @@
 
 #if defined(WIN32) || defined(__CYGWIN__)
 
+/* Externally visable only to allow testing. */
+int pgwin32_dirmod_loops = 100;
+
 /*
  *	pgrename
  */
@@ -84,7 +87,7 @@ pgrename(const char *from, const char *to)
 			return -1;
 #endif
 
-		if (++loops > 100)		/* time out after 10 sec */
+		if (++loops > pgwin32_dirmod_loops)		/* time out after 10 sec */
 			return -1;
 		pg_usleep(100000);		/* us */
 	}
@@ -137,7 +140,7 @@ pgunlink(const char *path)
 	{
 		if (errno != EACCES)
 			return -1;
-		if (++loops > 100)		/* time out after 10 sec */
+		if (++loops > pgwin32_dirmod_loops)		/* time out after 10 sec */
 			return -1;
 		pg_usleep(100000);		/* us */
 	}
diff --git a/src/port/open.c b/src/port/open.c
index fd4faf604e..279a0ac08d 100644
--- a/src/port/open.c
+++ b/src/port/open.c
@@ -25,6 +25,9 @@
 #include <assert.h>
 #include <sys/stat.h>
 
+/* This is exported only to support testing. */
+int pgwin32_open_handle_loops = 300;
+
 static int
 openFlagsToCreateFileFlags(int openFlags)
 {
@@ -118,7 +121,7 @@ pgwin32_open_handle(const char *fileName, int fileFlags, bool backup_semantics)
 						 errhint("You might have antivirus, backup, or similar software interfering with the database system.")));
 #endif
 
-			if (loops < 300)
+			if (loops < pgwin32_open_handle_loops)
 			{
 				pg_usleep(100000);
 				loops++;
diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index 03f0fac07b..9a37975c5d 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -53,7 +53,7 @@ pg_usleep(long microsec)
 		delay.tv_usec = microsec % 1000000L;
 		(void) select(0, NULL, NULL, NULL, &delay);
 #else
-		SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
+		SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), TRUE);
 #endif
 	}
 }
diff --git a/src/port/t/001_filesystem.c b/src/port/t/001_filesystem.c
new file mode 100644
index 0000000000..c2832b0a20
--- /dev/null
+++ b/src/port/t/001_filesystem.c
@@ -0,0 +1,828 @@
+/*
+ * Tests for our partial implementations of POSIX-style filesystem APIs, for
+ * Windows.
+ */
+
+#include "c.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/stat.h>
+
+#include "lib/pg_tap.h"
+#include "port/pg_iovec.h"
+
+/*
+ * Make a path under the tmp_check directory.  TESTDATADIR is expected to
+ * contain an absolute path.
+ */
+static void
+make_path(char *buffer, const char *name)
+{
+	const char *directory;
+
+	directory = getenv("TESTDATADIR");
+	PG_REQUIRE(directory);
+
+	snprintf(buffer, MAXPGPATH, "%s/%s", directory, name);
+}
+
+/*
+ * Check if a directory contains a given entry, according to readdir().
+ */
+static bool
+directory_contains(const char *directory_path, const char *name)
+{
+	char		directory_full_path[MAXPGPATH];
+	DIR		   *dir;
+	struct dirent *de;
+	bool		saw_name = false;
+
+	make_path(directory_full_path, directory_path);
+	PG_REQUIRE_SYS((dir = opendir(directory_full_path)));
+	errno = 0;
+	while ((de = readdir(dir)))
+	{
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+		if (strcmp(de->d_name, name) == 0)
+			saw_name = true;
+	}
+	PG_REQUIRE_SYS(errno == 0);
+	PG_REQUIRE_SYS(closedir(dir) == 0);
+
+	return saw_name;
+}
+
+#ifdef WIN32
+/*
+ * Make all slashes lean one way, to normalize paths for comparisons on Windows.
+ */
+static void
+normalize_slashes(char *path)
+{
+	while (*path)
+	{
+		if (*path == '/')
+			*path = '\\';
+		path++;
+	}
+}
+
+typedef struct apc_trampoline_data
+{
+	HANDLE		timer;
+	void		(*function) (void *);
+	void	   *data;
+}			apc_trampoline_data;
+
+static void CALLBACK
+apc_trampoline(void *data, DWORD low, DWORD high)
+{
+	apc_trampoline_data *x = data;
+
+	x->function(x->data);
+	CloseHandle(x->timer);
+	free(data);
+}
+
+/*
+ * Schedule a Windows APC callback.  The pg_usleep() call inside the
+ * sleep/retry loops in the wrappers under test is interruptible by APCs,
+ * causing it to run the procedure and return early.
+ */
+static void
+run_async_procedure_after_delay(void (*p) (void *), void *data, int milliseconds)
+{
+	LARGE_INTEGER delay_time = {.LowPart = -10 * milliseconds};
+	apc_trampoline_data *x;
+	HANDLE		timer;
+
+	timer = CreateWaitableTimer(NULL, false, NULL);
+	PG_REQUIRE(timer);
+
+	x = malloc(sizeof(*x));
+	PG_REQUIRE_SYS(x);
+	x->timer = timer;
+	x->function = p;
+	x->data = data;
+
+	PG_REQUIRE(SetWaitableTimer(timer, &delay_time, 0, apc_trampoline, x, false));
+}
+
+static void
+close_fd(void *data)
+{
+	int		   *fd = data;
+
+	PG_REQUIRE_SYS(close(*fd) == 0);
+}
+
+static void
+close_handle(void *data)
+{
+	HANDLE		handle = data;
+
+	PG_REQUIRE(CloseHandle(handle));
+}
+#endif
+
+/*
+ * Tests of directory entry manipulation.
+ */
+static void
+filesystem_metadata_tests(void)
+{
+	int			fd;
+	int			fd2;
+	char		path[MAXPGPATH];
+	char		path2[MAXPGPATH];
+	char		path3[MAXPGPATH];
+	struct stat statbuf;
+	ssize_t		size;
+	bool		have_posix_unlink_semantics;
+#ifdef WIN32
+	HANDLE		handle;
+#endif
+
+	/*
+	 * Current versions of Windows 10 have "POSIX semantics" when unlinking
+	 * files, meaning that unlink() and rename() remove the directory entry
+	 * synchronously, just like Unix.  At least some server OSes still seem to
+	 * have the traditional Windows behavior, where directory entries remain
+	 * in STATUS_DELETE_PENDING state, visible but unopenable, until all file
+	 * handles are closed.  We have a lot of code paths to deal with the older
+	 * asynchronous behavior, and tests for those here.  Before we go further,
+	 * determine which behavior to expect.  Behavior may also vary on non-NTFS
+	 * filesystems.
+	 */
+	make_path(path, "test.txt");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_REQUIRE_SYS(fd >= 0);
+	PG_REQUIRE_SYS(unlink(path) == 0);
+	fd2 = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_REQUIRE_SYS(fd2 >= 0 || errno == EEXIST);
+	if (fd2 >= 0)
+	{
+		/* Expected behavior on Unix and some Windows 10 releases. */
+		PG_EXPECT_SYS(unlink(path) == 0, "POSIX unlink semantics detected: cleaning up");
+		have_posix_unlink_semantics = true;
+		PG_REQUIRE_SYS(close(fd2) == 0);
+	}
+	else
+	{
+		/* Our open wrapper fails with EEXIST for O_EXCL in this case. */
+		PG_EXPECT_SYS(errno == EEXIST,
+					  "traditional Windows unlink semantics detected: O_EXCL -> EEXIST");
+		have_posix_unlink_semantics = false;
+	}
+	PG_REQUIRE_SYS(close(fd) == 0);
+
+	/* Set up test directory structure. */
+
+	make_path(path, "dir1");
+	PG_REQUIRE_SYS(mkdir(path, 0777) == 0);
+	make_path(path, "dir1/my_subdir");
+	PG_REQUIRE_SYS(mkdir(path, 0777) == 0);
+	make_path(path, "dir1/test.txt");
+	fd = open(path, O_CREAT | O_RDWR | PG_BINARY, 0777);
+	PG_REQUIRE_SYS(fd >= 0);
+	PG_REQUIRE_SYS(write(fd, "hello world\n", 12) == 12);
+	PG_REQUIRE_SYS(close(fd) == 0);
+
+	/* Tests for symlink()/readlink(). */
+
+	make_path(path, "dir999/my_symlink");	/* name of symlink to create */
+	make_path(path2, "dir1/my_subdir"); /* name of directory it will point to */
+	PG_EXPECT(symlink(path2, path) == -1, "symlink fails on missing parent");
+	PG_EXPECT_EQ(errno, ENOENT);
+
+	make_path(path, "dir1/my_symlink"); /* name of symlink to create */
+	make_path(path2, "dir1/my_subdir"); /* name of directory it will point to */
+	PG_EXPECT_SYS(symlink(path2, path) == 0, "create symlink");
+
+	size = readlink(path, path3, sizeof(path3));
+	PG_EXPECT_SYS(size != -1, "readlink succeeds");
+	PG_EXPECT_EQ(size, strlen(path2), "readlink reports expected size");
+	path3[size] = '\0';
+#ifdef WIN32
+	normalize_slashes(path2);
+	normalize_slashes(path3);
+#endif
+	PG_EXPECT_EQ_STR(path2, path3, "readlink reports expected target");
+
+	PG_EXPECT(readlink("does-not-exist", path3, sizeof(path3)) == -1, "readlink fails on missing path");
+	PG_EXPECT_EQ(errno, ENOENT);
+
+	/* Tests for fstat(). */
+
+	make_path(path, "dir1/test.txt");
+	fd = open(path, O_RDWR, 0777);
+	PG_REQUIRE_SYS(fd >= 0);
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(fstat(fd, &statbuf) == 0, "fstat regular file");
+	PG_EXPECT(S_ISREG(statbuf.st_mode));
+	PG_EXPECT_EQ(statbuf.st_size, 12);
+	PG_EXPECT_EQ(statbuf.st_nlink, 1);
+	PG_REQUIRE_SYS(close(fd) == 0);
+
+	/* Tests for stat(). */
+
+	PG_EXPECT(stat("does-not-exist.txt", &statbuf) == -1, "stat missing file fails");
+	PG_EXPECT_EQ(errno, ENOENT);
+
+	make_path(path, "dir1/test.txt");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(stat(path, &statbuf) == 0, "stat regular file");
+	PG_EXPECT(S_ISREG(statbuf.st_mode));
+	PG_EXPECT_EQ(statbuf.st_size, 12);
+	PG_EXPECT_EQ(statbuf.st_nlink, 1);
+
+	make_path(path, "dir1/my_subdir");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(stat(path, &statbuf) == 0, "stat directory");
+	PG_EXPECT(S_ISDIR(statbuf.st_mode));
+
+	make_path(path, "dir1/my_symlink");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(stat(path, &statbuf) == 0, "stat symlink");
+	PG_EXPECT(S_ISDIR(statbuf.st_mode));
+
+	/* Tests for lstat(). */
+
+	PG_EXPECT(stat("does-not-exist.txt", &statbuf) == -1, "lstat missing file fails");
+	PG_EXPECT_EQ(errno, ENOENT);
+
+	make_path(path, "dir1/test.txt");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(lstat(path, &statbuf) == 0, "lstat regular file");
+	PG_EXPECT(S_ISREG(statbuf.st_mode));
+	PG_EXPECT_EQ(statbuf.st_size, 12);
+	PG_EXPECT_EQ(statbuf.st_nlink, 1);
+
+	make_path(path2, "dir1/my_subdir");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(lstat(path2, &statbuf) == 0, "lstat directory");
+	PG_EXPECT(S_ISDIR(statbuf.st_mode));
+
+	make_path(path, "dir1/my_symlink");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(lstat(path, &statbuf) == 0, "lstat symlink");
+	PG_EXPECT(S_ISLNK(statbuf.st_mode));
+	PG_EXPECT_EQ(statbuf.st_size, strlen(path2), "got expected symlink size");
+
+	/* Tests for link() and unlink(). */
+
+	make_path(path, "does-not-exist-1");
+	make_path(path2, "does-not-exist-2");
+	PG_EXPECT(link(path, path2) == -1, "link missing file fails");
+	PG_EXPECT_EQ(errno, ENOENT);
+
+	make_path(path, "dir1/test.txt");
+	make_path(path2, "dir1/test2.txt");
+	PG_EXPECT_SYS(link(path, path2) == 0, "link succeeds");
+
+	PG_EXPECT(lstat(path, &statbuf) == 0, "lstat link 1 succeeds");
+	PG_EXPECT(S_ISREG(statbuf.st_mode));
+	PG_EXPECT_EQ(statbuf.st_size, 12);
+	PG_EXPECT_EQ(statbuf.st_nlink, 2);
+
+	PG_EXPECT(lstat(path, &statbuf) == 0, "lstat link 2 succeeds");
+	PG_EXPECT(S_ISREG(statbuf.st_mode));
+	PG_EXPECT_EQ(statbuf.st_size, 12);
+	PG_EXPECT_EQ(statbuf.st_nlink, 2);
+
+	PG_EXPECT_SYS(unlink(path2) == 0, "unlink succeeds");
+
+	PG_EXPECT(lstat(path, &statbuf) == 0, "lstat link 1 succeeds");
+	PG_EXPECT(S_ISREG(statbuf.st_mode));
+	PG_EXPECT_EQ(statbuf.st_size, 12);
+	PG_EXPECT_EQ(statbuf.st_nlink, 1);
+
+	PG_EXPECT_SYS(lstat(path2, &statbuf) == -1, "lstat link 2 fails");
+	PG_EXPECT_EQ(errno, ENOENT);
+
+	/*
+	 * On Windows we have a special code-path to make unlink() work on
+	 * junction points created by our symlink() wrapper, to match POSIX
+	 * behavior.
+	 */
+	make_path(path, "unlink_me_symlink");
+	make_path(path2, "dir1");
+	PG_EXPECT_SYS(symlink(path2, path) == 0, "create symlink");
+	PG_EXPECT_SYS(unlink(path) == 0, "unlink symlink");
+
+#ifdef WIN32
+
+	/*
+	 * But make sure that it doesn't work on plain old directories.
+	 *
+	 * POSIX doesn't specify whether unlink() works on a directory, so we
+	 * don't check that on non-Windows.  In practice almost all known systems
+	 * fail with EPERM (POSIX systems) or EISDIR (non-standard error on
+	 * Linux), though AIX/JFS1 is rumored to succeed.  However, our Windows
+	 * emulation doesn't allow it, because we want to avoid surprises by
+	 * behaving like nearly all Unix systems.  So we check this on Windows
+	 * only, where it fails with non-standard EACCES.
+	 */
+	PG_EXPECT_SYS(unlink(path2) == -1, "Windows: can't unlink() a directory");
+	PG_EXPECT_EQ(errno, EACCES);
+#endif
+
+#ifdef WIN32
+
+	/*
+	 * Test that we automatically retry for a while if we get a sharing
+	 * violation because external software does not use FILE_SHARE_* when
+	 * opening our files.  See similar open() test for explanation.
+	 */
+
+	make_path(path, "name1.txt");
+
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	handle = CreateFile(path, GENERIC_WRITE | GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
+	PG_REQUIRE(handle);
+
+	pgwin32_dirmod_loops = 2;	/* minimize looping to fail fast in testing */
+	PG_EXPECT(unlink(path) == -1, "Windows: can't unlink file while non-shared handle exists");
+	PG_EXPECT_EQ(errno, EACCES);
+
+	pgwin32_dirmod_loops = 1800;	/* loop for up to 180s to make sure our
+									 * 100ms callback is run */
+	run_async_procedure_after_delay(close_handle, handle, 100); /* close handle after
+																 * 100ms */
+	PG_EXPECT_SYS(unlink(path) == 0, "Windows: can rename file after non-shared handle asynchronously closed");
+#endif
+
+	/* Tests for rename(). */
+
+	make_path(path, "name1.txt");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	make_path(path2, "name2.txt");
+	PG_EXPECT_SYS(rename(path, path2) == 0, "rename name1.txt -> name2.txt");
+
+	PG_EXPECT_EQ(open(path, O_RDWR | PG_BINARY, 0777), -1, "can't open name1.txt anymore");
+	PG_EXPECT_EQ(errno, ENOENT);
+
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	make_path(path2, "name2.txt");
+	PG_EXPECT_SYS(rename(path, path2) == 0, "rename name1.txt -> name2.txt, replacing it");
+
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	fd = open(path2, O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open name2.txt");
+	make_path(path2, "name2.txt");
+#ifdef WIN32
+
+	/*
+	 * Windows can't rename over an open non-unlinked file, even with
+	 * have_posix_unlink_semantics.
+	 */
+	pgwin32_dirmod_loops = 2;	/* minimize looping to fail fast in testing */
+	PG_EXPECT_SYS(rename(path, path2) == -1,
+				  "Windows: can't rename name1.txt -> name2.txt while name2.txt is open");
+	PG_EXPECT_EQ(errno, EACCES);
+	PG_EXPECT_SYS(unlink(path) == 0, "unlink name1.txt");
+#else
+	PG_EXPECT_SYS(rename(path, path2) == 0,
+				  "POSIX: can rename name1.txt -> name2.txt while name2.txt is open");
+#endif
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	make_path(path, "name1.txt");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	/* leave open */
+	make_path(path2, "name2.txt");
+	PG_EXPECT_SYS(rename(path, path2) == 0,
+				  "can rename name1.txt -> name2.txt while name1.txt is open");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	make_path(path, "name1.txt");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	fd = open(path2, O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open name2.txt");
+	PG_EXPECT_SYS(unlink(path2) == 0, "unlink name2.txt while it is still open");
+
+	if (have_posix_unlink_semantics)
+	{
+		PG_EXPECT_SYS(rename(path, path2) == 0,
+					  "POSIX: rename name1.txt -> name2.txt while unlinked file is still open");
+	}
+	else
+	{
+		PG_EXPECT_SYS(rename(path, path2) == -1,
+					  "Windows non-POSIX: cannot rename name1.txt -> name2.txt while unlinked file is still open");
+		PG_EXPECT_EQ(errno, EACCES);
+		PG_REQUIRE_SYS(unlink(path) == 0);
+	}
+	PG_REQUIRE_SYS(close(fd) == 0);
+
+#ifdef WIN32
+
+	/*
+	 * Test that we automatically retry for a while if we get a sharing
+	 * violation because external software does not use FILE_SHARE_* when
+	 * opening our files.  See similar open() test for explanation.
+	 */
+
+	make_path(path, "name1.txt");
+	make_path(path2, "name2.txt");
+
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	handle = CreateFile(path, GENERIC_WRITE | GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
+	PG_REQUIRE(handle);
+
+	pgwin32_dirmod_loops = 2;	/* minimize looping to fail fast in testing */
+	PG_EXPECT(rename(path, path2) == -1, "Windows: can't rename file while non-shared handle exists for name1");
+	PG_EXPECT_EQ(errno, EACCES);
+
+	pgwin32_dirmod_loops = 1800;	/* loop for up to 180s to make sure our
+									 * 100ms callback is run */
+	run_async_procedure_after_delay(close_handle, handle, 100); /* close handle after
+																 * 100ms */
+	PG_EXPECT_SYS(rename(path, path2) == 0, "Windows: can rename file after non-shared handle asynchronously closed");
+
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	handle = CreateFile(path2, GENERIC_WRITE | GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
+	PG_REQUIRE(handle);
+
+	pgwin32_dirmod_loops = 2;	/* minimize looping to fail fast in testing */
+	PG_EXPECT(rename(path, path2) == -1, "Windows: can't rename file while non-shared handle exists for name2");
+	PG_EXPECT_EQ(errno, EACCES);
+
+	pgwin32_dirmod_loops = 1800;	/* loop for up to 180s to make sure our
+									 * 100ms callback is run */
+	run_async_procedure_after_delay(close_handle, handle, 100); /* close handle after
+																 * 100ms */
+	PG_EXPECT_SYS(rename(path, path2) == 0, "Windows: can rename file after non-shared handle asynchronously closed");
+
+	PG_REQUIRE_SYS(unlink(path2) == 0);
+#endif
+
+	/* Tests for open(). */
+
+	make_path(path, "test.txt");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open(O_CREAT | O_EXCL) succeeds");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_EQ(fd, -1, "open(O_CREAT | O_EXCL) again fails");
+	PG_EXPECT_EQ(errno, EEXIST);
+
+	fd = open(path, O_CREAT | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open(O_CREAT) succeeds");
+
+	/*
+	 * We can unlink a file that we still have an open handle for on Windows,
+	 * because our open() wrapper used FILE_SHARE_DELETE.
+	 */
+	PG_EXPECT_SYS(unlink(path) == 0);
+
+	/*
+	 * On older Windows, our open wrapper on Windows will see
+	 * STATUS_DELETE_PENDING and pretend it can't see the file.  On newer
+	 * Windows, it won't see the file.  Either way matches expected POSIX
+	 * behavior.
+	 */
+	PG_EXPECT(open(path, O_RDWR | PG_BINARY, 0777) == -1);
+	PG_EXPECT(errno == ENOENT);
+
+	/*
+	 * Things are trickier if you asked to create a file.  ENOENT doesn't make
+	 * sense as a error number to a program expecting POSIX behavior then, so
+	 * our wrpaper uses EEXIST for lack of anything more appropriate.
+	 */
+	fd2 = open(path, O_RDWR | PG_BINARY | O_CREAT, 0777);
+	if (have_posix_unlink_semantics)
+	{
+		PG_EXPECT_SYS(fd2 >= 0, "POSIX: create file with same name, while unlinked file is still open");
+		PG_EXPECT_SYS(close(fd2) == 0);
+	}
+	else
+	{
+		PG_EXPECT_SYS(fd2 == -1,
+					  "Windows non-POSIX: cannot create file with same name, while unlinked file is still open");
+		PG_EXPECT_EQ(errno, EEXIST);
+	}
+
+	/* After closing the handle, Windows non-POSIX can now create a new file. */
+	PG_EXPECT_SYS(close(fd) == 0);
+	fd = open(path, O_RDWR | PG_BINARY | O_CREAT, 0777);
+	PG_EXPECT_SYS(fd >= 0,
+				  "can create a file with recently unlinked name after closing");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+#ifdef WIN32
+
+	/*
+	 * Even on systems new enough to have POSIX unlink behavior, the problem
+	 * of sharing violations still applies.
+	 *
+	 * Our own open() wrapper is careful to use the FILE_SHARE_* flags to
+	 * avoid the dreaded ERROR_SHARING_VIOLATION, but it's possible that
+	 * another program might open one of our files without them.  In that
+	 * case, our open() wrapper will sleep and retry for a long time, in the
+	 * hope that other program goes away.  So let's open a handle with
+	 * antisocial flags. We'll adjust the loop count via a side-hatch so we
+	 * don't waste time in this test, though.
+	 */
+	handle = CreateFile(path, GENERIC_WRITE | GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
+	PG_REQUIRE(handle);
+
+	pgwin32_open_handle_loops = 2;	/* minimize retries so test is fast */
+	PG_EXPECT(open(path, O_RDWR | PG_BINARY, 0777) == -1, "Windows: can't open file while non-shared handle exists");
+	PG_EXPECT_EQ(errno, EACCES);
+
+	pgwin32_open_handle_loops = 18000;	/* 180s must be long enough for our
+										 * async callback to run */
+	run_async_procedure_after_delay(close_handle, handle, 100); /* close handle after
+																 * 100ms */
+	fd = open(path, O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "Windows: can open file after non-shared handle asynchronously closed");
+	PG_REQUIRE_SYS(close(fd) == 0);
+#endif
+
+	PG_EXPECT_SYS(unlink(path) == 0);
+
+	/* Tests for opendir(), readdir(), closedir(). */
+	{
+		DIR		   *dir;
+		struct dirent *de;
+		int			dot = -1;
+		int			dotdot = -1;
+		int			my_subdir = -1;
+		int			my_symlink = -1;
+		int			test_txt = -1;
+
+		make_path(path, "does-not-exist");
+		PG_EXPECT(opendir(path) == NULL, "open missing directory fails");
+		PG_EXPECT_EQ(errno, ENOENT);
+
+		make_path(path, "dir1");
+		PG_EXPECT_SYS((dir = opendir(path)), "open directory");
+
+#ifdef DT_REG
+/*
+ * Linux, *BSD, macOS and our Windows wrappers have BSD d_type.  On a few rare
+ * file systems, it may be reported as DT_UNKNOWN so we have to tolerate that too.
+ */
+#define LOAD(name, variable) if (strcmp(de->d_name, name) == 0) variable = de->d_type
+#define CHECK(name, variable, type) \
+	PG_EXPECT(variable != -1, name " was found"); \
+	PG_EXPECT(variable == DT_UNKNOWN || variable == type, name " has type DT_UNKNOWN or " #type)
+#else
+/*
+ * Solaris, AIX and Cygwin do not have it (it's not in POSIX).  Just check that
+ * we saw the file and ignore the type.
+ */
+#define LOAD(name, variable) if (strcmp(de->d_name, name) == 0) variable = 0
+#define CHECK(name, variable, type) \
+	PG_EXPECT(variable != -1, name " was found")
+#endif
+
+		/* Load and check in two phases because the order is unknown. */
+		errno = 0;
+		while ((de = readdir(dir)))
+		{
+			LOAD(".", dot);
+			LOAD("..", dotdot);
+			LOAD("my_subdir", my_subdir);
+			LOAD("my_symlink", my_symlink);
+			LOAD("test.txt", test_txt);
+		}
+		PG_EXPECT_SYS(errno == 0, "ran out of dirents without error");
+
+		CHECK(".", dot, DT_DIR);
+		CHECK("..", dotdot, DT_DIR);
+		CHECK("my_subdir", my_subdir, DT_DIR);
+		CHECK("my_symlink", my_symlink, DT_LNK);
+		CHECK("test.txt", test_txt, DT_REG);
+
+#undef LOAD
+#undef CHECK
+	}
+
+	/*
+	 * On older Windows, readdir() sees entries that are in
+	 * STATUS_DELETE_PENDING state, and they prevent the directory itself from
+	 * being unlinked.  Demonstrate that difference here.
+	 */
+	make_path(path, "dir2");
+	PG_REQUIRE_SYS(mkdir(path, 0777) == 0);
+	make_path(path, "dir2/test.txt");
+	fd = open(path, O_CREAT | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open file");
+	PG_EXPECT_SYS(unlink(path) == 0, "unlink file while still open");
+	if (have_posix_unlink_semantics)
+		PG_EXPECT(!directory_contains("dir2", "test.txt"), "POSIX: readdir doesn't see it before closing");
+	else
+		PG_EXPECT(directory_contains("dir2", "test.txt"), "Windows non-POSIX: readdir still sees it before closing");
+	PG_EXPECT_SYS(close(fd) == 0);
+	PG_EXPECT(!directory_contains("dir2", "test.txt"), "readdir doesn't see it after closing");
+}
+
+/*
+ * Tests for pread, pwrite, and vector variants.
+ */
+static void
+filesystem_io_tests(void)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+	char		buffer[80];
+	char		buffer2[80];
+	struct iovec iov[8];
+
+	/*
+	 * These are our own wrappers on Windows.  On Unix, they're just macros
+	 * for system APIs, except on Solaris which shares the pg_{read,write}v
+	 * replacements used on Windows.
+	 */
+
+	make_path(path, "io_test_file.txt");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "create a new file");
+
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 0, "initial file position is zero");
+
+	PG_EXPECT_EQ(pg_pwrite(fd, "llo", 3, 2), 3);
+	PG_EXPECT_EQ(pg_pwrite(fd, "he", 2, 0), 2);
+
+	/*
+	 * On Windows, the current position moves, which is why we have the pg_
+	 * prefixes as a warning to users.  Demonstrate that difference here.
+	 */
+#ifdef WIN32
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 2, "Windows: file position moved");
+#else
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 0, "POSIX: file position didn't move");
+#endif
+
+	memset(buffer, 0, sizeof(buffer));
+	PG_EXPECT_EQ(pg_pread(fd, buffer, 2, 3), 2);
+	PG_EXPECT(memcmp(buffer, "lo", 2) == 0);
+	PG_EXPECT_EQ(pg_pread(fd, buffer, 3, 0), 3);
+	PG_EXPECT(memcmp(buffer, "hel", 3) == 0);
+
+#ifdef WIN32
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 3, "Windows: file position moved");
+#else
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 0, "POSIX: file position didn't move");
+#endif
+
+	PG_EXPECT_EQ(pg_pread(fd, buffer, 80, 0), 5, "pg_pread short read");
+	PG_EXPECT_EQ(pg_pread(fd, buffer, 80, 5), 0, "pg_pread EOF");
+
+	iov[0].iov_base = "wo";
+	iov[0].iov_len = 2;
+	iov[1].iov_base = "rld";
+	iov[1].iov_len = 3;
+	PG_EXPECT_EQ(pg_pwritev(fd, iov, 2, 5), 5);
+
+#ifdef WIN32
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 10, "Windows: file position moved");
+#else
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 0, "POSIX: file position didn't move");
+#endif
+
+	memset(buffer, 0, sizeof(buffer));
+	memset(buffer2, 0, sizeof(buffer2));
+	iov[0].iov_base = buffer;
+	iov[0].iov_len = 4;
+	iov[1].iov_base = buffer2;
+	iov[1].iov_len = 4;
+	PG_EXPECT_EQ(pg_preadv(fd, iov, 2, 1), 8);
+
+#ifdef WIN32
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 9, "Windows: file position moved");
+#else
+	PG_EXPECT_EQ(lseek(fd, 0, SEEK_CUR), 0, "POSIX: file position didn't move");
+#endif
+
+	PG_EXPECT(memcmp(buffer, "ello", 4) == 0);
+	PG_EXPECT(memcmp(buffer2, "worl", 4) == 0);
+
+	memset(buffer, 0, sizeof(buffer));
+	memset(buffer2, 0, sizeof(buffer2));
+	iov[0].iov_base = buffer;
+	iov[0].iov_len = 1;
+	iov[1].iov_base = buffer2;
+	iov[1].iov_len = 80;
+	PG_EXPECT_EQ(pg_preadv(fd, iov, 2, 8), 2);
+	PG_EXPECT_EQ(pg_preadv(fd, iov, 2, 9), 1);
+	PG_EXPECT_EQ(pg_preadv(fd, iov, 2, 10), 0);
+	PG_EXPECT_EQ(pg_preadv(fd, iov, 2, 11), 0);
+
+	memset(buffer, 0, sizeof(buffer));
+	PG_EXPECT_EQ(pg_pread(fd, buffer, 10, 0), 10);
+	PG_EXPECT_EQ_STR(buffer, "helloworld");
+
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	/* Demonstrate the effects of "text" mode (no PG_BINARY flag). */
+
+	/* Write out a message in text mode. */
+	fd = open(path, O_RDWR, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open file in text mode");
+	PG_EXPECT_EQ(write(fd, "hello world\n", 12), 12);
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	/* Read it back in binary mode, to reveal the translation. */
+	fd = open(path, O_RDWR | PG_BINARY, 0777);
+#ifdef WIN32
+	PG_EXPECT_EQ(read(fd, buffer, sizeof(buffer)), 13,
+				 "Windows: \\n was translated");
+	PG_EXPECT(memcmp(buffer, "hello world\r\n", 13) == 0,
+			  "Windows: \\n was translated to \\r\\n");
+#else
+	PG_EXPECT_EQ(read(fd, buffer, sizeof(buffer)), 12,
+				 "POSIX: \\n was not translated");
+	PG_EXPECT(memcmp(buffer, "hello world\n", 12) == 0,
+			  "POSIX: \\n is \\n");
+#endif
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	/* The opposite translation happens in text mode, hiding it. */
+	fd = open(path, O_RDWR, 0777);
+	PG_EXPECT_EQ(read(fd, buffer, sizeof(buffer)), 12);
+	PG_EXPECT(memcmp(buffer, "hello world\n", 12) == 0);
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	PG_REQUIRE_SYS(unlink(path) == 0);
+
+	/*
+	 * Write out a message in text mode, this time with pg_pwrite(), which
+	 * does not suffer from newline translation.
+	 */
+	fd = open(path, O_RDWR | O_CREAT, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open file in text mode");
+	PG_EXPECT_EQ(pg_pwrite(fd, "hello world\n", 12, 0), 12);
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	/* Read it back in binary mode to verify that. */
+	fd = open(path, O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_EQ(pg_pread(fd, buffer, sizeof(buffer), 0), 12,
+				 "\\n was not translated by pg_pread()");
+	PG_EXPECT_SYS(close(fd) == 0);
+
+	PG_REQUIRE_SYS(unlink(path) == 0);
+}
+
+/*
+ * Exercise fsync and fdatasync.
+ */
+static void
+filesystem_sync_tests(void)
+{
+	int			fd;
+	char		path[MAXPGPATH];
+
+	make_path(path, "sync_me.txt");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_REQUIRE_SYS(fd >= 0);
+
+	PG_REQUIRE_SYS(write(fd, "x", 1) == 1);
+	PG_EXPECT_SYS(fsync(fd) == 0);
+
+	PG_REQUIRE_SYS(write(fd, "x", 1) == 1);
+	PG_EXPECT_SYS(fdatasync(fd) == 0);
+
+	PG_REQUIRE_SYS(unlink(path) == 0);
+}
+
+int
+main()
+{
+	PG_BEGIN_TESTS();
+
+	filesystem_metadata_tests();
+	filesystem_io_tests();
+	filesystem_sync_tests();
+
+	PG_END_TESTS();
+}
diff --git a/src/port/t/meson.build b/src/port/t/meson.build
new file mode 100644
index 0000000000..bc5c7fc0b8
--- /dev/null
+++ b/src/port/t/meson.build
@@ -0,0 +1,15 @@
+port_tests = []
+port_tests += executable('001_filesystem',
+  ['001_filesystem.c'],
+  dependencies: [frontend_code],
+  kwargs: test_bin_args,
+)
+
+tests += {
+  'name': 'port',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'exec_tap': {
+    'tests': port_tests,
+  },
+}
-- 
2.37.3

From b61a66f874d3b8151158e7956a3272e9259a8941 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 16 Oct 2022 12:07:33 +1300
Subject: [PATCH 06/10] Fix lstat() on broken junction points.

When using junction points to emulate symlinks on Windows, one edge case
was not handled correctly by commit c5cb8f3b: if a junction point is
broken (pointing to a non-existent path), we'd report ENOENT.  This
doesn't break any known use case, but was noticed while testing and is
fixed here for completeness.

Also add translation ERROR_CANT_RESOLVE_FILENAME -> ENOENT, as that is
one of the errors Windows can report depending on format of the broken
path.
---
 src/port/t/001_filesystem.c |  7 +++++++
 src/port/win32error.c       |  3 +++
 src/port/win32stat.c        | 27 ++++++++++++++++++++++-----
 3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/src/port/t/001_filesystem.c b/src/port/t/001_filesystem.c
index c2832b0a20..28ef069ee0 100644
--- a/src/port/t/001_filesystem.c
+++ b/src/port/t/001_filesystem.c
@@ -273,6 +273,13 @@ filesystem_metadata_tests(void)
 	PG_EXPECT(S_ISLNK(statbuf.st_mode));
 	PG_EXPECT_EQ(statbuf.st_size, strlen(path2), "got expected symlink size");
 
+	make_path(path, "broken-symlink");
+	make_path(path2, "does-not-exist");
+	PG_EXPECT_SYS(symlink(path2, path) == 0, "make a broken symlink");
+	PG_EXPECT_SYS(lstat(path, &statbuf) == 0, "lstat broken symlink");
+	PG_EXPECT(S_ISLNK(statbuf.st_mode));
+	PG_EXPECT_SYS(unlink(path) == 0);
+
 	/* Tests for link() and unlink(). */
 
 	make_path(path, "does-not-exist-1");
diff --git a/src/port/win32error.c b/src/port/win32error.c
index a78d323827..67ce805d77 100644
--- a/src/port/win32error.c
+++ b/src/port/win32error.c
@@ -167,6 +167,9 @@ static const struct
 	},
 	{
 		ERROR_INVALID_NAME, ENOENT
+	},
+	{
+		ERROR_CANT_RESOLVE_FILENAME, ENOENT
 	}
 };
 
diff --git a/src/port/win32stat.c b/src/port/win32stat.c
index 5f3d0d22ff..ce8d87093d 100644
--- a/src/port/win32stat.c
+++ b/src/port/win32stat.c
@@ -125,15 +125,30 @@ _pglstat64(const char *name, struct stat *buf)
 
 	hFile = pgwin32_open_handle(name, O_RDONLY, true);
 	if (hFile == INVALID_HANDLE_VALUE)
-		return -1;
-
-	ret = fileinfo_to_stat(hFile, buf);
+	{
+		if (errno == ENOENT)
+		{
+			/*
+			 * If it's a junction point pointing to a non-existent path, we'll
+			 * have ENOENT here (because pgwin32_open_handle does not use
+			 * FILE_FLAG_OPEN_REPARSE_POINT).  In that case, we'll try again
+			 * with readlink() below, which will distinguish true ENOENT from
+			 * pseudo-symlink.
+			 */
+			memset(buf, 0, sizeof(*buf));
+			ret = 0;
+		}
+		else
+			return -1;
+	}
+	else
+		ret = fileinfo_to_stat(hFile, buf);
 
 	/*
 	 * Junction points appear as directories to fileinfo_to_stat(), so we'll
 	 * need to do a bit more work to distinguish them.
 	 */
-	if (ret == 0 && S_ISDIR(buf->st_mode))
+	if ((ret == 0 && S_ISDIR(buf->st_mode)) || hFile == INVALID_HANDLE_VALUE)
 	{
 		char		next[MAXPGPATH];
 		ssize_t		size;
@@ -169,10 +184,12 @@ _pglstat64(const char *name, struct stat *buf)
 			buf->st_mode &= ~S_IFDIR;
 			buf->st_mode |= S_IFLNK;
 			buf->st_size = size;
+			ret = 0;
 		}
 	}
 
-	CloseHandle(hFile);
+	if (hFile != INVALID_HANDLE_VALUE)
+		CloseHandle(hFile);
 	return ret;
 }
 
-- 
2.37.3

From 664fa8d965151b4cbf1e0408b2944cb8b7d5bc22 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmu...@postgresql.org>
Date: Thu, 11 Aug 2022 10:42:13 +1200
Subject: [PATCH 07/10] Fix readlink() for non-PostgreSQL-created junction
 points.

Our symlink() and readlink() replacements perform a naive transformation
from DOS paths to NT paths and back, as required by the junction point
APIs.  This was enough for the "drive absolute" paths we expect users to
provide for tablespaces, for example "d:\my\fast\storage".

Since commit c5cb8f3b taught stat() to follow symlinks, and since initdb
uses pg_mkdir_p(), and that examines parent directories, our humble
readlink() implementation can now be exposed to junction points not of
PostgreSQL origin.  Those might be corrupted by our naive path mangling,
which doesn't really understand NT paths in general.

Simply decline to transform paths that don't look like a drive absolute
path.  That means that readlink() returns the NT path directly when
checking a parent directory of PGDATA that happen to point to a drive
using "rooted" format.  That  works for the purposes of our stat()
emulation.

Reported-by: Roman Zharkov <r.zhar...@postgrespro.ru>
Reviewed-by: Roman Zharkov <r.zhar...@postgrespro.ru>
Discussion: https://postgr.es/m/4590c37927d7b8ee84f9855d83229018%40postgrespro.ru
---
 src/port/dirmod.c           | 17 ++++++++++---
 src/port/t/001_filesystem.c | 51 +++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 3e1c78fae6..f7fd4c15ad 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -366,10 +366,21 @@ pgreadlink(const char *path, char *buf, size_t size)
 	r -= 1;
 
 	/*
-	 * If the path starts with "\??\", which it will do in most (all?) cases,
-	 * strip those out.
+	 * If the path starts with "\??\" followed by a "drive absolute" path
+	 * (known to Windows APIs as RtlPathTypeDriveAbsolute), then strip that
+	 * prefix.  This undoes some of the transformation performed by
+	 * pqsymlink(), to get back to a format that users are used to seeing.  We
+	 * don't know how to transform other path types that might be encountered
+	 * outside PGDATA, so we just return them directly.
 	 */
-	if (r > 4 && strncmp(buf, "\\??\\", 4) == 0)
+	if (r >= 7 &&
+		buf[0] == '\\' &&
+		buf[1] == '?' &&
+		buf[2] == '?' &&
+		buf[3] == '\\' &&
+		isalpha(buf[4]) &&
+		buf[5] == ':' &&
+		buf[6] == '\\')
 	{
 		memmove(buf, buf + 4, strlen(buf + 4) + 1);
 		r -= 4;
diff --git a/src/port/t/001_filesystem.c b/src/port/t/001_filesystem.c
index 28ef069ee0..4178b8e1f2 100644
--- a/src/port/t/001_filesystem.c
+++ b/src/port/t/001_filesystem.c
@@ -216,6 +216,57 @@ filesystem_metadata_tests(void)
 	PG_EXPECT(readlink("does-not-exist", path3, sizeof(path3)) == -1, "readlink fails on missing path");
 	PG_EXPECT_EQ(errno, ENOENT);
 
+	/*
+	 * Checks that we don't corrupt non-drive-absolute paths when peforming
+	 * internal conversions.
+	 */
+
+	/*
+	 * Typical case: Windows drive absolute.  This should also be accepted on
+	 * POSIX systems, because they are required not to validate the target
+	 * string as a path.
+	 */
+	make_path(path2, "my_symlink");
+	PG_EXPECT_SYS(symlink("c:\\foo", path2) == 0);
+	size = readlink(path2, path3, sizeof(path3));
+	PG_EXPECT_SYS(size != -1, "readlink succeeds");
+	PG_EXPECT_EQ(size, 6);
+	path3[size] = '\0';
+	PG_EXPECT_EQ_STR(path3, "c:\\foo");
+	PG_EXPECT_SYS(unlink(path2) == 0);
+
+	/*
+	 * Drive absolute given in full NT format will be stripped on round-trip
+	 * through our Windows emulations.
+	 */
+	make_path(path2, "my_symlink");
+	PG_EXPECT_SYS(symlink("\\??\\c:\\foo", path2) == 0);
+	size = readlink(path2, path3, sizeof(path3));
+	PG_EXPECT_SYS(size != -1, "readlink succeeds");
+	path3[size] = '\0';
+#ifdef WIN32
+	PG_EXPECT_EQ(size, 6);
+	PG_EXPECT_EQ_STR(path3, "c:\\foo");
+#else
+	PG_EXPECT_EQ(size, 10);
+	PG_EXPECT_EQ_STR(path3, "\\??\\c:\\foo");
+#endif
+	PG_EXPECT_SYS(unlink(path2) == 0);
+
+	/*
+	 * Anything that doesn't look like the NT pattern that symlink() creates
+	 * will be returned verbatim.  This will allow our stat() to handle paths
+	 * that were not created by symlink().
+	 */
+	make_path(path2, "my_symlink");
+	PG_EXPECT_SYS(symlink("\\??\\Volume1234", path2) == 0);
+	size = readlink(path2, path3, sizeof(path3));
+	PG_EXPECT_SYS(size != -1, "readlink succeeds");
+	PG_EXPECT_EQ(size, 14);
+	path3[size] = '\0';
+	PG_EXPECT_EQ_STR(path3, "\\??\\Volume1234");
+	PG_EXPECT_SYS(unlink(path2) == 0);
+
 	/* Tests for fstat(). */
 
 	make_path(path, "dir1/test.txt");
-- 
2.37.3

From d76d6152eaf323005829e1856915b1969c3a93b2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 11 Aug 2022 12:30:48 +1200
Subject: [PATCH 08/10] Fix stat() for recursive junction points on Windows.

Commit c5cb8f3b supposed that we'd only ever have to follow one junction
point in stat(), because we don't construct longer chains of them ourselves.
When examining a parent directory supplied by the user, we should really be
able to cope with longer chains, just in case someone has their system
set up that way.  Choose an arbitrary cap of 8, to match the minimum
acceptable value of SYMLOOP_MAX in POSIX.

Previously I'd avoided reporting ELOOP thinking Windows didn't have it,
but it turns out that it does.

Reviewed-by: Roman Zharkov <r.zhar...@postgrespro.ru>
Discussion: https://postgr.es/m/CA%2BhUKGJ7JDGWYFt9%3D-TyJiRRy5q9TtPfqeKkneWDr1XPU1%2Biqw%40mail.gmail.com
---
 src/port/t/001_filesystem.c | 55 +++++++++++++++++++++++++++++++++++++
 src/port/win32stat.c        | 26 +++++++++---------
 2 files changed, 68 insertions(+), 13 deletions(-)

diff --git a/src/port/t/001_filesystem.c b/src/port/t/001_filesystem.c
index 4178b8e1f2..501d7e1603 100644
--- a/src/port/t/001_filesystem.c
+++ b/src/port/t/001_filesystem.c
@@ -301,6 +301,61 @@ filesystem_metadata_tests(void)
 	PG_EXPECT(stat(path, &statbuf) == 0, "stat symlink");
 	PG_EXPECT(S_ISDIR(statbuf.st_mode));
 
+	/* Recursive symlinks. */
+	make_path(path, "dir1");
+	make_path(path2, "sym001");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym001 -> dir1");
+	make_path(path, "sym001");
+	make_path(path2, "sym002");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym002 -> sym001");
+	make_path(path, "sym002");
+	make_path(path2, "sym003");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym003 -> sym002");
+	make_path(path, "sym003");
+	make_path(path2, "sym004");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym004 -> sym003");
+	make_path(path, "sym004");
+	make_path(path2, "sym005");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym005 -> sym004");
+	make_path(path, "sym005");
+	make_path(path2, "sym006");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym006 -> sym005");
+	make_path(path, "sym006");
+	make_path(path2, "sym007");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym007 -> sym006");
+	make_path(path, "sym007");
+	make_path(path2, "sym008");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym008 -> sym007");
+	make_path(path, "sym008");
+	make_path(path2, "sym009");
+	PG_EXPECT_SYS(symlink(path, path2) == 0, "sym009 -> sym008");
+
+	/* POSIX says SYMLOOP_MAX should be at least 8. */
+	make_path(path, "sym008");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT_SYS(stat(path, &statbuf) == 0, "stat sym008");
+	PG_EXPECT(S_ISDIR(statbuf.st_mode));
+
+#ifdef WIN32
+
+	/*
+	 * Test ELOOP failure in our Windows implementation of stat(), because we
+	 * know it gives up after 8.
+	 */
+	make_path(path, "sym009");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(stat(path, &statbuf) == -1, "Windows: stat sym009 fails");
+	PG_EXPECT_EQ(errno, ELOOP);
+#endif
+
+	/* If we break the chain we get ENOENT. */
+	make_path(path, "sym003");
+	PG_EXPECT_SYS(unlink(path) == 0);
+	make_path(path, "sym008");
+	memset(&statbuf, 0, sizeof(statbuf));
+	PG_EXPECT(stat(path, &statbuf) == -1, "stat broken symlink chain fails");
+	PG_EXPECT_EQ(errno, ENOENT);
+
 	/* Tests for lstat(). */
 
 	PG_EXPECT(stat("does-not-exist.txt", &statbuf) == -1, "lstat missing file fails");
diff --git a/src/port/win32stat.c b/src/port/win32stat.c
index ce8d87093d..e6553e4030 100644
--- a/src/port/win32stat.c
+++ b/src/port/win32stat.c
@@ -199,23 +199,33 @@ _pglstat64(const char *name, struct stat *buf)
 int
 _pgstat64(const char *name, struct stat *buf)
 {
+	int			loops = 0;
 	int			ret;
+	char		curr[MAXPGPATH];
 
 	ret = _pglstat64(name, buf);
 
+	strlcpy(curr, name, MAXPGPATH);
+
 	/* Do we need to follow a symlink (junction point)? */
-	if (ret == 0 && S_ISLNK(buf->st_mode))
+	while (ret == 0 && S_ISLNK(buf->st_mode))
 	{
 		char		next[MAXPGPATH];
 		ssize_t		size;
 
+		if (++loops > 8)
+		{
+			errno = ELOOP;
+			return -1;
+		}
+
 		/*
 		 * _pglstat64() already called readlink() once to be able to fill in
 		 * st_size, and now we need to do it again to get the path to follow.
 		 * That could be optimized, but stat() on symlinks is probably rare
 		 * and this way is simple.
 		 */
-		size = readlink(name, next, sizeof(next));
+		size = readlink(curr, next, sizeof(next));
 		if (size < 0)
 		{
 			if (errno == EACCES &&
@@ -234,17 +244,7 @@ _pgstat64(const char *name, struct stat *buf)
 		next[size] = 0;
 
 		ret = _pglstat64(next, buf);
-		if (ret == 0 && S_ISLNK(buf->st_mode))
-		{
-			/*
-			 * We're only prepared to go one hop, because we only expect to
-			 * deal with the simple cases that we create.  The error for too
-			 * many symlinks is supposed to be ELOOP, but Windows hasn't got
-			 * it.
-			 */
-			errno = EIO;
-			return -1;
-		}
+		strcpy(curr, next);
 	}
 
 	return ret;
-- 
2.37.3

From 033e4898a076e65162be3dcd60badaa40ba8f736 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Mon, 3 Oct 2022 09:15:02 +1300
Subject: [PATCH 09/10] Fix unlink() for STATUS_DELETE_PENDING on Windows.

Commit c5cb8f3b didn't handle STATUS_DELETE_PENDING correctly, and would
report ENOENT immediately without waiting.  If we called unlink(name)
twice in a row and then rmdir(parent) while someone had an open handle,
previously the second call to unlink() would block until that handle
went away, or a 10 second timeout expired.  This change resulted in a
test occasionally failing on CI, which still has an OS without POSIX
semantics for unlink.  Restore.

Diagnosed-by: Justin Pryzby <pry...@telsasoft.com>
Reviewed-by: Michael Paquier <mich...@paquier.xyz>
Discussion: https://postgr.es/m/20220920013122.GA31833%40telsasoft.com
---
 src/port/dirmod.c           | 38 +++++++++++++++++++++++++++++++++++--
 src/port/t/001_filesystem.c | 36 +++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index f7fd4c15ad..f4abde97bf 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -39,6 +39,10 @@
 #endif
 #endif
 
+#if defined(WIN32) && !defined(__CYGWIN__)
+#include "port/win32ntdll.h"
+#endif
+
 #if defined(WIN32) || defined(__CYGWIN__)
 
 /* Externally visable only to allow testing. */
@@ -94,6 +98,22 @@ pgrename(const char *from, const char *to)
 	return 0;
 }
 
+/*
+ * Check if _pglstat64()'s reason for failure was STATUS_DELETE_PENDING.
+ * This doesn't apply to Cygwin, which has its own lstat() that would report
+ * the case as EACCES.
+*/
+static bool
+lstat_error_was_status_delete_pending(void)
+{
+	if (errno != ENOENT)
+		return false;
+#if defined(WIN32) && !defined(__CYGWIN__)
+	if (pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
+		return true;
+#endif
+	return false;
+}
 
 /*
  *	pgunlink
@@ -101,6 +121,7 @@ pgrename(const char *from, const char *to)
 int
 pgunlink(const char *path)
 {
+	bool		is_lnk;
 	int			loops = 0;
 	struct stat st;
 
@@ -125,9 +146,22 @@ pgunlink(const char *path)
 	 * due to sharing violations, but that seems unlikely.  We could perhaps
 	 * prevent that by holding a file handle ourselves across the lstat() and
 	 * the retry loop, but that seems like over-engineering for now.
+	 *
+	 * In the special case of a STATUS_DELETE_PENDING error (file already
+	 * unlinked, but someone still has it open), we don't want to report ENOENT
+	 * to the caller immediately, because rmdir(parent) would probably fail.
+	 * We want to wait until the file truly goes away so that simple recursive
+	 * directory unlink algorithms work.
 	 */
 	if (lstat(path, &st) < 0)
-		return -1;
+	{
+		if (lstat_error_was_status_delete_pending())
+			is_lnk = false;
+		else
+			return -1;
+	}
+	else
+		is_lnk = S_ISLNK(st.st_mode);
 
 	/*
 	 * We need to loop because even though PostgreSQL uses flags that allow
@@ -136,7 +170,7 @@ pgunlink(const char *path)
 	 * someone else to close the file, as the caller might be holding locks
 	 * and blocking other backends.
 	 */
-	while ((S_ISLNK(st.st_mode) ? rmdir(path) : unlink(path)) < 0)
+	while ((is_lnk ? rmdir(path) : unlink(path)) < 0)
 	{
 		if (errno != EACCES)
 			return -1;
diff --git a/src/port/t/001_filesystem.c b/src/port/t/001_filesystem.c
index 501d7e1603..346978783a 100644
--- a/src/port/t/001_filesystem.c
+++ b/src/port/t/001_filesystem.c
@@ -472,6 +472,42 @@ filesystem_metadata_tests(void)
 	PG_EXPECT_SYS(unlink(path) == 0, "Windows: can rename file after non-shared handle asynchronously closed");
 #endif
 
+	/*
+	 * Our Windows unlink() wrapper blocks in a retry loop if you try to
+	 * unlink a file in STATUS_DELETE_PENDING (ie that has already been
+	 * unlinked but is still open), until it times out with EACCES or reaches
+	 * ENOENT.  That may be useful for waiting for files to be asynchronously
+	 * unlinked while performing a recursive unlink on
+	 * !have_posix_unlink_semantics systems, so that rmdir(parent) works.
+	 */
+	make_path(path, "dir2");
+	PG_EXPECT_SYS(mkdir(path, 0777) == 0);
+	make_path(path, "dir2/test-file");
+	fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+	PG_EXPECT_SYS(fd >= 0, "open dir2/test-file");
+	PG_EXPECT_SYS(unlink(path) == 0, "unlink file while it's open, once");
+#ifdef WIN32
+	pgwin32_dirmod_loops = 2;	/* minimize looping to fail fast in testing */
+#endif
+	PG_EXPECT(unlink(path) == -1, "can't unlink again");
+	if (have_posix_unlink_semantics)
+	{
+		PG_EXPECT_EQ(errno, ENOENT, "POSIX: we expect ENOENT");
+	}
+	else
+	{
+		PG_EXPECT_EQ(errno, EACCES, "Windows non-POSIX: we expect EACCES (delete already pending)");
+#ifdef WIN32
+		pgwin32_dirmod_loops = 1800;	/* loop for up to 180s to make sure
+										 * our 100ms callback is run */
+		run_async_procedure_after_delay(close_fd, &fd, 100);	/* close fd after 100ms */
+#endif
+		PG_EXPECT(unlink(path) == -1, "Windows non-POSIX: trying again fails");
+		PG_EXPECT_EQ(errno, ENOENT, "Windows non-POSIX: ... but blocked until ENOENT was reached due to asynchronous close");
+	}
+	make_path(path, "dir2");
+	PG_EXPECT_SYS(rmdir(path) == 0, "now we can remove the directory");
+
 	/* Tests for rename(). */
 
 	make_path(path, "name1.txt");
-- 
2.37.3

From ac3518414215b08ed49be797bd6c57efbe36f07f Mon Sep 17 00:00:00 2001
From: Thomas <thomas.mu...@gmail.com>
Date: Mon, 17 Oct 2022 22:41:18 -0700
Subject: [PATCH 10/10] Use POSIX semantics for unlink() and rename() on
 Windows.

Use SetInformationByHandle() directly to implement unlink and rename
with POSIX semantics.

XXX Does this work on all relevant filesystems, and if not, how does it
fail?

Author: Victor Spirin <v.spi...@postgrespro.ru>
Author: Thomas Munro <thomas.mu...@gmail.com>
Discussion: https://postgr.es/m/a529b660-da15-5b62-21a0-9936768210fd%40postgrespro.ru
---
 src/port/dirmod.c           | 163 +++++++++++++++++++++++-------------
 src/port/t/001_filesystem.c |  31 +++----
 2 files changed, 121 insertions(+), 73 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index f4abde97bf..b96b0dc943 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -39,8 +39,13 @@
 #endif
 #endif
 
-#if defined(WIN32) && !defined(__CYGWIN__)
-#include "port/win32ntdll.h"
+#if defined(WIN32)
+/*
+ * XXX figure out how to get these from a system header
+ * https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/ntddk/ns-ntddk-_file_disposition_information_ex
+ */
+#define FILE_DISPOSITION_DELETE 0x00000001
+#define FILE_DISPOSITION_POSIX_SEMANTICS 0x00000002
 #endif
 
 #if defined(WIN32) || defined(__CYGWIN__)
@@ -48,6 +53,53 @@
 /* Externally visable only to allow testing. */
 int pgwin32_dirmod_loops = 100;
 
+#ifdef WIN32
+
+typedef struct FILE_RENAME_INFO_EXT {
+	FILE_RENAME_INFO fri;
+	wchar_t extra_space[MAXPGPATH];
+} FILE_RENAME_INFO_EXT;
+
+static int
+pgwin32_posix_rename(const char *from, const char *to)
+{
+	FILE_RENAME_INFO_EXT rename_info = {{0}};
+	HANDLE handle;
+
+	if (MultiByteToWideChar(CP_ACP, 0, to, -1, rename_info.fri.FileName, MAXPGPATH) == 0)
+	{
+		_dosmaperr(GetLastError());
+		return -1;
+	}
+	rename_info.fri.ReplaceIfExists = true;
+	rename_info.fri.Flags = FILE_RENAME_FLAG_POSIX_SEMANTICS | FILE_RENAME_FLAG_REPLACE_IF_EXISTS;
+	rename_info.fri.FileNameLength = wcslen(rename_info.fri.FileName);
+
+	handle = CreateFile(from,
+						GENERIC_READ | GENERIC_WRITE | DELETE,
+						FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
+						NULL,
+						OPEN_EXISTING,
+						FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT,
+						NULL);
+	if (handle == INVALID_HANDLE_VALUE)
+	{
+		_dosmaperr(GetLastError());
+		return -1;
+	}
+
+	if (!SetFileInformationByHandle(handle, FileRenameInfoEx, &rename_info, sizeof(FILE_RENAME_INFO_EXT)))
+	{
+		_dosmaperr(GetLastError());
+		CloseHandle(handle);
+		return -1;
+	}
+	CloseHandle(handle);
+	return 0;
+}
+
+#endif
+
 /*
  *	pgrename
  */
@@ -64,7 +116,7 @@ pgrename(const char *from, const char *to)
 	 * and blocking other backends.
 	 */
 #if defined(WIN32) && !defined(__CYGWIN__)
-	while (!MoveFileEx(from, to, MOVEFILE_REPLACE_EXISTING))
+	while (pgwin32_posix_rename(from, to) < 0)
 #else
 	while (rename(from, to) < 0)
 #endif
@@ -98,70 +150,61 @@ pgrename(const char *from, const char *to)
 	return 0;
 }
 
-/*
- * Check if _pglstat64()'s reason for failure was STATUS_DELETE_PENDING.
- * This doesn't apply to Cygwin, which has its own lstat() that would report
- * the case as EACCES.
-*/
-static bool
-lstat_error_was_status_delete_pending(void)
-{
-	if (errno != ENOENT)
-		return false;
 #if defined(WIN32) && !defined(__CYGWIN__)
-	if (pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
-		return true;
-#endif
-	return false;
+
+static int
+pgwin32_posix_unlink(const char *path)
+{
+	BY_HANDLE_FILE_INFORMATION info;
+	HANDLE handle;
+	ULONG flags;
+
+	flags = FILE_DISPOSITION_DELETE | FILE_DISPOSITION_POSIX_SEMANTICS;
+	handle = CreateFile(path,
+						GENERIC_READ | GENERIC_WRITE | DELETE,
+						FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
+						NULL,
+						OPEN_EXISTING,
+						FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OPEN_REPARSE_POINT,
+						NULL);
+	if (handle == INVALID_HANDLE_VALUE)
+	{
+		_dosmaperr(GetLastError());
+		return -1;
+	}
+	if (!GetFileInformationByHandle(handle, &info))
+	{
+		_dosmaperr(GetLastError());
+		CloseHandle(handle);
+		return -1;
+	}
+	/* Let junction points be unlinked this way, but not directories. */
+	if ((info.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) &&
+		!(info.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT))
+	{
+		CloseHandle(handle);
+		errno = EPERM;
+		return -1;
+	}
+	if (!SetFileInformationByHandle(handle, FileDispositionInfoEx, &flags, sizeof(flags)))
+	{
+		_dosmaperr(GetLastError());
+		CloseHandle(handle);
+		return -1;
+	}
+	CloseHandle(handle);
+	return 0;
 }
 
+#endif
+
 /*
  *	pgunlink
  */
 int
 pgunlink(const char *path)
 {
-	bool		is_lnk;
 	int			loops = 0;
-	struct stat st;
-
-	/*
-	 * This function might be called for a regular file or for a junction
-	 * point (which we use to emulate symlinks).  The latter must be unlinked
-	 * with rmdir() on Windows.  Before we worry about any of that, let's see
-	 * if we can unlink directly, since that's expected to be the most common
-	 * case.
-	 */
-	if (unlink(path) == 0)
-		return 0;
-	if (errno != EACCES)
-		return -1;
-
-	/*
-	 * EACCES is reported for many reasons including unlink() of a junction
-	 * point.  Check if that's the case so we can redirect to rmdir().
-	 *
-	 * Note that by checking only once, we can't cope with a path that changes
-	 * from regular file to junction point underneath us while we're retrying
-	 * due to sharing violations, but that seems unlikely.  We could perhaps
-	 * prevent that by holding a file handle ourselves across the lstat() and
-	 * the retry loop, but that seems like over-engineering for now.
-	 *
-	 * In the special case of a STATUS_DELETE_PENDING error (file already
-	 * unlinked, but someone still has it open), we don't want to report ENOENT
-	 * to the caller immediately, because rmdir(parent) would probably fail.
-	 * We want to wait until the file truly goes away so that simple recursive
-	 * directory unlink algorithms work.
-	 */
-	if (lstat(path, &st) < 0)
-	{
-		if (lstat_error_was_status_delete_pending())
-			is_lnk = false;
-		else
-			return -1;
-	}
-	else
-		is_lnk = S_ISLNK(st.st_mode);
 
 	/*
 	 * We need to loop because even though PostgreSQL uses flags that allow
@@ -170,7 +213,11 @@ pgunlink(const char *path)
 	 * someone else to close the file, as the caller might be holding locks
 	 * and blocking other backends.
 	 */
-	while ((is_lnk ? rmdir(path) : unlink(path)) < 0)
+#ifdef WIN32
+	while (pgwin32_posix_unlink(path) < 0)
+#else
+	while (unlink(path) < 0)
+#endif
 	{
 		if (errno != EACCES)
 			return -1;
diff --git a/src/port/t/001_filesystem.c b/src/port/t/001_filesystem.c
index 346978783a..e7997278ec 100644
--- a/src/port/t/001_filesystem.c
+++ b/src/port/t/001_filesystem.c
@@ -438,10 +438,10 @@ filesystem_metadata_tests(void)
 	 * Linux), though AIX/JFS1 is rumored to succeed.  However, our Windows
 	 * emulation doesn't allow it, because we want to avoid surprises by
 	 * behaving like nearly all Unix systems.  So we check this on Windows
-	 * only, where it fails with non-standard EACCES.
+	 * only, where our wrapper fails with EPERM.
 	 */
 	PG_EXPECT_SYS(unlink(path2) == -1, "Windows: can't unlink() a directory");
-	PG_EXPECT_EQ(errno, EACCES);
+	PG_EXPECT_EQ(errno, EPERM);
 #endif
 
 #ifdef WIN32
@@ -535,21 +535,22 @@ filesystem_metadata_tests(void)
 	fd = open(path2, O_RDWR | PG_BINARY, 0777);
 	PG_EXPECT_SYS(fd >= 0, "open name2.txt");
 	make_path(path2, "name2.txt");
-#ifdef WIN32
 
-	/*
-	 * Windows can't rename over an open non-unlinked file, even with
-	 * have_posix_unlink_semantics.
-	 */
-	pgwin32_dirmod_loops = 2;	/* minimize looping to fail fast in testing */
-	PG_EXPECT_SYS(rename(path, path2) == -1,
-				  "Windows: can't rename name1.txt -> name2.txt while name2.txt is open");
-	PG_EXPECT_EQ(errno, EACCES);
-	PG_EXPECT_SYS(unlink(path) == 0, "unlink name1.txt");
-#else
-	PG_EXPECT_SYS(rename(path, path2) == 0,
-				  "POSIX: can rename name1.txt -> name2.txt while name2.txt is open");
+	if (!have_posix_unlink_semantics)
+	{
+#ifdef WIN32
+		pgwin32_dirmod_loops = 2;	/* minimize looping to fail fast in testing */
 #endif
+		PG_EXPECT_SYS(rename(path, path2) == -1,
+					  "Windows non-POSIX: can't rename name1.txt -> name2.txt while name2.txt is open");
+		PG_EXPECT_EQ(errno, EACCES);
+		PG_EXPECT_SYS(unlink(path) == 0, "unlink name1.txt");
+	}
+	else
+	{
+		PG_EXPECT_SYS(rename(path, path2) == 0,
+					  "POSIX: can rename name1.txt -> name2.txt while name2.txt is open");
+	}
 	PG_EXPECT_SYS(close(fd) == 0);
 
 	make_path(path, "name1.txt");
-- 
2.37.3

Reply via email to