Here is a new attempt to fix this mess.  Disclaimer: this is based
entirely on reading the manual and vicariously hacking a computer I
don't have via CI.

The two basic ideas are:

 * keep per-socket event handles in a hash table
 * add our own level-triggered event memory

The socket table entries are reference counted, and exist as long as
the socket is currently in at least one WaitEventSet.  When creating a
new entry, extra polling logic re-checks the initial level-triggered
state (an overhead that we had in an ad-hoc way already, and that can
be avoided by more widespread use of long-lived WaitEventSets).  You
are not allowed to close a socket while it's in a WaitEventSet,
because then a new socket could be allocated with the same number and
chaos would ensue.  For example, if we revive the idea of hooking
libpq connections up to long-lived WaitEventSets, we'll probably need
to invent a libpq event callback that says 'I am going to close socket
X!', so you have a chance to remove the socket from any WaitEventSet
*before* it's closed, to maintain that invariant.  Other lazier ideas
are possible, but probably become impossible in a hypothetical
multi-threaded future.
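
To make the two ideas concrete, here is a minimal, portable sketch of a
reference-counted per-socket table with "sticky" event memory.  It is
not the patch: the names (SocketEntry, acquire, release, report_events,
clear_event) and the EV_* bits are made up for illustration, a small
fixed array stands in for the hash table, and no Winsock calls are
made.  It also omits the explicit re-check by polling that the patch
does before sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event bits standing in for FD_READ, FD_WRITE, FD_CLOSE. */
enum { EV_READ = 1, EV_WRITE = 2, EV_CLOSE = 4 };

#define MAX_SOCKETS 8			/* assumption: tiny fixed table, not a hash */

typedef struct SocketEntry
{
	int			sock;			/* socket number, the lookup key */
	int			refcount;		/* wait sets using this socket; 0 = free slot */
	int			sticky;			/* events reported once, possibly still true */
} SocketEntry;

static SocketEntry entries[MAX_SOCKETS];

/* Find the live entry for a socket, or NULL if there is none. */
SocketEntry *
lookup(int sock)
{
	for (int i = 0; i < MAX_SOCKETS; i++)
		if (entries[i].refcount > 0 && entries[i].sock == sock)
			return &entries[i];
	return NULL;
}

/* Take a reference, creating the entry on first use.  NULL if table full. */
SocketEntry *
acquire(int sock)
{
	SocketEntry *e = lookup(sock);

	if (e)
	{
		e->refcount++;
		return e;
	}
	for (int i = 0; i < MAX_SOCKETS; i++)
	{
		if (entries[i].refcount == 0)
		{
			entries[i].sock = sock;
			entries[i].refcount = 1;
			entries[i].sticky = 0;
			return &entries[i];
		}
	}
	return NULL;
}

/* Drop a reference.  Returns true if that destroyed the entry. */
bool
release(int sock)
{
	SocketEntry *e = lookup(sock);

	assert(e != NULL);
	return --e->refcount == 0;
}

/*
 * Merge newly reported (edge-triggered) events into the sticky mask and
 * return everything that should be reported to the caller right now.
 */
int
report_events(int sock, int new_events)
{
	SocketEntry *e = lookup(sock);

	assert(e != NULL);
	e->sticky |= new_events;
	return e->sticky;
}

/* What a recv()/send() wrapper would do after a successful call. */
void
clear_event(int sock, int event)
{
	SocketEntry *e = lookup(sock);

	if (e)
		e->sticky &= ~event;
}
```

The point of the sticky mask is that an event the OS reports only once,
such as a graceful close, keeps being returned by report_events() on
every later call until something clears it.  That is the
level-triggered behaviour that callers of WL_SOCKET_READABLE expect.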

With these changes, AFAIK it should be safe to reinstate graceful
socket shutdowns, to fix the field complaints about FATAL error
messages being eaten by a grue and the annoying random CI/BF failures.

Here are some other ideas that I considered but rejected for now:

1.  We could throw the WAIT_USE_WIN32 code away, and hack
WAIT_USE_POLL to use WSAPoll() on Windows; we could create a
'self-pipe' using a pair of connected AF_UNIX sockets to implement
latches and fake signals.  It seems like a lot of work, and makes
latches a bit worse (instead of "everything is an event!" we have
"everything is a socket!" with a helper thread, and we don't even have
socketpair() on this OS).  Blah.

2.  We could figure out how to do fancy asynchronous sockets and IOCP.
That's how NT really wants to talk to the world, it doesn't really
want to pretend to be Unix.  I expect that is where we'll get to
eventually but it's a much bigger cross-platform R&D job.

3.  Maybe there is a kind of partial step towards idea 2 that Andres
mentioned on another thread somewhere: one could use an IOCP, and then
use event callbacks that run on system threads to post IOCP messages
(a bit like we do for our fake waitpid()).
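
For reference, the 'self-pipe' part of idea 1 could look roughly like
the sketch below.  It is written for a POSIX system, because it uses
socketpair() and poll(); as noted above, Windows has no socketpair(),
so there the connected pair would have to be built by hand and poll()
swapped for WSAPoll().  The function names (latch_init, latch_set,
latch_wait) are hypothetical and are not PostgreSQL's latch API.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* wakeup_fds[0] is the end we wait on, wakeup_fds[1] the end we write to. */
static int	wakeup_fds[2] = {-1, -1};

/* Create the connected AF_UNIX pair and make both ends non-blocking. */
int
latch_init(void)
{
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, wakeup_fds) != 0)
		return -1;
	for (int i = 0; i < 2; i++)
	{
		int			flags = fcntl(wakeup_fds[i], F_GETFL);

		if (flags < 0 || fcntl(wakeup_fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
			return -1;
	}
	return 0;
}

/* Wake up any waiter by making the read end readable. */
void
latch_set(void)
{
	char		c = 0;
	ssize_t		rc = write(wakeup_fds[1], &c, 1);

	/* If the buffer is full (EAGAIN), a wakeup is already pending. */
	(void) rc;
}

/*
 * Wait until the latch is set or the timeout expires.  Returns 1 if it was
 * set, 0 on timeout, -1 on error.  Pending wakeup bytes are drained so that
 * the next wait sleeps again.
 */
int
latch_wait(int timeout_ms)
{
	struct pollfd pfd = {.fd = wakeup_fds[0], .events = POLLIN};
	char		buf[64];
	int			rc = poll(&pfd, 1, timeout_ms);

	if (rc <= 0)
		return rc;
	while (read(wakeup_fds[0], buf, sizeof(buf)) > 0)
		;
	return 1;
}
```

Since poll() is level-triggered, a wakeup byte written before the wait
starts is not lost, which is the property the WAIT_USE_WIN32 code has to
emulate by hand.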

What I have here is the simplest way I could see to patch up what we
already have, with the idea that in the fullness of time we'll
eventually get around to idea 2, once someone is ready to do the
press-ups.

Review/poking-with-a-stick/trying-to-break-it most welcome.
From 4614df55bd0e98b70b942543482e6a9eb767f718 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 1 Nov 2023 05:53:12 +1300
Subject: [PATCH v2 1/6] simplehash: Allow raw memory to be freed.

Commit 48995040d5e introduced SH_RAW_ALLOCATOR, but assumed that memory
allocated that way could be freed with pfree().  Allow SH_RAW_FREE to be
defined too, for cases where that isn't true.
---
 src/include/lib/simplehash.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index b7adc16b80..cd354e2f11 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -42,6 +42,7 @@
  *		declarations reside
  *	  - SH_RAW_ALLOCATOR - if defined, memory contexts are not used; instead,
  *	    use this to allocate bytes. The allocator must zero the returned space.
+ *	  - SH_RAW_FREE - free operation corresponding to SH_RAW_ALLOCATOR
  *	  - SH_USE_NONDEFAULT_ALLOCATOR - if defined no element allocator functions
  *		are defined, so you can supply your own
  *	  The following parameters are only relevant when SH_DEFINE is defined:
@@ -410,7 +411,11 @@ SH_ALLOCATE(SH_TYPE * type, Size size)
 static inline void
 SH_FREE(SH_TYPE * type, void *pointer)
 {
+#ifdef SH_RAW_FREE
+	SH_RAW_FREE(pointer);
+#else
 	pfree(pointer);
+#endif
 }
 
 #endif
@@ -458,7 +463,11 @@ SH_SCOPE void
 SH_DESTROY(SH_TYPE * tb)
 {
 	SH_FREE(tb, tb->data);
+#ifdef SH_RAW_FREE
+	SH_RAW_FREE(tb);
+#else
 	pfree(tb);
+#endif
 }
 
 /* reset the contents of a previously created hash table */
-- 
2.42.0

From 3ee479b182ac53e4839da1156ef30cd9f2523de7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 1 Nov 2023 06:51:38 +1300
Subject: [PATCH v2 2/6] simplehash: Allow raw allocation to fail.

Commit 48995040d5e allowed for raw allocators to be used instead of the
MemoryContext API, but didn't contemplate allocation failure.  Teach the
grow and insert operations to report failure to the caller.
---
 src/include/lib/simplehash.h | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index cd354e2f11..d1034baf67 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -205,7 +205,7 @@ SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
 SH_SCOPE void SH_RESET(SH_TYPE * tb);
 
 /* void <prefix>_grow(<prefix>_hash *tb, uint64 newsize) */
-SH_SCOPE void SH_GROW(SH_TYPE * tb, uint64 newsize);
+SH_SCOPE bool SH_GROW(SH_TYPE * tb, uint64 newsize);
 
 /* <element> *<prefix>_insert(<prefix>_hash *tb, <key> key, bool *found) */
 SH_SCOPE	SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found);
@@ -442,6 +442,8 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data)
 
 #ifdef SH_RAW_ALLOCATOR
 	tb = (SH_TYPE *) SH_RAW_ALLOCATOR(sizeof(SH_TYPE));
+	if (!tb)
+		return NULL;
 #else
 	tb = (SH_TYPE *) MemoryContextAllocZero(ctx, sizeof(SH_TYPE));
 	tb->ctx = ctx;
@@ -454,6 +456,17 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data)
 	SH_COMPUTE_PARAMETERS(tb, size);
 
 	tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
+#ifdef SH_RAW_ALLOCATOR
+	if (!tb->data)
+	{
+#ifdef SH_RAW_FREE
+		SH_RAW_FREE(tb);
+#else
+		pfree(tb);
+#endif
+		return NULL;
+	}
+#endif
 
 	return tb;
 }
@@ -485,7 +498,7 @@ SH_RESET(SH_TYPE * tb)
  * necessary. But resizing to the exact input size can be advantageous
  * performance-wise, when known at some point.
  */
-SH_SCOPE void
+SH_SCOPE bool
 SH_GROW(SH_TYPE * tb, uint64 newsize)
 {
 	uint64		oldsize = tb->size;
@@ -502,9 +515,13 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	/* compute parameters for new table */
 	SH_COMPUTE_PARAMETERS(tb, newsize);
 
-	tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
+	newdata = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * tb->size);
+#ifdef SH_RAW_ALLOCATOR
+	if (!newdata)
+		return false;
+#endif
 
-	newdata = tb->data;
+	tb->data = newdata;
 
 	/*
 	 * Copy entries from the old data to newdata. We theoretically could use
@@ -589,6 +606,8 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	}
 
 	SH_FREE(tb, olddata);
+
+	return true;
 }
 
 /*
@@ -623,7 +642,11 @@ restart:
 		 * When optimizing, it can be very useful to print these out.
 		 */
 		/* SH_STAT(tb); */
-		SH_GROW(tb, tb->size * 2);
+		if (!SH_GROW(tb, tb->size * 2))
+		{
+			*found = false;
+			return NULL;
+		}
 		/* SH_STAT(tb); */
 	}
 
-- 
2.42.0

From 03a5bd46e5cdf1e551a64b16bb8915aead67ad85 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 19 Mar 2023 16:07:20 +1300
Subject: [PATCH v2 3/6] Redesign Windows socket event management.

Previously, we created a Winsock event handle for each socket in each
WaitEventSet, and then we translated an FD_CLOSE event directly to
WL_SOCKET_READABLE.  Since FD_CLOSE is reported only once when the
remote end shuts down gracefully, we could hang in rare scenarios where
backend code relies on WL_SOCKET_READABLE being level-triggered.

We got away with this in the past when the thing on the other end of the
socket was another PostgreSQL server (ie via postgres_fdw, replication
etc), because the remote server would exit without shutting down or
closing its socket, and that produces a repeating 'abortive' FD_CLOSE.
We'd like to change that as it also eats error messages, producing user
complaints and random CI failures, but that's a separate issue and
we'll need to fix this first.

New design:

* for each socket, we now create just one event handle to be used by
  all WaitEventSet objects that are interested in the socket

* for each socket, we now track a set of sticky events that are reported
  as poll() would until they are cleared by either the send()/recv()
  wrappers, or failing that by an explicit re-check

The lifetime management of event handles and associated state is done
by reference counting.
---
 src/backend/port/win32/socket.c  | 364 +++++++++++++++++++++++++++++++
 src/backend/storage/ipc/latch.c  | 212 ++++++------------
 src/include/port/win32_port.h    |   6 +
 src/include/storage/latch.h      |   3 -
 src/tools/pgindent/typedefs.list |   1 +
 5 files changed, 441 insertions(+), 145 deletions(-)

diff --git a/src/backend/port/win32/socket.c b/src/backend/port/win32/socket.c
index 9c339397d1..a7fa98cb1d 100644
--- a/src/backend/port/win32/socket.c
+++ b/src/backend/port/win32/socket.c
@@ -13,6 +13,8 @@
 
 #include "postgres.h"
 
+#include "common/hashfn.h"
+
 /*
  * Indicate if pgwin32_recv() and pgwin32_send() should operate
  * in non-blocking mode.
@@ -37,6 +39,77 @@ int			pgwin32_noblock = 0;
 #undef recv
 #undef send
 
+/*
+ * An entry in our socket table.
+ */
+typedef struct SocketTableEntry
+{
+	SOCKET		sock;
+	char		status;
+
+	/*
+	 * The reference count for the event handle.  Client code that wants to
+	 * use the event functions must acquire a reference and release it when
+	 * finished.
+	 */
+	int			reference_count;
+
+	/*
+	 * The FD_XXX events that were most recently selected for this socket
+	 * number with WSAEventSelect().
+	 */
+	int			selected_events;
+
+	/*
+	 * The FD_XXX events already reported by Winsock, that we'll continue to
+	 * report as long as they are true.  They are cleared by our send/recv
+	 * wrappers, because those are 're-enabling' functions that will cause
+	 * Winsock to report them again.  They are also cleared by an explicit
+	 * check we perform for the benefit of hypothetical code that might
+	 * reach Winsock send/recv functions without going via our wrappers.
+	 */
+	int			level_triggered_events;
+
+	/*
+	 * Windows kernel event most recently associated with the socket number.
+	 */
+	HANDLE		event_handle;
+} SocketTableEntry;
+
+static inline void *
+malloc0(size_t size)
+{
+	void	   *result;
+
+	result = malloc(size);
+	if (result)
+		memset(result, 0, size);
+
+	return result;
+}
+
+/*
+ * It almost seems feasible to use an array to store our per-socket state,
+ * based on the observation that Windows socket descriptors seem to be small
+ * integers as on Unix, but the manual warns against making that assumption.
+ * So we use a hash table.
+ */
+
+#define SH_PREFIX socket_table
+#define SH_ELEMENT_TYPE SocketTableEntry
+#define SH_RAW_ALLOCATOR malloc0
+#define SH_RAW_FREE free
+#define SH_SCOPE static inline
+#define SH_KEY_TYPE SOCKET
+#define SH_KEY sock
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static socket_table_hash * socket_table;
+
 /*
  * Blocking socket functions implemented so they listen on both
  * the socket and the signal event, required for signal handling.
@@ -310,6 +383,265 @@ pgwin32_socket(int af, int type, int protocol)
 	return s;
 }
 
+/*
+ * Check if any of FD_READ, FD_WRITE or FD_CLOSE is still true.  Used to
+ * re-check level-triggered events.
+ */
+static int
+pgwin32_socket_poll(SOCKET s, int events)
+{
+	int			revents = 0;
+
+	if (events & (FD_READ | FD_CLOSE))
+	{
+		ssize_t		rc;
+		char		c;
+
+		rc = recv(s, &c, 1, MSG_PEEK);
+		if (rc == 1)
+		{
+			/* At least one byte to read. */
+			if (events & FD_READ)
+				revents |= FD_READ;
+		}
+		else if (rc == 0 || WSAGetLastError() != WSAEWOULDBLOCK)
+		{
+			/* EOF due to graceful shutdown, or error. */
+			if (events & FD_CLOSE)
+				revents |= FD_CLOSE;
+		}
+	}
+
+	if (events & FD_WRITE)
+	{
+		char		c;
+
+		/* If it looks like we could write or get an error, report that. */
+		if (send(s, &c, 0, 0) == 0 || WSAGetLastError() != WSAEWOULDBLOCK)
+			revents |= FD_WRITE;
+	}
+
+	return revents;
+}
+
+/*
+ * Adjust the set of FD_XXX events this socket's event handle should wake up
+ * for.  Returns 0 on success, otherwise -1 and sets errno.
+ */
+int
+pgwin32_socket_select_events(SOCKET s, int selected_events)
+{
+	SocketTableEntry *entry;
+
+	Assert(socket_table);
+	entry = socket_table_lookup(socket_table, s);
+
+	Assert(entry);
+	Assert(entry->reference_count > 0);
+	Assert(entry->event_handle != WSA_INVALID_EVENT);
+
+	/* Do nothing if no change. */
+	if (selected_events == entry->selected_events)
+		return 0;
+
+	/*
+	 * Tell Winsock to link the socket to the event handle, and which events
+	 * we're interested in.
+	 */
+	if (WSAEventSelect(s, entry->event_handle, selected_events) == SOCKET_ERROR)
+	{
+		TranslateSocketError();
+		return -1;
+	}
+
+	entry->selected_events = selected_events;
+
+	/*
+	 * The manual tells us: "Issuing a WSAEventSelect for a socket cancels any
+	 * previous WSAAsyncSelect or WSAEventSelect for the same socket and
+	 * clears the internal network event record."  If that is true, we might
+	 * have wiped an internal flag we're interested in.  Close that race by
+	 * triggering an explicit poll before we sleep, by pretending we have seen
+	 * all of these events.
+	 */
+	if (selected_events & (FD_READ | FD_WRITE))
+		entry->level_triggered_events = selected_events & (FD_READ | FD_WRITE | FD_CLOSE);
+	else
+		entry->level_triggered_events = 0;
+
+	return 0;
+}
+
+/*
+ * Before waiting on the event handle, check if we have pending
+ * level-triggered events that are still true, and if so take measures to
+ * prevent the sleep.
+ */
+void
+pgwin32_socket_prepare_to_wait(SOCKET s)
+{
+	SocketTableEntry *entry;
+
+	Assert(socket_table);
+	entry = socket_table_lookup(socket_table, s);
+
+	Assert(entry);
+	Assert(entry->reference_count > 0);
+	Assert(entry->event_handle != WSA_INVALID_EVENT);
+
+	/*
+	 * If we're not waiting for FD_READ or FD_WRITE, don't try to poll the
+	 * socket.  Server sockets and client sockets that haven't connected yet
+	 * can't be polled by that technique.
+	 */
+	if ((entry->selected_events & (FD_READ | FD_WRITE)) &&
+		entry->level_triggered_events != 0)
+	{
+		/*
+		 * Re-check the level-triggered events we have recorded.  This is
+		 * necessary because someone might access WSASend()/WSARecv() directly
+		 * without going via our wrapper functions, so they might never be
+		 * cleared otherwise.
+		 */
+		entry->level_triggered_events =
+			pgwin32_socket_poll(s,
+								entry->level_triggered_events & entry->selected_events);
+		if (entry->level_triggered_events)
+		{
+			/*
+			 * At least one readiness condition is still true.  Prevent
+			 * sleeping, and let pgwin32_socket_enumerate_events() report
+			 * these level-triggered events.
+			 */
+			WSASetEvent(entry->event_handle);
+		}
+	}
+}
+
+/*
+ * After the Windows event handle has been signaled, this function can be
+ * called to find out which socket events occurred, and atomically reset the
+ * event handle for the next sleep.
+ *
+ * The events returned are also remembered in our level-triggered event mask,
+ * so they'll prevent sleeping and be reported again as long as they remain
+ * true.
+ */
+int
+pgwin32_socket_enumerate_events(SOCKET s)
+{
+	WSANETWORKEVENTS new_events = {0};
+	SocketTableEntry *entry;
+	int			result;
+
+	Assert(socket_table);
+	entry = socket_table_lookup(socket_table, s);
+
+	Assert(entry);
+	Assert(entry->reference_count > 0);
+	Assert(entry->event_handle != WSA_INVALID_EVENT);
+
+	/*
+	 * Atomically consume the internal network event record and reset the
+	 * associated event handle.  This guarantees that we can't miss future
+	 * wakeups.
+	 */
+	if (WSAEnumNetworkEvents(s, entry->event_handle, &new_events) != 0)
+	{
+		TranslateSocketError();
+		return -1;
+	}
+
+	/* Add any events pgwin32_socket_prepare_to_wait() decided to feed us. */
+	result = entry->level_triggered_events | new_events.lNetworkEvents;
+
+	/* Remember certain events for next time around. */
+	if (entry->selected_events & (FD_READ | FD_WRITE))
+		entry->level_triggered_events = result & (FD_READ | FD_WRITE | FD_CLOSE);
+	else
+		entry->level_triggered_events = 0;
+
+	return result;
+}
+
+/*
+ * Acquire a reference-counted Windows event handle for this socket.  This can
+ * be used for waiting for socket events.  Returns NULL and sets errno on
+ * failure.
+ */
+HANDLE
+pgwin32_socket_acquire_event_handle(SOCKET s)
+{
+	SocketTableEntry *entry;
+	bool		found;
+
+	/* First-time initialization. */
+	if (unlikely(socket_table == NULL))
+	{
+		socket_table = socket_table_create(16, NULL);
+		if (socket_table == NULL)
+		{
+			errno = ENOMEM;
+			return NULL;
+		}
+	}
+
+	/* If we already have it, just bump the count. */
+	entry = socket_table_insert(socket_table, s, &found);
+	if (likely(found))
+	{
+		Assert(entry->event_handle != WSA_INVALID_EVENT);
+		entry->reference_count++;
+		return entry->event_handle;
+	}
+
+	/* Did we run out of memory? */
+	if (entry == NULL)
+	{
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	/* Allocate a new event handle. */
+	entry->event_handle = WSACreateEvent();
+	if (entry->event_handle == WSA_INVALID_EVENT)
+	{
+		socket_table_delete_item(socket_table, entry);
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	entry->selected_events = 0;
+	entry->level_triggered_events = 0;
+	entry->reference_count = 1;
+
+	return entry->event_handle;
+}
+
+/*
+ * Release a reference-counted event handle.
+ */
+void
+pgwin32_socket_release_event_handle(SOCKET s)
+{
+	SocketTableEntry *entry;
+
+	Assert(socket_table);
+	entry = socket_table_lookup(socket_table, s);
+
+	Assert(entry);
+	Assert(entry->reference_count > 0);
+	Assert(entry->event_handle != WSA_INVALID_EVENT);
+
+	if (--entry->reference_count == 0)
+	{
+		WSACloseEvent(entry->event_handle);
+		socket_table_delete_item(socket_table, entry);
+
+		/* XXX Free socket_table if it is empty? */
+	}
+}
+
 int
 pgwin32_bind(SOCKET s, struct sockaddr *addr, int addrlen)
 {
@@ -402,6 +734,22 @@ pgwin32_recv(SOCKET s, char *buf, int len, int f)
 		return -1;
 	}
 
+	/*
+	 * WSARecv() is a re-enabling function for Winsock's FD_READ event, so it
+	 * is now safe to clear our level-triggered flag.  This is only an
+	 * optimization for a common case, and not required for correctness.  If
+	 * someone calls WSARecv() directly instead of going through this wrapper,
+	 * pgwin32_socket_prepare_to_wait() will figure that out and clear it
+	 * anyway.
+	 */
+	if (socket_table)
+	{
+		SocketTableEntry *entry = socket_table_lookup(socket_table, s);
+
+		if (entry)
+			entry->level_triggered_events &= ~FD_READ;
+	}
+
 	if (pgwin32_noblock)
 	{
 		/*
@@ -485,6 +833,22 @@ pgwin32_send(SOCKET s, const void *buf, int len, int flags)
 			return -1;
 		}
 
+		/*
+		 * WSASend() is a re-enabling function for Winsock's FD_WRITE event,
+		 * so it is now safe to clear our level-triggered flag.  This is only
+		 * an optimization for a common case, and not required for
+		 * correctness.  If someone calls WSASend() directly instead of going
+		 * through this wrapper, pgwin32_socket_prepare_to_wait() will figure
+		 * that out and clear it anyway.
+		 */
+		if (socket_table)
+		{
+			SocketTableEntry *entry = socket_table_lookup(socket_table, s);
+
+			if (entry)
+				entry->level_triggered_events &= ~FD_WRITE;
+		}
+
 		if (pgwin32_noblock)
 		{
 			/*
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 2fd386a4ed..5bf03a3cd9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -847,20 +847,9 @@ FreeWaitEventSet(WaitEventSet *set)
 		 cur_event < (set->events + set->nevents);
 		 cur_event++)
 	{
-		if (cur_event->events & WL_LATCH_SET)
-		{
-			/* uses the latch's HANDLE */
-		}
-		else if (cur_event->events & WL_POSTMASTER_DEATH)
-		{
-			/* uses PostmasterHandle */
-		}
-		else
-		{
-			/* Clean up the event object we created for the socket */
-			WSAEventSelect(cur_event->fd, NULL, 0);
-			WSACloseEvent(set->handles[cur_event->pos + 1]);
-		}
+		/* Release reference to socket's event handle. */
+		if (cur_event->events & WL_SOCKET_MASK)
+			pgwin32_socket_release_event_handle(cur_event->fd);
 	}
 #endif
 
@@ -955,9 +944,6 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
 	event->fd = fd;
 	event->events = events;
 	event->user_data = user_data;
-#ifdef WIN32
-	event->reset = false;
-#endif
 
 	if (events == WL_LATCH_SET)
 	{
@@ -976,10 +962,21 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
 	}
 	else if (events == WL_POSTMASTER_DEATH)
 	{
-#ifndef WIN32
+#if defined(WAIT_USE_WIN32)
+		set->handles[event->pos + 1] = PostmasterHandle;
+		event->fd = PGINVALID_SOCKET;
+#else
 		event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
 #endif
 	}
+	else if (events & WL_SOCKET_MASK)
+	{
+#if defined(WAIT_USE_WIN32)
+		set->handles[event->pos + 1] = pgwin32_socket_acquire_event_handle(fd);
+		if (!set->handles[event->pos + 1])
+			elog(ERROR, "could not acquire socket event handle: %m");
+#endif
+	}
 
 	/* perform wait primitive specific initialization, if needed */
 #if defined(WAIT_USE_EPOLL)
@@ -1322,45 +1319,52 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events)
 #endif
 
 #if defined(WAIT_USE_WIN32)
+static int
+ToWinsockEvents(int pg_events)
+{
+	int			winsock_events = 0;
+
+	if (pg_events & WL_SOCKET_READABLE)
+		winsock_events |= FD_CLOSE | FD_READ;
+	if (pg_events & WL_SOCKET_WRITEABLE)
+		winsock_events |= FD_CLOSE | FD_WRITE;
+	if (pg_events & WL_SOCKET_CONNECTED)
+		winsock_events |= FD_CLOSE | FD_CONNECT;
+	if (pg_events & WL_SOCKET_ACCEPT)
+		winsock_events |= FD_CLOSE | FD_ACCEPT;
+
+	return winsock_events;
+}
+
+static int
+FromWinsockEvents(int winsock_events)
+{
+	int			pg_events = 0;
+
+	if (winsock_events & (FD_CLOSE | FD_READ))
+		pg_events |= WL_SOCKET_READABLE;
+	if (winsock_events & (FD_CLOSE | FD_WRITE))
+		pg_events |= WL_SOCKET_WRITEABLE;
+	if (winsock_events & (FD_CLOSE | FD_CONNECT))
+		pg_events |= WL_SOCKET_CONNECTED;
+	if (winsock_events & (FD_CLOSE | FD_ACCEPT))
+		pg_events |= WL_SOCKET_ACCEPT;
+
+	return pg_events;
+}
+
 static void
 WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
 {
-	HANDLE	   *handle = &set->handles[event->pos + 1];
-
-	if (event->events == WL_LATCH_SET)
+	if (event->events & WL_LATCH_SET)
 	{
-		Assert(set->latch != NULL);
-		*handle = set->latch->event;
+		set->handles[event->pos + 1] = set->latch->event;
 	}
-	else if (event->events == WL_POSTMASTER_DEATH)
-	{
-		*handle = PostmasterHandle;
-	}
-	else
+	else if (event->events & WL_SOCKET_MASK)
 	{
-		int			flags = FD_CLOSE;	/* always check for errors/EOF */
-
-		if (event->events & WL_SOCKET_READABLE)
-			flags |= FD_READ;
-		if (event->events & WL_SOCKET_WRITEABLE)
-			flags |= FD_WRITE;
-		if (event->events & WL_SOCKET_CONNECTED)
-			flags |= FD_CONNECT;
-		if (event->events & WL_SOCKET_ACCEPT)
-			flags |= FD_ACCEPT;
-
-		if (*handle == WSA_INVALID_EVENT)
-		{
-			*handle = WSACreateEvent();
-			if (*handle == WSA_INVALID_EVENT)
-				elog(ERROR, "failed to create event for socket: error code %d",
-					 WSAGetLastError());
-		}
-		if (WSAEventSelect(event->fd, *handle, flags) != 0)
-			elog(ERROR, "failed to set up event for socket: error code %d",
-				 WSAGetLastError());
-
-		Assert(event->fd != PGINVALID_SOCKET);
+		if (pgwin32_socket_select_events(event->fd,
+										 ToWinsockEvents(event->events)) < 0)
+			elog(ERROR, "failed to set up event for socket: %m");
 	}
 }
 #endif
@@ -1945,48 +1949,16 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	DWORD		rc;
 	WaitEvent  *cur_event;
 
-	/* Reset any wait events that need it */
+	/*
+	 * Allow level-triggered events to be signaled, causing
+	 * WaitForMultipleObjects() to return immediately.
+	 */
 	for (cur_event = set->events;
 		 cur_event < (set->events + set->nevents);
 		 cur_event++)
 	{
-		if (cur_event->reset)
-		{
-			WaitEventAdjustWin32(set, cur_event);
-			cur_event->reset = false;
-		}
-
-		/*
-		 * Windows does not guarantee to log an FD_WRITE network event
-		 * indicating that more data can be sent unless the previous send()
-		 * failed with WSAEWOULDBLOCK.  While our caller might well have made
-		 * such a call, we cannot assume that here.  Therefore, if waiting for
-		 * write-ready, force the issue by doing a dummy send().  If the dummy
-		 * send() succeeds, assume that the socket is in fact write-ready, and
-		 * return immediately.  Also, if it fails with something other than
-		 * WSAEWOULDBLOCK, return a write-ready indication to let our caller
-		 * deal with the error condition.
-		 */
-		if (cur_event->events & WL_SOCKET_WRITEABLE)
-		{
-			char		c;
-			WSABUF		buf;
-			DWORD		sent;
-			int			r;
-
-			buf.buf = &c;
-			buf.len = 0;
-
-			r = WSASend(cur_event->fd, &buf, 1, &sent, 0, NULL, NULL);
-			if (r == 0 || WSAGetLastError() != WSAEWOULDBLOCK)
-			{
-				occurred_events->pos = cur_event->pos;
-				occurred_events->user_data = cur_event->user_data;
-				occurred_events->events = WL_SOCKET_WRITEABLE;
-				occurred_events->fd = cur_event->fd;
-				return 1;
-			}
-		}
+		if (cur_event->events & WL_SOCKET_MASK)
+			pgwin32_socket_prepare_to_wait(cur_event->fd);
 	}
 
 	/*
@@ -2067,64 +2039,20 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		}
 		else if (cur_event->events & WL_SOCKET_MASK)
 		{
-			WSANETWORKEVENTS resEvents;
-			HANDLE		handle = set->handles[cur_event->pos + 1];
+			int			winsock_events;
+			int			pg_events;
 
 			Assert(cur_event->fd);
 
-			occurred_events->fd = cur_event->fd;
+			winsock_events = pgwin32_socket_enumerate_events(cur_event->fd);
+			if (winsock_events < 0)
+				elog(ERROR, "could not enumerate socket events: %m");
 
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(cur_event->fd, handle, &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %d",
-					 WSAGetLastError());
-			if ((cur_event->events & WL_SOCKET_READABLE) &&
-				(resEvents.lNetworkEvents & FD_READ))
-			{
-				/* data available in socket */
-				occurred_events->events |= WL_SOCKET_READABLE;
-
-				/*------
-				 * WaitForMultipleObjects doesn't guarantee that a read event
-				 * will be returned if the latch is set at the same time.  Even
-				 * if it did, the caller might drop that event expecting it to
-				 * reoccur on next call.  So, we must force the event to be
-				 * reset if this WaitEventSet is used again in order to avoid
-				 * an indefinite hang.
-				 *
-				 * Refer
-				 * https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx
-				 * for the behavior of socket events.
-				 *------
-				 */
-				cur_event->reset = true;
-			}
-			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
-				(resEvents.lNetworkEvents & FD_WRITE))
-			{
-				/* writeable */
-				occurred_events->events |= WL_SOCKET_WRITEABLE;
-			}
-			if ((cur_event->events & WL_SOCKET_CONNECTED) &&
-				(resEvents.lNetworkEvents & FD_CONNECT))
-			{
-				/* connected */
-				occurred_events->events |= WL_SOCKET_CONNECTED;
-			}
-			if ((cur_event->events & WL_SOCKET_ACCEPT) &&
-				(resEvents.lNetworkEvents & FD_ACCEPT))
-			{
-				/* incoming connection could be accepted */
-				occurred_events->events |= WL_SOCKET_ACCEPT;
-			}
-			if (resEvents.lNetworkEvents & FD_CLOSE)
-			{
-				/* EOF/error, so signal all caller-requested socket flags */
-				occurred_events->events |= (cur_event->events & WL_SOCKET_MASK);
-			}
-
-			if (occurred_events->events != 0)
+			pg_events = FromWinsockEvents(winsock_events) & cur_event->events;
+			if (pg_events)
 			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events->events = pg_events;
 				occurred_events++;
 				returned_events++;
 			}
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 27a11c7868..a0ed6aaeaa 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -506,6 +506,12 @@ extern int	pgwin32_recv(SOCKET s, char *buf, int len, int flags);
 extern int	pgwin32_send(SOCKET s, const void *buf, int len, int flags);
 extern int	pgwin32_waitforsinglesocket(SOCKET s, int what, int timeout);
 
+extern HANDLE pgwin32_socket_acquire_event_handle(SOCKET s);
+extern void pgwin32_socket_release_event_handle(SOCKET s);
+extern int	pgwin32_socket_select_events(SOCKET s, int events);
+extern void pgwin32_socket_prepare_to_wait(SOCKET s);
+extern int	pgwin32_socket_enumerate_events(SOCKET s);
+
 extern PGDLLIMPORT int pgwin32_noblock;
 
 #endif							/* FRONTEND */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 99cc47874a..cbcc5ef23f 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -153,9 +153,6 @@ typedef struct WaitEvent
 	uint32		events;			/* triggered events */
 	pgsocket	fd;				/* socket fd associated with event */
 	void	   *user_data;		/* pointer provided in AddWaitEventToSet */
-#ifdef WIN32
-	bool		reset;			/* Is reset of the event required? */
-#endif
 } WaitEvent;
 
 /* forward declaration to avoid exposing latch.c implementation details */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bf50a32119..15dd7fa2b8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2585,6 +2585,7 @@ Snapshot
 SnapshotData
 SnapshotType
 SockAddr
+SocketTableEntry
 Sort
 SortBy
 SortByDir
-- 
2.42.0

From c561d1f7091368c5e09735ab099eb4a656587dfb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 10 Nov 2023 08:41:28 +1300
Subject: [PATCH v2 4/6] Remove pgwin32_select().

pgwin32_select(), used to replace select() in backend code on Windows,
would need to be updated to use the new per-socket event handles.  Since
the last remaining user of select() in the backend is scheduled for
replacement with the WaitEventSet API, it seems better to demolish it
instead.

Any extension code that is relying on select() with fake signals will
still compile, but will no longer respond to signals.  Hypothetical code
like that is probably buggy anyway, because backend code should also be
handling interrupts, and should switch to the various WaitEventSet APIs.
---
 src/backend/port/win32/socket.c | 200 --------------------------------
 src/include/port/win32_port.h   |   2 -
 2 files changed, 202 deletions(-)

diff --git a/src/backend/port/win32/socket.c b/src/backend/port/win32/socket.c
index a7fa98cb1d..d0cf08392f 100644
--- a/src/backend/port/win32/socket.c
+++ b/src/backend/port/win32/socket.c
@@ -867,203 +867,3 @@ pgwin32_send(SOCKET s, const void *buf, int len, int flags)
 
 	return -1;
 }
-
-
-/*
- * Wait for activity on one or more sockets.
- * While waiting, allow signals to run
- *
- * NOTE! Currently does not implement exceptfds check,
- * since it is not used in postgresql!
- */
-int
-pgwin32_select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, const struct timeval *timeout)
-{
-	WSAEVENT	events[FD_SETSIZE * 2]; /* worst case is readfds totally
-										 * different from writefds, so
-										 * 2*FD_SETSIZE sockets */
-	SOCKET		sockets[FD_SETSIZE * 2];
-	int			numevents = 0;
-	int			i;
-	int			r;
-	DWORD		timeoutval = WSA_INFINITE;
-	FD_SET		outreadfds;
-	FD_SET		outwritefds;
-	int			nummatches = 0;
-
-	Assert(exceptfds == NULL);
-
-	if (pgwin32_poll_signals())
-		return -1;
-
-	FD_ZERO(&outreadfds);
-	FD_ZERO(&outwritefds);
-
-	/*
-	 * Windows does not guarantee to log an FD_WRITE network event indicating
-	 * that more data can be sent unless the previous send() failed with
-	 * WSAEWOULDBLOCK.  While our caller might well have made such a call, we
-	 * cannot assume that here.  Therefore, if waiting for write-ready, force
-	 * the issue by doing a dummy send().  If the dummy send() succeeds,
-	 * assume that the socket is in fact write-ready, and return immediately.
-	 * Also, if it fails with something other than WSAEWOULDBLOCK, return a
-	 * write-ready indication to let our caller deal with the error condition.
-	 */
-	if (writefds != NULL)
-	{
-		for (i = 0; i < writefds->fd_count; i++)
-		{
-			char		c;
-			WSABUF		buf;
-			DWORD		sent;
-
-			buf.buf = &c;
-			buf.len = 0;
-
-			r = WSASend(writefds->fd_array[i], &buf, 1, &sent, 0, NULL, NULL);
-			if (r == 0 || WSAGetLastError() != WSAEWOULDBLOCK)
-				FD_SET(writefds->fd_array[i], &outwritefds);
-		}
-
-		/* If we found any write-ready sockets, just return them immediately */
-		if (outwritefds.fd_count > 0)
-		{
-			memcpy(writefds, &outwritefds, sizeof(fd_set));
-			if (readfds)
-				FD_ZERO(readfds);
-			return outwritefds.fd_count;
-		}
-	}
-
-
-	/* Now set up for an actual select */
-
-	if (timeout != NULL)
-	{
-		/* timeoutval is in milliseconds */
-		timeoutval = timeout->tv_sec * 1000 + timeout->tv_usec / 1000;
-	}
-
-	if (readfds != NULL)
-	{
-		for (i = 0; i < readfds->fd_count; i++)
-		{
-			events[numevents] = WSACreateEvent();
-			sockets[numevents] = readfds->fd_array[i];
-			numevents++;
-		}
-	}
-	if (writefds != NULL)
-	{
-		for (i = 0; i < writefds->fd_count; i++)
-		{
-			if (!readfds ||
-				!FD_ISSET(writefds->fd_array[i], readfds))
-			{
-				/* If the socket is not in the read list */
-				events[numevents] = WSACreateEvent();
-				sockets[numevents] = writefds->fd_array[i];
-				numevents++;
-			}
-		}
-	}
-
-	for (i = 0; i < numevents; i++)
-	{
-		int			flags = 0;
-
-		if (readfds && FD_ISSET(sockets[i], readfds))
-			flags |= FD_READ | FD_ACCEPT | FD_CLOSE;
-
-		if (writefds && FD_ISSET(sockets[i], writefds))
-			flags |= FD_WRITE | FD_CLOSE;
-
-		if (WSAEventSelect(sockets[i], events[i], flags) != 0)
-		{
-			TranslateSocketError();
-			/* release already-assigned event objects */
-			while (--i >= 0)
-				WSAEventSelect(sockets[i], NULL, 0);
-			for (i = 0; i < numevents; i++)
-				WSACloseEvent(events[i]);
-			return -1;
-		}
-	}
-
-	events[numevents] = pgwin32_signal_event;
-	r = WaitForMultipleObjectsEx(numevents + 1, events, FALSE, timeoutval, TRUE);
-	if (r != WAIT_TIMEOUT && r != WAIT_IO_COMPLETION && r != (WAIT_OBJECT_0 + numevents))
-	{
-		/*
-		 * We scan all events, even those not signaled, in case more than one
-		 * event has been tagged but Wait.. can only return one.
-		 */
-		WSANETWORKEVENTS resEvents;
-
-		for (i = 0; i < numevents; i++)
-		{
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(sockets[i], events[i], &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %d",
-					 WSAGetLastError());
-			/* Read activity? */
-			if (readfds && FD_ISSET(sockets[i], readfds))
-			{
-				if ((resEvents.lNetworkEvents & FD_READ) ||
-					(resEvents.lNetworkEvents & FD_ACCEPT) ||
-					(resEvents.lNetworkEvents & FD_CLOSE))
-				{
-					FD_SET(sockets[i], &outreadfds);
-
-					nummatches++;
-				}
-			}
-			/* Write activity? */
-			if (writefds && FD_ISSET(sockets[i], writefds))
-			{
-				if ((resEvents.lNetworkEvents & FD_WRITE) ||
-					(resEvents.lNetworkEvents & FD_CLOSE))
-				{
-					FD_SET(sockets[i], &outwritefds);
-
-					nummatches++;
-				}
-			}
-		}
-	}
-
-	/* Clean up all the event objects */
-	for (i = 0; i < numevents; i++)
-	{
-		WSAEventSelect(sockets[i], NULL, 0);
-		WSACloseEvent(events[i]);
-	}
-
-	if (r == WSA_WAIT_TIMEOUT)
-	{
-		if (readfds)
-			FD_ZERO(readfds);
-		if (writefds)
-			FD_ZERO(writefds);
-		return 0;
-	}
-
-	/* Signal-like events. */
-	if (r == WAIT_OBJECT_0 + numevents || r == WAIT_IO_COMPLETION)
-	{
-		pgwin32_dispatch_queued_signals();
-		errno = EINTR;
-		if (readfds)
-			FD_ZERO(readfds);
-		if (writefds)
-			FD_ZERO(writefds);
-		return -1;
-	}
-
-	/* Overwrite socket sets with our resulting values */
-	if (readfds)
-		memcpy(readfds, &outreadfds, sizeof(fd_set));
-	if (writefds)
-		memcpy(writefds, &outwritefds, sizeof(fd_set));
-	return nummatches;
-}
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index a0ed6aaeaa..1b605d9403 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -492,7 +492,6 @@ extern int	pgkill(int pid, int sig);
 #define listen(s, backlog) pgwin32_listen(s, backlog)
 #define accept(s, addr, addrlen) pgwin32_accept(s, addr, addrlen)
 #define connect(s, name, namelen) pgwin32_connect(s, name, namelen)
-#define select(n, r, w, e, timeout) pgwin32_select(n, r, w, e, timeout)
 #define recv(s, buf, len, flags) pgwin32_recv(s, buf, len, flags)
 #define send(s, buf, len, flags) pgwin32_send(s, buf, len, flags)
 
@@ -501,7 +500,6 @@ extern int	pgwin32_bind(SOCKET s, struct sockaddr *addr, int addrlen);
 extern int	pgwin32_listen(SOCKET s, int backlog);
 extern SOCKET pgwin32_accept(SOCKET s, struct sockaddr *addr, int *addrlen);
 extern int	pgwin32_connect(SOCKET s, const struct sockaddr *name, int namelen);
-extern int	pgwin32_select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, const struct timeval *timeout);
 extern int	pgwin32_recv(SOCKET s, char *buf, int len, int flags);
 extern int	pgwin32_send(SOCKET s, const void *buf, int len, int flags);
 extern int	pgwin32_waitforsinglesocket(SOCKET s, int what, int timeout);
-- 
2.42.0

From 27144fba3c17e3d15e973e250f3ce3151079aea0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 10 Nov 2023 08:59:45 +1300
Subject: [PATCH v2 5/6] Refactor pgwin32_waitforsinglesocket() to share
 events.

This function is hardly used, since sockets in the backend are almost
always in non-blocking mode.  Ideally they would *always* be
non-blocking, and if project policy ever requires that, we could just
delete this and related code.  In the meantime, we have to adjust it to
use the new per-socket event handle, or it could lose network events.

While here, delete the code paths for UDP, which are probably dead code
since we retired the UDP-powered stats collector.  There is another user
of UDP in auth.c, but it uses sendto() and thus does not reach this
code.  Any other user of UDP that I might have missed is unlikely to
generate 'high load' like the old stats system, which presumably
motivated the sleep-and-retry logic.
---
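The acquire/release pairing in the diff below relies on the
reference-counted per-socket entries described in the cover letter.  As
a rough sketch of that idea only, and not the code from this series:
a tiny fixed-size table in which an int stands in for the Windows event
HANDLE.  All names here (socket_table_acquire and friends) are
hypothetical.

```c
#include <stddef.h>

#define SOCKET_TABLE_SIZE 16

typedef struct SocketTableEntry
{
	int			sock;		/* socket number, valid only if refcount > 0 */
	int			refcount;	/* 0 means the slot is free */
	int			event_id;	/* stand-in for a per-socket event HANDLE */
} SocketTableEntry;

static SocketTableEntry socket_table[SOCKET_TABLE_SIZE];
static int	next_event_id = 1;

/* Find the live entry for a socket, or NULL if there is none. */
static SocketTableEntry *
socket_table_find(int sock)
{
	for (size_t i = 0; i < SOCKET_TABLE_SIZE; i++)
	{
		if (socket_table[i].refcount > 0 && socket_table[i].sock == sock)
			return &socket_table[i];
	}
	return NULL;
}

/* Take a reference, creating the entry on first use.  NULL if full. */
SocketTableEntry *
socket_table_acquire(int sock)
{
	SocketTableEntry *entry = socket_table_find(sock);

	if (entry != NULL)
	{
		entry->refcount++;
		return entry;
	}
	for (size_t i = 0; i < SOCKET_TABLE_SIZE; i++)
	{
		if (socket_table[i].refcount == 0)
		{
			entry = &socket_table[i];
			entry->sock = sock;
			entry->refcount = 1;
			/* Real code would create an event object here. */
			entry->event_id = next_event_id++;
			return entry;
		}
	}
	return NULL;
}

/* Drop a reference; at zero the slot becomes free for reuse. */
void
socket_table_release(int sock)
{
	SocketTableEntry *entry = socket_table_find(sock);

	/* Real code would close the event object when the count hits zero. */
	if (entry != NULL)
		entry->refcount--;
}

/* Current reference count for a socket, 0 if it has no entry. */
int
socket_table_refcount(int sock)
{
	SocketTableEntry *entry = socket_table_find(sock);

	return entry != NULL ? entry->refcount : 0;
}
```

In this sketch, two concurrent waiters on the same socket share one
entry, and the entry disappears only when the last one releases it.  A
later acquire for the same socket number starts from a fresh entry,
which is why the socket must not be closed while a reference is held.
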
 src/backend/port/win32/socket.c | 103 ++++++--------------------------
 1 file changed, 17 insertions(+), 86 deletions(-)

diff --git a/src/backend/port/win32/socket.c b/src/backend/port/win32/socket.c
index d0cf08392f..797a1f503e 100644
--- a/src/backend/port/win32/socket.c
+++ b/src/backend/port/win32/socket.c
@@ -238,120 +238,51 @@ pgwin32_poll_signals(void)
 	return 0;
 }
 
-static int
-isDataGram(SOCKET s)
-{
-	int			type;
-	int			typelen = sizeof(type);
-
-	if (getsockopt(s, SOL_SOCKET, SO_TYPE, (char *) &type, &typelen))
-		return 1;
-
-	return (type == SOCK_DGRAM) ? 1 : 0;
-}
-
 int
 pgwin32_waitforsinglesocket(SOCKET s, int what, int timeout)
 {
-	static HANDLE waitevent = INVALID_HANDLE_VALUE;
-	static SOCKET current_socket = INVALID_SOCKET;
-	static int	isUDP = 0;
 	HANDLE		events[2];
 	int			r;
 
-	/* Create an event object just once and use it on all future calls */
-	if (waitevent == INVALID_HANDLE_VALUE)
-	{
-		waitevent = CreateEvent(NULL, TRUE, FALSE, NULL);
-
-		if (waitevent == INVALID_HANDLE_VALUE)
-			ereport(ERROR,
-					(errmsg_internal("could not create socket waiting event: error code %lu", GetLastError())));
-	}
-	else if (!ResetEvent(waitevent))
-		ereport(ERROR,
-				(errmsg_internal("could not reset socket waiting event: error code %lu", GetLastError())));
-
-	/*
-	 * Track whether socket is UDP or not.  (NB: most likely, this is both
-	 * useless and wrong; there is no reason to think that the behavior of
-	 * WSAEventSelect is different for TCP and UDP.)
-	 */
-	if (current_socket != s)
-		isUDP = isDataGram(s);
-	current_socket = s;
+	events[0] = pgwin32_signal_event;
+	events[1] = pgwin32_socket_acquire_event_handle(s);
 
-	/*
-	 * Attach event to socket.  NOTE: we must detach it again before
-	 * returning, since other bits of code may try to attach other events to
-	 * the socket.
-	 */
-	if (WSAEventSelect(s, waitevent, what) != 0)
+	if (events[1] == NULL)
 	{
-		TranslateSocketError();
+		/* errno is set */
 		return 0;
 	}
 
-	events[0] = pgwin32_signal_event;
-	events[1] = waitevent;
-
-	/*
-	 * Just a workaround of unknown locking problem with writing in UDP socket
-	 * under high load: Client's pgsql backend sleeps infinitely in
-	 * WaitForMultipleObjectsEx, pgstat process sleeps in pgwin32_select().
-	 * So, we will wait with small timeout(0.1 sec) and if socket is still
-	 * blocked, try WSASend (see comments in pgwin32_select) and wait again.
-	 */
-	if ((what & FD_WRITE) && isUDP)
+	if (pgwin32_socket_select_events(s, what) < 0)
 	{
-		for (;;)
-		{
-			r = WaitForMultipleObjectsEx(2, events, FALSE, 100, TRUE);
-
-			if (r == WAIT_TIMEOUT)
-			{
-				char		c;
-				WSABUF		buf;
-				DWORD		sent;
-
-				buf.buf = &c;
-				buf.len = 0;
-
-				r = WSASend(s, &buf, 1, &sent, 0, NULL, NULL);
-				if (r == 0)		/* Completed - means things are fine! */
-				{
-					WSAEventSelect(s, NULL, 0);
-					return 1;
-				}
-				else if (WSAGetLastError() != WSAEWOULDBLOCK)
-				{
-					TranslateSocketError();
-					WSAEventSelect(s, NULL, 0);
-					return 0;
-				}
-			}
-			else
-				break;
-		}
+		pgwin32_socket_release_event_handle(s);
+		return 0;
 	}
-	else
-		r = WaitForMultipleObjectsEx(2, events, FALSE, timeout, TRUE);
 
-	WSAEventSelect(s, NULL, 0);
+	pgwin32_socket_prepare_to_wait(s);
+
+	r = WaitForMultipleObjectsEx(2, events, FALSE, timeout, TRUE);
 
 	if (r == WAIT_OBJECT_0 || r == WAIT_IO_COMPLETION)
 	{
+		pgwin32_socket_release_event_handle(s);
 		pgwin32_dispatch_queued_signals();
 		errno = EINTR;
 		return 0;
 	}
 	if (r == WAIT_OBJECT_0 + 1)
+	{
+		pgwin32_socket_enumerate_events(s);
+		pgwin32_socket_release_event_handle(s);
 		return 1;
+	}
 	if (r == WAIT_TIMEOUT)
 	{
+		pgwin32_socket_release_event_handle(s);
 		errno = EWOULDBLOCK;
 		return 0;
 	}
+	pgwin32_socket_release_event_handle(s);
 	ereport(ERROR,
 			(errmsg_internal("unrecognized return value from WaitForMultipleObjects: %d (error code %lu)", r, GetLastError())));
 	return 0;
-- 
2.42.0

From 86d3a318af19d8fa476e36aae1c6b754912d7c4c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 10 Nov 2023 10:24:32 +1300
Subject: [PATCH v2 6/6] Reinstate "graceful shutdown" changes for Windows.

This reverts commit 29992a6a509b256efc4ac560a1586b51a64b2637.

See the commit messages for 6051857fc and ed52c3707.
---
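For readers who have not met the half-close idiom that the reinstated
code relies on: the sender pushes out its last bytes, shuts down its
sending direction, and only then closes, so the peer can read everything
up to a clean EOF.  Below is a minimal POSIX sketch over an AF_UNIX
socketpair(), so it does not reproduce the Windows abortive-shutdown
problem itself; on Windows the corresponding calls are
shutdown(sock, SD_SEND) and closesocket(), as in the diff.  The function
names are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send a final message, half-close, then close.  Returns 0 or -1. */
int
send_then_close_gracefully(int sock, const char *msg)
{
	size_t		len = strlen(msg);
	size_t		off = 0;

	while (off < len)
	{
		ssize_t		n = send(sock, msg + off, len - off, 0);

		if (n < 0)
		{
			close(sock);
			return -1;
		}
		off += (size_t) n;
	}

	/* Tell the peer no more data is coming; queued data is still delivered. */
	if (shutdown(sock, SHUT_WR) < 0)
	{
		close(sock);
		return -1;
	}
	return close(sock);
}

/* Peer side: read until EOF.  Returns bytes received, or -1 on failure. */
ssize_t
drain_until_eof(int sock, char *buf, size_t buflen)
{
	size_t		total = 0;

	while (total < buflen)
	{
		ssize_t		n = recv(sock, buf + total, buflen - total, 0);

		if (n < 0)
			return -1;
		if (n == 0)
			return (ssize_t) total;	/* clean EOF after all data */
		total += (size_t) n;
	}
	return -1;					/* buffer filled before EOF arrived */
}

/* Demo: the last message survives the sender's close.  Returns its length. */
int
demo_graceful_close(char *out, size_t outlen)
{
	int			sv[2];
	ssize_t		n;

	if (outlen == 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (send_then_close_gracefully(sv[0], "FATAL: terminating") < 0)
	{
		close(sv[1]);
		return -1;
	}
	n = drain_until_eof(sv[1], out, outlen - 1);
	close(sv[1]);
	if (n < 0)
		return -1;
	out[n] = '\0';
	return (int) n;
}
```

The property the patch wants is the one the demo checks: the peer sees
the complete final message followed by EOF, rather than a reset that
discards it.
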
 src/backend/libpq/pqcomm.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 522584e597..0ca93fefc8 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -280,15 +280,30 @@ socket_close(int code, Datum arg)
 		secure_close(MyProcPort);
 
 		/*
-		 * Formerly we did an explicit close() here, but it seems better to
-		 * leave the socket open until the process dies.  This allows clients
-		 * to perform a "synchronous close" if they care --- wait till the
-		 * transport layer reports connection closure, and you can be sure the
-		 * backend has exited.
+		 * On most platforms, we leave the socket open until the process dies.
+		 * This allows clients to perform a "synchronous close" if they care
+		 * --- wait till the transport layer reports connection closure, and
+		 * you can be sure the backend has exited.  Saves a kernel call, too.
 		 *
-		 * We do set sock to PGINVALID_SOCKET to prevent any further I/O,
-		 * though.
+		 * However, that does not work on Windows: if the kernel closes the
+		 * socket it will invoke an "abortive shutdown" that discards any data
+		 * not yet sent to the client.  (This is a flat-out violation of the
+		 * TCP RFCs, but count on Microsoft not to care about that.)  To get
+		 * the spec-compliant "graceful shutdown" behavior, we must invoke
+		 * closesocket() explicitly.  When using OpenSSL, it seems that clean
+		 * shutdown also requires an explicit shutdown() call.
+		 *
+		 * This code runs late enough during process shutdown that we should
+		 * have finished all externally-visible shutdown activities, so that
+		 * in principle it's good enough to act as a synchronous close on
+		 * Windows too.  But it's a lot more fragile than the other way.
 		 */
+#ifdef WIN32
+		shutdown(MyProcPort->sock, SD_SEND);
+		closesocket(MyProcPort->sock);
+#endif
+
+		/* In any case, set sock to PGINVALID_SOCKET to prevent further I/O */
 		MyProcPort->sock = PGINVALID_SOCKET;
 	}
 }
-- 
2.42.0
