On Tue, 1 Jul 2025 at 18:50, Jacob Champion <jacob.champ...@enterprisedb.com> 
wrote:

Not what I said. I'm saying that if a server implementation claims
Postgres compatibility but fails to talk to deployed versions of libpq
in practice, people will roll their eyes, even if other client
implementations work. The most widely deployed implementation tends to
be favored independently of technical quality.

I think the confusion comes from the fact that I understood you to say:
"People don't care about the protocol, if libpq allows it, then it's
good enough." But now I think you might have meant: "If libpq doesn't
allow it, even though the spec suggets it should, then people will still
blame the server implemantion".

It's not up to me whether we do or not. Since the protocol is
underspecified and these corner cases are untested, what really
matters is how many people depend on the underspecified behavior. (I
point to the immediately preceding thread as evidence.)

Agreed. But what I was trying to say was that you need more than just
libpq to behave in an unspecified way. You need some critical mass of
clients to behave in a similar enough unspecified way. i.e. if JDBC
cannot connect to your custom postgres server because you went out of
spec, then users will start complaining to you that your server is
broken.

Sidenote: I checked just now. JDBC did send the all zeros message too
before it added 3.2 cancel support. But afaict the 3.2 support
introduced an issue where it will call castNonNull on a null pointer if
used on a server that does not send BackendKeyData. CC-ing Dave Cramer
because that sounds like a bug his employer would probably like fixed
because that would impact the RDS Proxy.

But again: we do not enforce or test this behavior, 

I'd love to have your protocol test suite to be able to add
automated tests for this ;)

so if the revert> happens for 3.0 later, we all have to watch like hawks to 
make sure
that 3.2 is not affected. I need more buy-in for that from people who
are not me.

Okay, but to be clear: You do agree with this approach? (assuming others
will agree too)

Well... okay. I'm reasoning based on what's committed.

Attached is a v2 patchset that addresses this, as well as all the other
changes previously discussed. I tested that the behaviour is as intended
by modifying the PG sources locally.

Personally, I think it's more likely that any new server
implementations with alternative cancellation requirements will start
to silently couple against the new 3.0 behavior. I don't believe for a
minute that third parties "would not dare" to do literally anything
that works in practice.

Maybe in some years, yes. But it seems rather unlikely that people would
start doing that now, because that would mean that old clients that
request 3.0 would suddenly not be able to connect because they get a
"malformed" BackendKeyData message.

if we want other implementations to behave a
certain way then we have to enforce it in the code.

Agreed

P.S. Heiki assigned himself to the open items I created now.
From 973e7204cb6d99450c003cfe729586287a500359 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Wed, 25 Jun 2025 08:36:51 +0200
Subject: [PATCH v2 1/2] libpq: Complain about missing BackendKeyData later

It turns out that some third party backend implementations^1 don't send
BackendKeyData as a way to indicate that they don't. While the protocol
docs left it up for interpretation if that is valid behavior, libpq has
been accepting it up until the libpq shipped with PG17. It does not seem
like that the libpq behavior was intentional though, since it did so by
sending CancelRequest messages with all zeros to such servers (instead
of returning an error or making the cancel a no-op).

For PG18 this behaviour was changed to return an error when trying to
create the cancel object (either a PGcancel or PGcancelConn). This
wasn't done with any discussion, but was done as part of supporting
different lengths of cancel packets for the new 3.2 version of the
protocol.

This commit changes the behavior once more to only return an error when
the cancel object is actually used to send a cancellation, instead of
when merely creating the object. The reason to do so is that some
clients create such cancel objects as part of their connection creation
logic (thus having the cancel object ready for later when they need it).
So by returning an error when creating that object, the connection
attempt would fail. So by changing when we return the error such clients
will still be able to connect to the third party backend implementations
in question, but when actually trying to cancel a query on one of these
backends the user will be notified that that is not possible for the
server that they are connected to.

^1: AWS RDS Proxy is definitely one of them, and CockroachDB might be
another.
---
 doc/src/sgml/protocol.sgml       | 16 +++++++++++++-
 src/interfaces/libpq/fe-cancel.c | 37 ++++++++++++++++++++++++--------
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 82fe3f93761..0eb96360134 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -535,7 +535,21 @@
         This message provides secret-key data that the frontend must
         save if it wants to be able to issue cancel requests later.
         The frontend should not respond to this message, but should
-        continue listening for a ReadyForQuery message.
+        continue listening for a ReadyForQuery message. The PostgreSQL backend
+        will always send this message before sending the inital ReadyForQuery
+        message but other backend implementations of the protocol may not.
+        Since protocol version 3.2, if the server sent no BackendKeyData, then
+        that means that the backend does not support canceling queries using
+        the CancelRequest messages. In protocol versions before 3.2 the
+        behaviour is undefined if the client receives no BackendKeyData.
+        Up until the libpq shipped in PostgreSQL 17, it would send a
+        CancelRequest with all zeros to 3.0 connections that did not send a
+        BackendKeyData message. Since the libpq shipped with PostgreSQL 18 it
+        does not send any CancelRequest at all for such connections anymore,
+        thus aligning with the behaviour for protocol 3.2. Throwing an error
+        as a client when not receiving BackendKeyData on a 3.0 connection is
+        not recommended, because certain server implementations are known not
+        to send it.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/interfaces/libpq/fe-cancel.c b/src/interfaces/libpq/fe-cancel.c
index 65517c5703b..c7e90135c06 100644
--- a/src/interfaces/libpq/fe-cancel.c
+++ b/src/interfaces/libpq/fe-cancel.c
@@ -86,13 +86,6 @@ PQcancelCreate(PGconn *conn)
 		return (PGcancelConn *) cancelConn;
 	}
 
-	/* Check that we have received a cancellation key */
-	if (conn->be_cancel_key_len == 0)
-	{
-		libpq_append_conn_error(cancelConn, "no cancellation key received");
-		return (PGcancelConn *) cancelConn;
-	}
-
 	/*
 	 * Indicate that this connection is used to send a cancellation
 	 */
@@ -111,7 +104,7 @@ PQcancelCreate(PGconn *conn)
 	 * Copy cancellation token data from the original connection
 	 */
 	cancelConn->be_pid = conn->be_pid;
-	if (conn->be_cancel_key != NULL)
+	if (conn->be_cancel_key_len > 0)
 	{
 		cancelConn->be_cancel_key = malloc(conn->be_cancel_key_len);
 		if (cancelConn->be_cancel_key == NULL)
@@ -206,6 +199,17 @@ PQcancelStart(PGcancelConn *cancelConn)
 	if (!cancelConn || cancelConn->conn.status == CONNECTION_BAD)
 		return 0;
 
+	/*
+	 * Check that we actually have a concel key. We check this here as apposed
+	 * to in PQcancelCreate because users of libpq might call PQcancelCreate
+	 * even when they don't need to cancel a connection.
+	 */
+	if (cancelConn->conn.be_cancel_key_len == 0)
+	{
+		libpq_append_conn_error(&cancelConn->conn, "no cancellation key received");
+		return 0;
+	}
+
 	if (cancelConn->conn.status != CONNECTION_ALLOCATED)
 	{
 		libpq_append_conn_error(&cancelConn->conn,
@@ -379,7 +383,14 @@ PQgetCancel(PGconn *conn)
 
 	/* Check that we have received a cancellation key */
 	if (conn->be_cancel_key_len == 0)
-		return NULL;
+	{
+		/*
+		 * In case there is no cancel key, we return an all-zero PGCancel
+		 * object. Actually calling PQcancel on this will fail, but we allow
+		 * creating the PGCancel object
+		 */
+		return calloc(1, sizeof(PGcancel));
+	}
 
 	cancel_req_len = offsetof(CancelRequestPacket, cancelAuthCode) + conn->be_cancel_key_len;
 	cancel = malloc(offsetof(PGcancel, cancel_req) + cancel_req_len);
@@ -544,6 +555,14 @@ PQcancel(PGcancel *cancel, char *errbuf, int errbufsize)
 		return false;
 	}
 
+	if (cancel->cancel_pkt_len == 0)
+	{
+		strlcpy(errbuf, "PQcancel() -- no cancellation key received", errbufsize);
+		/* strlcpy probably doesn't change errno, but be paranoid */
+		SOCK_ERRNO_SET(save_errno);
+		return false;
+	}
+
 	/*
 	 * We need to open a temporary connection to the postmaster. Do this with
 	 * only kernel calls.

base-commit: fe05430ace8e0b3c945cf581564458a5983a07b6
-- 
2.43.0

From 99144bd01618dee1f6d6f5b95aaa61ceada285db Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Wed, 25 Jun 2025 08:54:15 +0200
Subject: [PATCH v2 2/2] libpq: Be strict about accept cancel key lengths

The protocol documentation states that the maximum length of a cancel key
is 256 bytes. This starts checking for that limit in libpq. Otherwise
third party backend implementations will probably start using more bytes
anyway. We also start requiring that a protocol 3.0 connection does not
send a longer cancel key, to make sure that servers don't start breaking
old 3.0-only clients by accident. Finally this also restricts the
minimum key length to 4 bytes (both in the protocol spec and in the
libpq implementation).
---
 doc/src/sgml/protocol.sgml          |  2 +-
 src/interfaces/libpq/fe-connect.c   |  3 +++
 src/interfaces/libpq/fe-protocol3.c | 26 ++++++++++++++++++++++++--
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 0eb96360134..982f4a8d210 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -4160,7 +4160,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
          message, indicated by the length field.
         </para>
         <para>
-          The maximum key length is 256 bytes. The
+          The minimum and maximum key length are 4 and 256 bytes respectively. The
           <productname>PostgreSQL</productname> server only sends keys up to
           32 bytes, but the larger maximum size allows for future server
           versions, as well as connection poolers and other middleware, to use
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 51a9c416584..f094611fe0d 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4322,6 +4322,9 @@ keep_going:						/* We will come back to here until there is
 				if (PQisBusy(conn))
 					return PGRES_POLLING_READING;
 
+				if (conn->status == CONNECTION_BAD)
+					goto error_return;
+
 				res = PQgetResult(conn);
 
 				/*
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 1599de757d1..29c51ede7ce 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -1547,13 +1547,31 @@ getBackendKeyData(PGconn *conn, int msgLength)
 
 	cancel_key_len = 5 + msgLength - (conn->inCursor - conn->inStart);
 
+	if (cancel_key_len != 4 && conn->pversion == PG_PROTOCOL(3, 0))
+	{
+		libpq_append_conn_error(conn, "received invalid BackendKeyData message: cancel key length %d is different from 4, which is not supported in version 3.0 of the protocol", cancel_key_len);
+		goto failure;
+	}
+
+	if (cancel_key_len < 4)
+	{
+		libpq_append_conn_error(conn, "received invalid BackendKeyData message: cancel key length %d is below minimum of 4 bytes", cancel_key_len);
+		goto failure;
+	}
+
+	if (cancel_key_len > 256)
+	{
+		libpq_append_conn_error(conn, "received invalid BackendKeyData message: cancel key length %d exceeds maximum of 256 bytes", cancel_key_len);
+		goto failure;
+	}
+
 	conn->be_cancel_key = malloc(cancel_key_len);
 	if (conn->be_cancel_key == NULL)
 	{
 		libpq_append_conn_error(conn, "out of memory");
-		/* discard the message */
-		return EOF;
+		goto failure;
 	}
+
 	if (pqGetnchar(conn->be_cancel_key, cancel_key_len, conn))
 	{
 		free(conn->be_cancel_key);
@@ -1562,6 +1580,10 @@ getBackendKeyData(PGconn *conn, int msgLength)
 	}
 	conn->be_cancel_key_len = cancel_key_len;
 	return 0;
+
+failure:
+	conn->status = CONNECTION_BAD;
+	return EOF;
 }
 
 
-- 
2.43.0

Reply via email to