On Fri, Oct 2, 2020 at 11:30 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
>
> > Attaching v8 patch, please review it.. Both make check and make
> > check-world passes on v8.
>
> Thanks for updating the patch! It basically looks good to me.
> I tweaked the patch as follows.
>
> +               if (!entry->conn ||
> +                       PQstatus(entry->conn) != CONNECTION_BAD ||
>
> With the above change, if entry->conn is NULL, an error is thrown and no new
> connection is reestablished. But why? IMO it's more natural to reestablish
> new connection in that case. So I removed "!entry->conn" from the above
> condition.
>

Yeah, that makes sense.

>
> +               ereport(DEBUG3,
> +                               (errmsg("could not start remote transaction 
> on connection %p",
> +                                entry->conn)),
>
> I replaced errmsg() with errmsg_internal() because the translation of
> this debug message is not necessary.
>

I'm okay with this as we don't have any specific strings that need
translation in the debug message. But, should we also try to have
errmsg_internal in a few other places in connection.c?

                 errmsg("could not obtain message string for remote error"),
                 errmsg("cannot PREPARE a transaction that has
operated on postgres_fdw foreign tables")));
                 errmsg("password is required"),

I see the errmsg() with plain texts in other places in the code base
as well. Is it  that we look at the error message and if it is a plain
text(without database objects or table data), we decide to have no
translation? Or is there any other policy?

>
> +SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE
> +backend_type = 'client backend' AND application_name = 'fdw_retry_check';
> +CALL wait_for_backend_termination();
>
> Since we always use pg_terminate_backend() and wait_for_backend_termination()
> together, I merged them into one function.
>
> I simplied the comments on the regression test.
>

+1. I slightly adjusted comments in connection.c and ran pg_indent to
keep them upt 80 char limit.

>
>   Attached is the updated version of the patch. If this patch is ok,
>   I'd like to mark it as ready for committer.
>

Thanks. Attaching v10 patch. Please have a look.

>
> > I have another question not related to this patch: though we have
> > wait_pid() function, we are not able to use it like
> > pg_terminate_backend() in other modules, wouldn't be nice if we can
> > make it generic under the name pg_wait_pid() and usable across all pg
> > modules?
>
> I thought that, too. But I could not come up with good idea for *real* use 
> case
> of that function. At least that's useful for the regression test, though.
> Anyway, IMO it's worth proposing that and hearing more opinions about that
> from other hackers.
>

Yes it will be useful for testing when coupled with
pg_terminate_backend(). I will post the idea in a separate thread soon
for more thoughts.



With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 68d764690fc0409ee15afbbbd00974e0d43116cf Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddy@enterprisedb.com>
Date: Sat, 3 Oct 2020 05:49:18 +0530
Subject: [PATCH v10] Retry Cached Remote Connections For postgres_fdw

Remote connections are cached for postgres_fdw in the local
backend. There are high chances that the remote backend may
not be available i.e. it can get killed, the subsequent foreign
queries from local backend/session that use cached connection
will fail as the remote backend would have become unavailable.

This patch proposes a retry mechanism. Local backend/session
uses cached connection and if it receives connection bad error,
it deletes the cached entry and retry to get a new connection.
This retry happens only at the start of remote xact.
---
 contrib/postgres_fdw/connection.c             | 53 ++++++++++++++-----
 .../postgres_fdw/expected/postgres_fdw.out    | 48 +++++++++++++++++
 contrib/postgres_fdw/sql/postgres_fdw.sql     | 41 ++++++++++++++
 3 files changed, 130 insertions(+), 12 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf..5fa4b634c3 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -108,6 +108,7 @@ PGconn *
 GetConnection(UserMapping *user, bool will_prep_stmt)
 {
 	bool		found;
+	volatile bool retry_conn = false;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
@@ -159,23 +160,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/* Reject further use of connections which failed abort cleanup. */
 	pgfdw_reject_incomplete_xact_state_change(entry);
 
+retry:
 	/*
 	 * If the connection needs to be remade due to invalidation, disconnect as
-	 * soon as we're out of all transactions.
+	 * soon as we're out of all transactions. Also, if previous attempt to
+	 * start new remote transaction failed on the cached connection,
+	 * disconnect it to retry a new connection.
 	 */
-	if (entry->conn != NULL && entry->invalidated && entry->xact_depth == 0)
+	if ((entry->conn != NULL && entry->invalidated &&
+		 entry->xact_depth == 0) || retry_conn)
 	{
-		elog(DEBUG3, "closing connection %p for option changes to take effect",
-			 entry->conn);
+		if (retry_conn)
+			elog(DEBUG3, "closing connection %p to reestablish a new one",
+				 entry->conn);
+		else
+			elog(DEBUG3, "closing connection %p for option changes to take effect",
+				 entry->conn);
 		disconnect_pg_server(entry);
 	}
 
-	/*
-	 * We don't check the health of cached connection here, because it would
-	 * require some overhead.  Broken connection will be detected when the
-	 * connection is actually used.
-	 */
-
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
 	 * connection.  (If connect_pg_server throws an error, the cache entry
@@ -206,9 +209,35 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	}
 
 	/*
-	 * Start a new transaction or subtransaction if needed.
+	 * We check the health of the cached connection here when starting a new
+	 * remote transaction. If a broken connection is detected in the first
+	 * attempt, we try to reestablish a new connection. If broken connection
+	 * is detected again here, we give up getting a connection.
 	 */
-	begin_remote_xact(entry);
+	PG_TRY();
+	{
+		/* Start a new transaction or subtransaction if needed. */
+		begin_remote_xact(entry);
+		retry_conn = false;
+	}
+	PG_CATCH();
+	{
+		if (PQstatus(entry->conn) != CONNECTION_BAD ||
+			entry->xact_depth > 0 ||
+			retry_conn)
+			PG_RE_THROW();
+		retry_conn = true;
+	}
+	PG_END_TRY();
+
+	if (retry_conn)
+	{
+		ereport(DEBUG3,
+				(errmsg_internal("could not start remote transaction on connection %p",
+								 entry->conn)),
+				errdetail_internal("%s", pchomp(PQerrorMessage(entry->conn))));
+		goto retry;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 10e23d02ed..2c5614073f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8987,3 +8987,51 @@ PREPARE TRANSACTION 'fdw_tpc';
 ERROR:  cannot PREPARE a transaction that has operated on postgres_fdw foreign tables
 ROLLBACK;
 WARNING:  there is no transaction in progress
+-- ===================================================================
+-- reestablish new connection
+-- ===================================================================
+-- Terminate the backend having the specified application_name and wait for
+-- the termination to complete.
+CREATE OR REPLACE PROCEDURE terminate_backend_and_wait(appname text) AS $$
+BEGIN
+    PERFORM pg_terminate_backend(pid) FROM pg_stat_activity
+    WHERE application_name = appname;
+    LOOP
+        PERFORM * FROM pg_stat_activity WHERE application_name = appname;
+        EXIT WHEN NOT FOUND;
+        PERFORM pg_sleep(1), pg_stat_clear_snapshot();
+    END LOOP;
+END;
+$$ LANGUAGE plpgsql;
+-- Change application_name of remote connection to special one
+-- so that we can easily terminate the connection later.
+ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
+SELECT 1 FROM ft1 LIMIT 1;
+ ?column? 
+----------
+        1
+(1 row)
+
+-- Terminate the remote connection.
+CALL terminate_backend_and_wait('fdw_retry_check');
+-- This query should detect the broken connection when starting new remote
+-- transaction, reestablish new connection, and then succeed.
+BEGIN;
+SELECT 1 FROM ft1 LIMIT 1;
+ ?column? 
+----------
+        1
+(1 row)
+
+-- If the query detects the broken connection when starting new remote
+-- subtransaction, it doesn't reestablish new connection and should fail.
+CALL terminate_backend_and_wait('fdw_retry_check');
+SAVEPOINT s;
+SELECT 1 FROM ft1 LIMIT 1;    -- should fail
+ERROR:  server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+CONTEXT:  remote SQL command: SAVEPOINT s2
+COMMIT;
+-- Clean up
+DROP PROCEDURE terminate_backend_and_wait(text);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 78156d10b4..4da1f78956 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2653,3 +2653,44 @@ SELECT count(*) FROM ft1;
 -- error here
 PREPARE TRANSACTION 'fdw_tpc';
 ROLLBACK;
+
+-- ===================================================================
+-- reestablish new connection
+-- ===================================================================
+
+-- Terminate the backend having the specified application_name and wait for
+-- the termination to complete.
+CREATE OR REPLACE PROCEDURE terminate_backend_and_wait(appname text) AS $$
+BEGIN
+    PERFORM pg_terminate_backend(pid) FROM pg_stat_activity
+    WHERE application_name = appname;
+    LOOP
+        PERFORM * FROM pg_stat_activity WHERE application_name = appname;
+        EXIT WHEN NOT FOUND;
+        PERFORM pg_sleep(1), pg_stat_clear_snapshot();
+    END LOOP;
+END;
+$$ LANGUAGE plpgsql;
+
+-- Change application_name of remote connection to special one
+-- so that we can easily terminate the connection later.
+ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
+SELECT 1 FROM ft1 LIMIT 1;
+
+-- Terminate the remote connection.
+CALL terminate_backend_and_wait('fdw_retry_check');
+
+-- This query should detect the broken connection when starting new remote
+-- transaction, reestablish new connection, and then succeed.
+BEGIN;
+SELECT 1 FROM ft1 LIMIT 1;
+
+-- If the query detects the broken connection when starting new remote
+-- subtransaction, it doesn't reestablish new connection and should fail.
+CALL terminate_backend_and_wait('fdw_retry_check');
+SAVEPOINT s;
+SELECT 1 FROM ft1 LIMIT 1;    -- should fail
+COMMIT;
+
+-- Clean up
+DROP PROCEDURE terminate_backend_and_wait(text);
-- 
2.25.1

Reply via email to