On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:

> Hackers,
>
> It looks like there's an inconsistency in error handling during
> START_REPLICATION command of replication protocol:
>
> $ psql postgres://localhost/psycopg2test?replication=database
> psql (9.6devel)
> Type "help" for help.
>
> psycopg2test=# IDENTIFY_SYSTEM;
>       systemid       | timeline |  xlogpos  |    dbname
> ---------------------+----------+-----------+--------------
>  6235978519197579707 |        1 | 0/2CE0F78 | psycopg2test
> (1 row)
>
> psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
> "value");
> ERROR:  syntax error
>
> 1) syntax errors are reported and client can retry with corrected command:
>
> psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
> 'value');
> ERROR:  replication slot name "TEST1" contains invalid character
> HINT:  Replication slot names may only contain lower case letters,
> numbers, and the underscore character.
>
> 2) further errors are reported and we can retry:
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
> 'value');
> ERROR:  replication slot "test1" does not exist
>
> psycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";
>  slot_name | consistent_point | snapshot_name | output_plugin
> -----------+------------------+---------------+---------------
>  test1     | 0/2CF5340        | 0000088C-1    | test_decoding
> (1 row)
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
> 'value');
> unexpected PQresultStatus: 8
>
> The last command results in the following output sent to the server log:
>
> ERROR:  option "invalid" = "value" is unknown
> CONTEXT:  slot "test1", output plugin "test_decoding", in the startup
> callback
>
> But the client has no way to learn about the error, nor can it restart
> with correct one (the server has entered COPY protocol mode):
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
> PQexec not allowed during COPY BOTH
>
> I recall Craig Rigner mentioning this issue in context of the
> pglogical_output plugin, but I thought that was something to do with the
> startup packet.  The behavior above doesn't strike me as very consistent,
> we should be able to detect and report errors during output plugin startup
> and let the client retry with the corrected command as we do for syntax or
> other errors.
>
> I didn't look in the code yet, but if someone knows off top of the head
> the reason to this behavior, I'd be glad to learn it.
>

Looks like the attached patch fixes it for me:

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
'value');
ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup
callback

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
unexpected PQresultStatus: 8

There was a similar problem with PHYSICAL replication--if
the requested start LSN is in the future, the error is not reported to the
client and connection is unusable:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
unexpected PQresultStatus: 8

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
PQexec not allowed during COPY BOTH

After the patch:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
ERROR:  requested starting point 1FF0/0 is ahead of the WAL flush position
of this server 0/2DE6C28

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
unexpected PQresultStatus: 8


Actually, would it make sense to move the block that sends CopyBothResponse
message to the WalSndLoop() instead?  I think you can't get any closer than
that...

--
Alex
From 4ebfa71a1d4e4d43415ad9c2a8142f45444313ec Mon Sep 17 00:00:00 2001
From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Date: Tue, 5 Jan 2016 11:34:37 +0100
Subject: [PATCH] Move CopyBothResponse closer to WalSndLoop()

This is required in order for any initialization errors to be properly
reported to the client and to avoid entering COPY BOTH mode
prematurely (the client will not be able to correct the error after
that even if it could learn about it somehow).
---
 src/backend/replication/walsender.c | 39 ++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c03e045..3b83db7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -629,13 +629,6 @@ StartReplication(StartReplicationCmd *cmd)
 		 */
 		WalSndSetState(WALSNDSTATE_CATCHUP);
 
-		/* Send a CopyBothResponse message, and start streaming */
-		pq_beginmessage(&buf, 'W');
-		pq_sendbyte(&buf, 0);
-		pq_sendint(&buf, 0, 2);
-		pq_endmessage(&buf);
-		pq_flush();
-
 		/*
 		 * Don't allow a request to stream from a future point in WAL that
 		 * hasn't been flushed to disk in this server yet.
@@ -667,6 +660,18 @@ StartReplication(StartReplicationCmd *cmd)
 		/* Main loop of walsender */
 		replication_active = true;
 
+		/* Send a CopyBothResponse message, and start streaming.
+		 *
+		 * We need to do this after all initialization completes without
+		 * errors, otherwise the error will not be reported to the client and
+		 * the connection will be left in COPY BOTH mode.
+		 */
+		pq_beginmessage(&buf, 'W');
+		pq_sendbyte(&buf, 0);
+		pq_sendint(&buf, 0, 2);
+		pq_endmessage(&buf);
+		pq_flush();
+
 		WalSndLoop(XLogSendPhysical);
 
 		replication_active = false;
@@ -975,13 +980,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 
 	WalSndSetState(WALSNDSTATE_CATCHUP);
 
-	/* Send a CopyBothResponse message, and start streaming */
-	pq_beginmessage(&buf, 'W');
-	pq_sendbyte(&buf, 0);
-	pq_sendint(&buf, 0, 2);
-	pq_endmessage(&buf);
-	pq_flush();
-
 	/* setup state for XLogReadPage */
 	sendTimeLineIsHistoric = false;
 	sendTimeLine = ThisTimeLineID;
@@ -1017,6 +1015,19 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 
 	SyncRepInitConfig();
 
+	/*
+	 * Send a CopyBothResponse message, and start streaming.
+	 *
+	 * We only do this just before the main loop, otherwise an error in output
+	 * plugin startup callback, e.g. invalid option value, will not be
+	 * reported to the client and connection will stay in COPY BOTH mode.
+	 */
+	pq_beginmessage(&buf, 'W');
+	pq_sendbyte(&buf, 0);
+	pq_sendint(&buf, 0, 2);
+	pq_endmessage(&buf);
+	pq_flush();
+
 	/* Main loop of walsender */
 	WalSndLoop(XLogSendLogical);
 
-- 
2.5.0

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to