Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-03-11 Thread David Steele
On 3/11/16 1:11 PM, Tom Lane wrote:
> Robert Haas  writes:
>> On Fri, Mar 11, 2016 at 11:36 AM, David Steele  wrote:
>>> It looks like a decision needs to be made here whether to apply this patch,
>>> send it back to the author, or reject it so I'm marking it "Ready for
>>> Committer".
>>>
>>> Robert, since you were participating in this conversation can you have a
>>> look?
> 
>> Who, me?  *looks around*
>>
>> OK, I had a look.  I don't think this is a bug.
> 
> I agree, and Craig's complaint that this change reduces flexibility
> for plugins seems to tilt the balance against it.  I see that Alex
> himself accepted that argument in his last message in the thread
> ().
>
> So let's mark it rejected and move on.

+1 and done.

-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-03-11 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 11, 2016 at 11:36 AM, David Steele  wrote:
>> It looks like a decision needs to be made here whether to apply this patch,
>> send it back to the author, or reject it so I'm marking it "Ready for
>> Committer".
>> 
>> Robert, since you were participating in this conversation can you have a
>> look?

> Who, me?  *looks around*
> 
> OK, I had a look.  I don't think this is a bug.

I agree, and Craig's complaint that this change reduces flexibility
for plugins seems to tilt the balance against it.  I see that Alex
himself accepted that argument in his last message in the thread
().
So let's mark it rejected and move on.

> unexpected PQresultStatus: 8

It is a bit annoying that psql doesn't print "COPY_BOTH" here.
But fixing that is not this patch, either.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-03-11 Thread Robert Haas
On Fri, Mar 11, 2016 at 11:36 AM, David Steele  wrote:
> It looks like a decision needs to be made here whether to apply this patch,
> send it back to the author, or reject it so I'm marking it "Ready for
> Committer".
>
> Robert, since you were participating in this conversation can you have a
> look?

Who, me?  *looks around*

OK, I had a look.  I don't think this is a bug.  The original report
is complaining about this:

=
psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid" 'value');
unexpected PQresultStatus: 8

The last command results in the following output sent to the server log:

ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup callback
=

But that's just because psql knows utterly zero about copy-both mode.
So the problem here isn't that the server is doing anything it
shouldn't be doing; the server code is totally legitimate.  It's just
that the client is stupid.

Now, we could commit this patch anyway.  It won't hurt anything.  But
it won't help very much, either.  The fundamental issue is that
issuing a START_REPLICATION_SLOT command in psql can't be expected to
work.  If you do that, it's either going to fail with a server error,
or it's going to fail with "unexpected PQresultStatus: 8".  Committing
this patch would cause this particular example to fail in the first
way rather than the second one, and I'm not direly opposed to that if
somebody really feels strongly about it.  But it seems like the wrong
thing to focus on.  I think the real question is: why are you
insisting on issuing a command in psql that can NEVER succeed there?
If you want that to work, you're going to need to add copy-both
support to psql, which might be a fun project but has nothing to do
with this patch.

So I am inclined to think it's fine to reject this, unless somebody
wants to argue that there's some advantage to changing it than I am
missing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-03-11 Thread David Steele
On 1/21/16 9:53 AM, Shulgin, Oleksandr wrote:
> On Thu, Jan 21, 2016 at 3:25 PM, Robert Haas  > wrote:
>
>
> So it's true that the client can't unilaterally terminate COPY BOTH
> mode; it can only send CopyDone.  But an error on the server side
> should do so.
>
>
> Hm, you're right.  Even though the server sends COPY_BOTH message
> before the plugin startup, an attempt by the client to actually read
> the data messages results in ErrorResponse handled on the client, then
> the client can re-submit the corrected START_REPLICATION command and
> continue without the need to reconnect.  This cannot be actually
> tested in psql, but I can verify the behavior with my python scripts.
>
> Then I don't have a preference for the early error reporting in this
> case.  If the current behavior potentially allows for more flexible
> error reporting, I'm for keeping it.

It looks like a decision needs to be made here whether to apply this
patch, send it back to the author, or reject it so I'm marking it "Ready
for Committer".

Robert, since you were participating in this conversation can you have a
look?

-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-21 Thread Robert Haas
On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer  wrote:
> It enters COPY BOTH mode before it invokes the startup callback. The client
> has no way to unilaterally terminate COPY BOTH mode and return to the normal
> walsender protocol. The server doesn't do it when there's an ERROR. So the
> only option the client has for recovery is to disconnect and reconnect.

This isn't my understanding of how the protocol works, and it's not
what the documentation says either.

Per http://www.postgresql.org/docs/current/static/protocol-flow.html:

===
In the event of a backend-detected error during copy-both mode, the
backend will issue an ErrorResponse message, discard frontend messages
until a Sync message is received, and then issue ReadyForQuery and
return to normal processing. The frontend should treat receipt of
ErrorResponse as terminating the copy in both directions; no CopyDone
should be sent in this case.
===

So it's true that the client can't unilaterally terminate COPY BOTH
mode; it can only send CopyDone.  But an error on the server side
should do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-21 Thread Shulgin, Oleksandr
On Thu, Jan 21, 2016 at 3:25 PM, Robert Haas  wrote:

> On Wed, Jan 20, 2016 at 2:28 AM, Craig Ringer 
> wrote:
> > It enters COPY BOTH mode before it invokes the startup callback. The
> client
> > has no way to unilaterally terminate COPY BOTH mode and return to the
> normal
> > walsender protocol. The server doesn't do it when there's an ERROR. So
> the
> > only option the client has for recovery is to disconnect and reconnect.
>
> This isn't my understanding of how the protocol works, and it's not
> what the documentation says either.
>
> Per http://www.postgresql.org/docs/current/static/protocol-flow.html:
>
> ===
> In the event of a backend-detected error during copy-both mode, the
> backend will issue an ErrorResponse message, discard frontend messages
> until a Sync message is received, and then issue ReadyForQuery and
> return to normal processing. The frontend should treat receipt of
> ErrorResponse as terminating the copy in both directions; no CopyDone
> should be sent in this case.
> ===
>
> So it's true that the client can't unilaterally terminate COPY BOTH
> mode; it can only send CopyDone.  But an error on the server side
> should do so.
>

Hm, you're right.  Even though the server sends COPY_BOTH message before
the plugin startup, an attempt by the client to actually read the data
messages results in ErrorResponse handled on the client, then the client
can re-submit the corrected START_REPLICATION command and continue without
the need to reconnect.  This cannot be actually tested in psql, but I can
verify the behavior with my python scripts.

Then I don't have a preference for the early error reporting in this case.
If the current behavior potentially allows for more flexible error
reporting, I'm for keeping it.

--
Alex


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-20 Thread Craig Ringer
On 20 January 2016 at 15:28, Craig Ringer  wrote:

>
> For that reason I'd actually like to enter COPY BOTH mode before the
> startup callback, as is currently the case. So I'd like to wrap the
> decoding startup callback in a PG_TRY that catches an ERROR raised by the
> startup callback (if any) and exits COPY BOTH mode before re-throwing.
>
>
I hoped this'd be as simple as:

  PG_TRY();
  {
  logical_decoding_ctx = CreateDecodingContext(
 cmd->startpoint,
cmd->options,

 logical_read_xlog_page,
  WalSndPrepareWrite,
WalSndWriteData);
  }
  PG_CATCH();
  {
  pq_putmessage_noblock('c', NULL, 0);
  pq_flush();
  PG_RE_THROW();
  }
  PG_END_TRY();


but that's doesn't look sufficient, presumably as the client never has a
chance to ack the exit from CopyBoth mode. Or maybe libpq just isn't
prepared to cope with existing COPY then immediately getting an error.

What you have now works and fixes the immediate problem. Preserving the
early start of COPY mode isn't actually useful until we're also prepared to
accept writes from the output plugin's startup callback at that point,
which we are not yet.

Maybe we should just apply your patch as-is, then move the start of COPY
back if/when support for emitting output from within a logical decoding
startup callback gets added.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-19 Thread Craig Ringer
On 5 January 2016 at 18:35, Shulgin, Oleksandr  wrote:


> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
>> 'value');
>> unexpected PQresultStatus: 8
>>
>
I think the point here is that START_REPLICATION issues useful errors and
returns to the normal protocol up until the logical decoding startup
callback is invoked.

It enters COPY BOTH mode before it invokes the startup callback. The client
has no way to unilaterally terminate COPY BOTH mode and return to the
normal walsender protocol. The server doesn't do it when there's an ERROR.
So the only option the client has for recovery is to disconnect and
reconnect.


> I recall Craig Rigner mentioning this issue in context of the
>> pglogical_output plugin, but I thought that was something to do with the
>> startup packet.  The behavior above doesn't strike me as very consistent,
>> we should be able to detect and report errors during output plugin startup
>> and let the client retry with the corrected command as we do for syntax or
>> other errors.
>>
>
I agree. This would also let clients have some limited facility for
negotiating options with the output plugin, albeit via the very clumsy
channel of libpq protocol error message records and their text.




> Looks like the attached patch fixes it for me:
>>
>
For those who didn't read it, the patch moves the start of COPY BOTH mode
to after the logical decoding startup callback is invoked.

This certainly fixes the immediate issue. It also forecloses something else
I'd really like though, which is the ability for plugins to send output to
clients from within their decoding startup hook.

For that reason I'd actually like to enter COPY BOTH mode before the
startup callback, as is currently the case. So I'd like to wrap the
decoding startup callback in a PG_TRY that catches an ERROR raised by the
startup callback (if any) and exits COPY BOTH mode before re-throwing.

Sound reasonable?


I'm looking toward a future where the decoding startup callback can emit a
structured, plugin-specific CopyData-wrapped message with information the
downstream might need to negotiate a connection. It could then exit back to
walsender mode by throwing an ERROR. Or, better, by setting some new
decoding context field that's checked after each callback to see if the
plugin wants to exit COPY BOTH mode and return to walsender protocol mode
without spamming the logs with an ERROR. That way we can finally do two-way
negotiation between downstream and upstream.

(Related to this, I also want to add a new decoding callback that lets the
downstream send arbitrary CopyData back up the wire to the upstream where
it's passed to the decoding callback. This would give the downstream a
backchannel to change settings, request things from the output plugin, etc,
w/o lots of disconnects and reconnects.)

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-06 Thread Shulgin, Oleksandr
On Tue, Jan 5, 2016 at 11:35 AM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de> wrote:
>
>>
>> I didn't look in the code yet, but if someone knows off top of the head
>> the reason to this behavior, I'd be glad to learn it.
>>
>
> Looks like the attached patch fixes it for me:
>

Should I add it to the current commitfest?  It seems to me, the patch is
trivial and unambiguous, so that might be an overkill...

Thanks.
--
Alex


[HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-05 Thread Shulgin, Oleksandr
Hackers,

It looks like there's an inconsistency in error handling during
START_REPLICATION command of replication protocol:

$ psql postgres://localhost/psycopg2test?replication=database
psql (9.6devel)
Type "help" for help.

psycopg2test=# IDENTIFY_SYSTEM;
  systemid   | timeline |  xlogpos  |dbname
-+--+---+--
 6235978519197579707 |1 | 0/2CE0F78 | psycopg2test
(1 row)

psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
"value");
ERROR:  syntax error

1) syntax errors are reported and client can retry with corrected command:

psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
'value');
ERROR:  replication slot name "TEST1" contains invalid character
HINT:  Replication slot names may only contain lower case letters, numbers,
and the underscore character.

2) further errors are reported and we can retry:

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
'value');
ERROR:  replication slot "test1" does not exist

psycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";
 slot_name | consistent_point | snapshot_name | output_plugin
---+--+---+---
 test1 | 0/2CF5340| 088C-1| test_decoding
(1 row)

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
'value');
unexpected PQresultStatus: 8

The last command results in the following output sent to the server log:

ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup
callback

But the client has no way to learn about the error, nor can it restart with
correct one (the server has entered COPY protocol mode):

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
PQexec not allowed during COPY BOTH

I recall Craig Rigner mentioning this issue in context of the
pglogical_output plugin, but I thought that was something to do with the
startup packet.  The behavior above doesn't strike me as very consistent,
we should be able to detect and report errors during output plugin startup
and let the client retry with the corrected command as we do for syntax or
other errors.

I didn't look in the code yet, but if someone knows off top of the head the
reason to this behavior, I'd be glad to learn it.

Cheers.
-- 
*Oleksandr "Alex" Shulgin*
*Database Engineer*

Mobile: +49 160 84-90-639
Email: oleksandr.shul...@zalando.de


Re: [HACKERS] Inconsistent error handling in START_REPLICATION command

2016-01-05 Thread Shulgin, Oleksandr
On Tue, Jan 5, 2016 at 10:39 AM, Shulgin, Oleksandr <
oleksandr.shul...@zalando.de> wrote:

> Hackers,
>
> It looks like there's an inconsistency in error handling during
> START_REPLICATION command of replication protocol:
>
> $ psql postgres://localhost/psycopg2test?replication=database
> psql (9.6devel)
> Type "help" for help.
>
> psycopg2test=# IDENTIFY_SYSTEM;
>   systemid   | timeline |  xlogpos  |dbname
> -+--+---+--
>  6235978519197579707 |1 | 0/2CE0F78 | psycopg2test
> (1 row)
>
> psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
> "value");
> ERROR:  syntax error
>
> 1) syntax errors are reported and client can retry with corrected command:
>
> psycopg2test=# START_REPLICATION SLOT "TEST1" LOGICAL 0/0 ("invalid"
> 'value');
> ERROR:  replication slot name "TEST1" contains invalid character
> HINT:  Replication slot names may only contain lower case letters,
> numbers, and the underscore character.
>
> 2) further errors are reported and we can retry:
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
> 'value');
> ERROR:  replication slot "test1" does not exist
>
> psycopg2test=# CREATE_REPLICATION_SLOT "test1" LOGICAL "test_decoding";
>  slot_name | consistent_point | snapshot_name | output_plugin
> ---+--+---+---
>  test1 | 0/2CF5340| 088C-1| test_decoding
> (1 row)
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
> 'value');
> unexpected PQresultStatus: 8
>
> The last command results in the following output sent to the server log:
>
> ERROR:  option "invalid" = "value" is unknown
> CONTEXT:  slot "test1", output plugin "test_decoding", in the startup
> callback
>
> But the client has no way to learn about the error, nor can it restart
> with correct one (the server has entered COPY protocol mode):
>
> psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
> PQexec not allowed during COPY BOTH
>
> I recall Craig Rigner mentioning this issue in context of the
> pglogical_output plugin, but I thought that was something to do with the
> startup packet.  The behavior above doesn't strike me as very consistent,
> we should be able to detect and report errors during output plugin startup
> and let the client retry with the corrected command as we do for syntax or
> other errors.
>
> I didn't look in the code yet, but if someone knows off top of the head
> the reason to this behavior, I'd be glad to learn it.
>

Looks like the attached patch fixes it for me:

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0 ("invalid"
'value');
ERROR:  option "invalid" = "value" is unknown
CONTEXT:  slot "test1", output plugin "test_decoding", in the startup
callback

psycopg2test=# START_REPLICATION SLOT "test1" LOGICAL 0/0;
unexpected PQresultStatus: 8

There was a similar problem with PHYSICAL replication--if
the requested start LSN is in the future, the error is not reported to the
client and connection is unusable:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
unexpected PQresultStatus: 8

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
PQexec not allowed during COPY BOTH

After the patch:

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 1FF0/0;
ERROR:  requested starting point 1FF0/0 is ahead of the WAL flush position
of this server 0/2DE6C28

psycopg2test=# START_REPLICATION SLOT "test2" PHYSICAL 0/0;
unexpected PQresultStatus: 8


Actually, would it make sense to move the block that sends CopyBothResponse
message to the WalSndLoop() instead?  I think you can't get any closer than
that...

--
Alex
From 4ebfa71a1d4e4d43415ad9c2a8142f45444313ec Mon Sep 17 00:00:00 2001
From: Oleksandr Shulgin 
Date: Tue, 5 Jan 2016 11:34:37 +0100
Subject: [PATCH] Move CopyBothResponse closer to WalSndLoop()

This is required in order for any initialization errors to be properly
reported to the client and to avoid entering COPY BOTH mode
prematurely (the client will not be able to correct the error after
that even if it could learn about it somehow).
---
 src/backend/replication/walsender.c | 39 -
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c03e045..3b83db7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -629,13 +629,6 @@ StartReplication(StartReplicationCmd *cmd)
 		 */
 		WalSndSetState(WALSNDSTATE_CATCHUP);
 
-		/* Send a CopyBothResponse message, and start streaming */
-		pq_beginmessage(, 'W');
-		pq_sendbyte(, 0);
-		pq_sendint(, 0, 2);
-		pq_endmessage();
-		pq_flush();
-
 		/*
 		 * Don't allow a request to stream from a future point in WAL that
 		 * hasn't been flushed to disk in this server yet.
@@ -667,6 +660,18 @@ StartReplication(StartReplicationCmd *cmd)
 		/*