Re: GSSENC'ed connection stalls while reconnection attempts.
At Tue, 14 Jul 2020 13:31:31 +0900 (JST), Kyotaro Horiguchi wrote in > Agreed to separate the change from this issue. I also don't think > that change in behavior dramatically improve the situation since we > should have had a bunch of trouble when a write actually failed in the > normal case. > > I'm going to post a patch to change the comment of pqPacketSend. So this is a proposal to add a description about the behavior on write failure. The last half of the addition is a copy from the comment of pqFlush. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 5e01ed1fccce083b1e1c13692d19474378c66db6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 15 Jul 2020 11:27:57 +0900 Subject: [PATCH] Explain pqPacketSend more precisely. The comment explains that the function always returns STATUS_ERRUR for write failure, but this may not be the case. Add description to make it clear that there is such a case. --- src/interfaces/libpq/fe-connect.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 7bee9dd201..af2818add4 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -4505,7 +4505,9 @@ PQrequestCancel(PGconn *conn) * buf, buf_len: contents of message. The given length includes only what * is in buf; the message type and message length fields are added here. * - * RETURNS: STATUS_ERROR if the write fails, STATUS_OK otherwise. + * RETURNS: STATUS_ERROR if the write fails, STATUS_OK otherwise. Note that the + * function may return STATUS_OK even if it failed to write to the underlying + * socket. (See pqSendSome comments about how failure should be handled.) * SIDE_EFFECTS: may block. * * Note: all messages sent with this routine have a length word, whether -- 2.18.4
Re: GSSENC'ed connection stalls while reconnection attempts.
At Mon, 13 Jul 2020 11:08:09 -0400, Tom Lane wrote in > Kyotaro Horiguchi writes: > > At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane wrote > > intgl> >> I'm disinclined to mess with that, because (a) I don't think it's > > the > >> actual source of the problem, and (b) it would affect way more than > >> just GSS mode. > > > If I did that in pqFlush your objection would be right, but > > pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails" > > so not doing that is just wrong. (pqSendSome reported write failure in > > this case.) For other parts in authentication code, I don't think it > > doesn't affect badly because authentication should proceed without any > > read/write overlapping. > > As the comment for pqSendSome says, we report a write failure immediately > only if we also cannot read. I don't really see a reason why the behavior > described there isn't fine during initial connection as well. If you feel > that the comment for pqPacketSend needs adjustment, we can do that. I'm fine with that. > In any case, I'm quite against changing pqPacketSend's behavior because > "it's only used during initial connection"; there is nothing about the > function that restricts it to that case. That sounds fair enough. > Bottom line here is that I'm suspicious of changing the behavior of > the read/write code on the strength of a bug in the GSS state management > logic. If there's a reason to change the read/write code, we should be > able to demonstrate it without the GSS bug. Agreed to separate the change from this issue. I also don't think that change in behavior dramatically improve the situation since we should have had a bunch of trouble when a write actually failed in the normal case. I'm going to post a patch to change the comment of pqPacketSend. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: GSSENC'ed connection stalls while reconnection attempts.
Kyotaro Horiguchi writes: > At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane wrote in >> The attached patch makes this all act more like the way SSL is handled, >> and for me it resolves the reconnection problem. > It looks good to me. OK, thanks. >>> The reason that psql doesn't notice the error is pqPacketSend returns >>> STATUS_OK when write error occurred. That behavior contradicts to the >>> comment of the function. The function is used only while making >>> connection so it's ok to error out immediately by write failure. I >>> think other usage of pqFlush while making a connection should report >>> write failure the same way. >> I'm disinclined to mess with that, because (a) I don't think it's the >> actual source of the problem, and (b) it would affect way more than >> just GSS mode. > If I did that in pqFlush your objection would be right, but > pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails" > so not doing that is just wrong. (pqSendSome reported write failure in > this case.) For other parts in authentication code, I don't think it > doesn't affect badly because authentication should proceed without any > read/write overlapping. As the comment for pqSendSome says, we report a write failure immediately only if we also cannot read. I don't really see a reason why the behavior described there isn't fine during initial connection as well. If you feel that the comment for pqPacketSend needs adjustment, we can do that. In any case, I'm quite against changing pqPacketSend's behavior because "it's only used during initial connection"; there is nothing about the function that restricts it to that case. Bottom line here is that I'm suspicious of changing the behavior of the read/write code on the strength of a bug in the GSS state management logic. If there's a reason to change the read/write code, we should be able to demonstrate it without the GSS bug. regards, tom lane
Re: GSSENC'ed connection stalls while reconnection attempts.
At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane wrote in > Kyotaro Horiguchi writes: > > If psql connected using GSSAPI auth and server restarted, reconnection > > sequence stalls and won't return. > > Yeah, reproduced here. (I wonder if there's any reasonable way to > exercise this scenario in src/test/kerberos/.) > > > I found that psql(libpq) sends startup packet via gss > > encryption. conn->gssenc should be reset when encryption state is > > freed. > > Actually, it looks to me like the GSS support was wedged in by somebody > who was paying no attention to how SSL is managed, or else we forgot > to pay attention to GSS the last time we rearranged SSL support. It's > completely broken for the multiple-host-addresses scenario as well, > because try_gss is being set and cleared in the wrong places altogether. > conn->gcred is not being handled correctly either I think --- we need > to make sure that it's dropped in pqDropConnection. > > The attached patch makes this all act more like the way SSL is handled, > and for me it resolves the reconnection problem. It looks good to me. > > The reason that psql doesn't notice the error is pqPacketSend returns > > STATUS_OK when write error occurred. That behavior contradicts to the > > comment of the function. The function is used only while making > > connection so it's ok to error out immediately by write failure. I > > think other usage of pqFlush while making a connection should report > > write failure the same way. > > I'm disinclined to mess with that, because (a) I don't think it's the > actual source of the problem, and (b) it would affect way more than > just GSS mode. If I did that in pqFlush your objection would be right, but pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails" so not doing that is just wrong. (pqSendSome reported write failure in this case.) For other parts in authentication code, I don't think it doesn't affect badly because authentication should proceed without any read/write overlapping. > > Finally, It's user-friendly if psql shows error message for error on > > reset attempts. (This perhaps should be arguable.) > > Meh, that's changing fairly longstanding behavior that I don't think > we've had many complaints about. Yeah, I haven't seen the message for any other reasons than the absence of server. :p And, I noticed that, in the first place, I would see that message in the next connection attempt from scratch. I agree to you on this point. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: GSSENC'ed connection stalls while reconnection attempts.
I wrote: > Kyotaro Horiguchi writes: >> If psql connected using GSSAPI auth and server restarted, reconnection >> sequence stalls and won't return. > Yeah, reproduced here. (I wonder if there's any reasonable way to > exercise this scenario in src/test/kerberos/.) I tried writing such a test based on the IO::Pty infrastructure used by src/bin/psql/t/010_tab_completion.pl, as attached. It works, but it feels pretty grotty, especially seeing that so much of the patch is copy-and-pasted from 010_tab_completion.pl. I think if we want to have a test like this, it'd be good to work a little harder on refactoring so that more of that code can be shared. My Perl skillz are a bit weak for that, though. regards, tom lane diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl index b3aeea9574..552d2724e7 100644 --- a/src/test/kerberos/t/001_auth.pl +++ b/src/test/kerberos/t/001_auth.pl @@ -13,17 +13,30 @@ use strict; use warnings; + use TestLib; use PostgresNode; use Test::More; +use IPC::Run qw(pump finish timer); +use Data::Dumper; -if ($ENV{with_gssapi} eq 'yes') +# Do nothing unless Makefile has told us that the build is --with-gssapi. +if ($ENV{with_gssapi} ne 'yes') { + plan skip_all => 'GSSAPI/Kerberos not supported by this build'; +} + +# If we don't have IO::Pty, we can't run the test with interactive_psql. +my $have_pty = 1; +eval { require IO::Pty; }; +if ($@) +{ + $have_pty = 0; plan tests => 18; } else { - plan skip_all => 'GSSAPI/Kerberos not supported by this build'; + plan tests => 22; } my ($krb5_bin_dir, $krb5_sbin_dir); @@ -275,6 +288,77 @@ test_query( "gssencmode=require", "sending 100K lines works"); +# Test that libpq can reconnect after a server restart. +if ($have_pty) +{ + # fire up an interactive psql session + my $in = ''; + my $out = ''; + + my $timer = timer(5); + + my $h = $node->interactive_psql( + 'postgres', + \$in, + \$out, + $timer, + extra_params => [ + '-XAtd', + $node->connstr('postgres') + . " host=$host hostaddr=$hostaddr gssencmode=require user=test1" + ]); + + like( + $out, + qr/GSSAPI-encrypted connection/, + "startup banner shows GSSAPI encryption"); + + # subroutine to send a command and wait for response + sub interactive_command + { + my ($send, $pattern, $annotation) = @_; + + # report test failures from caller location + local $Test::Builder::Level = $Test::Builder::Level + 1; + + # reset output collector + $out = ""; + # restart per-command timer + $timer->start(5); + # send the data to be sent + $in .= $send; + # wait ... + pump $h until ($out =~ $pattern || $timer->is_expired); + my $okay = ($out =~ $pattern && !$timer->is_expired); + ok($okay, $annotation); + # for debugging, log actual output if it didn't match + local $Data::Dumper::Terse = 1; + local $Data::Dumper::Useqq = 1; + diag 'Actual output was ' + . Dumper($out) + . "Did not match \"$pattern\"\n" + if !$okay; + return; + } + + interactive_command("SELECT 2+2;\n", qr/4/, "interactive psql is alive"); + + $node->restart; + + interactive_command( + "SELECT 2+2;\n", + qr/The connection to the server was lost.*GSSAPI-encrypted connection/s, + "startup banner shows GSSAPI encryption after reconnection"); + + interactive_command("SELECT 20+22;\n", qr/42/, "psql is still alive"); + + # send psql an explicit \q to shut it down, else pty won't close properly + $timer->start(5); + $in .= "\\q\n"; + finish $h or die "psql returned $?"; + $timer->reset; +} + unlink($node->data_dir . '/pg_hba.conf'); $node->append_conf('pg_hba.conf', qq{hostgssenc all all $hostaddr/32 gss map=mymap});
Re: GSSENC'ed connection stalls while reconnection attempts.
Kyotaro Horiguchi writes: > If psql connected using GSSAPI auth and server restarted, reconnection > sequence stalls and won't return. Yeah, reproduced here. (I wonder if there's any reasonable way to exercise this scenario in src/test/kerberos/.) > I found that psql(libpq) sends startup packet via gss > encryption. conn->gssenc should be reset when encryption state is > freed. Actually, it looks to me like the GSS support was wedged in by somebody who was paying no attention to how SSL is managed, or else we forgot to pay attention to GSS the last time we rearranged SSL support. It's completely broken for the multiple-host-addresses scenario as well, because try_gss is being set and cleared in the wrong places altogether. conn->gcred is not being handled correctly either I think --- we need to make sure that it's dropped in pqDropConnection. The attached patch makes this all act more like the way SSL is handled, and for me it resolves the reconnection problem. > The reason that psql doesn't notice the error is pqPacketSend returns > STATUS_OK when write error occurred. That behavior contradicts to the > comment of the function. The function is used only while making > connection so it's ok to error out immediately by write failure. I > think other usage of pqFlush while making a connection should report > write failure the same way. I'm disinclined to mess with that, because (a) I don't think it's the actual source of the problem, and (b) it would affect way more than just GSS mode. > Finally, It's user-friendly if psql shows error message for error on > reset attempts. (This perhaps should be arguable.) Meh, that's changing fairly longstanding behavior that I don't think we've had many complaints about. regards, tom lane diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 27c9bb46ee..7bee9dd201 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -477,6 +477,11 @@ pqDropConnection(PGconn *conn, bool flushInput) { OM_uint32 min_s; + if (conn->gcred != GSS_C_NO_CREDENTIAL) + { + gss_release_cred(&min_s, &conn->gcred); + conn->gcred = GSS_C_NO_CREDENTIAL; + } if (conn->gctx) gss_delete_sec_context(&min_s, &conn->gctx, GSS_C_NO_BUFFER); if (conn->gtarg_nam) @@ -496,6 +501,7 @@ pqDropConnection(PGconn *conn, bool flushInput) free(conn->gss_ResultBuffer); conn->gss_ResultBuffer = NULL; } + conn->gssenc = false; } #endif #ifdef ENABLE_SSPI @@ -2027,11 +2033,6 @@ connectDBStart(PGconn *conn) */ resetPQExpBuffer(&conn->errorMessage); -#ifdef ENABLE_GSS - if (conn->gssencmode[0] == 'd') /* "disable" */ - conn->try_gss = false; -#endif - /* * Set up to try to connect to the first host. (Setting whichhost = -1 is * a bit of a cheat, but PQconnectPoll will advance it to 0 before @@ -2468,6 +2469,9 @@ keep_going: /* We will come back to here until there is conn->allow_ssl_try = (conn->sslmode[0] != 'd'); /* "disable" */ conn->wait_ssl_try = (conn->sslmode[0] == 'a'); /* "allow" */ #endif +#ifdef ENABLE_GSS + conn->try_gss = (conn->gssencmode[0] != 'd'); /* "disable" */ +#endif reset_connection_state_machine = false; need_new_connection = true; @@ -3349,12 +3353,8 @@ keep_going: /* We will come back to here until there is */ if (conn->gssenc && conn->gssencmode[0] == 'p') { - OM_uint32 minor; - /* postmaster expects us to drop the connection */ conn->try_gss = false; - conn->gssenc = false; - gss_delete_sec_context(&minor, &conn->gctx, NULL); pqDropConnection(conn, true); conn->status = CONNECTION_NEEDED; goto keep_going; @@ -3906,9 +3906,6 @@ makeEmptyPGconn(void) conn->verbosity = PQERRORS_DEFAULT; conn->show_context = PQSHOW_CONTEXT_ERRORS; conn->sock = PGINVALID_SOCKET; -#ifdef ENABLE_GSS - conn->try_gss = true; -#endif /* * We try to send at least 8K at a time, which is the usual size of pipe @@ -4065,22 +4062,6 @@ freePGconn(PGconn *conn) free(conn->gsslib); if (conn->connip) free(conn->connip); -#ifdef ENABLE_GSS - if (conn->gcred != GSS_C_NO_CREDENTIAL) - { - OM_uint32 minor; - - gss_release_cred(&minor, &conn->gcred); - conn->gcred = GSS_C_NO_CREDENTIAL; - } - if (conn->gctx) - { - OM_uint32 minor; - - gss_delete_sec_context(&minor, &conn->gctx, GSS_C_NO_BUFFER); - conn->gctx = NULL; - } -#endif /* Note that conn->Pfdebug is not ours to close or free */ if (conn->last_query) free(conn->last_query);
GSSENC'ed connection stalls while reconnection attempts.
Hello. If psql connected using GSSAPI auth and server restarted, reconnection sequence stalls and won't return. I found that psql(libpq) sends startup packet via gss encryption. conn->gssenc should be reset when encryption state is freed. The reason that psql doesn't notice the error is pqPacketSend returns STATUS_OK when write error occurred. That behavior contradicts to the comment of the function. The function is used only while making connection so it's ok to error out immediately by write failure. I think other usage of pqFlush while making a connection should report write failure the same way. Finally, It's user-friendly if psql shows error message for error on reset attempts. (This perhaps should be arguable.) The attached does the above. Any thoughts and/or opinions are welcome. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 8cc5663f31744b384cf013f86062ea28b431baa6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 10 Jul 2020 17:30:11 +0900 Subject: [PATCH] Fix reconnection failure of GSSENC connections A connection using GSS encryption fails to reconnect and freezes. Fix that by resetting GSS encryption state on dropping a connection. On the way fixing that, fix the behavior for write failure while connection attempts. Also let psql report the cause of reset attempt failure. --- src/bin/psql/common.c | 2 +- src/interfaces/libpq/fe-auth.c| 7 ++- src/interfaces/libpq/fe-connect.c | 13 ++--- 3 files changed, 17 insertions(+), 5 deletions(-) diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c index 6323a35c91..d5ab1d8ada 100644 --- a/src/bin/psql/common.c +++ b/src/bin/psql/common.c @@ -310,7 +310,7 @@ CheckConnection(void) OK = ConnectionUp(); if (!OK) { - fprintf(stderr, _("Failed.\n")); + fprintf(stderr, _("Failed: %s\n"), PQerrorMessage(pset.db)); /* * Transition to having no connection. Keep this bit in sync with diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c index 9f5403d74c..3664ee10a0 100644 --- a/src/interfaces/libpq/fe-auth.c +++ b/src/interfaces/libpq/fe-auth.c @@ -588,7 +588,12 @@ pg_SASL_init(PGconn *conn, int payloadlen) } if (pqPutMsgEnd(conn)) goto error; - if (pqFlush(conn)) + + /* + * We must not fail write here. Write failure needs to be checked + * explicitly. See pqSendSome. + */ + if (pqFlush(conn) || conn->write_failed) goto error; termPQExpBuffer(&mechanism_buf); diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 27c9bb46ee..f897003f42 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -496,6 +496,8 @@ pqDropConnection(PGconn *conn, bool flushInput) free(conn->gss_ResultBuffer); conn->gss_ResultBuffer = NULL; } + + conn->gssenc = false; } #endif #ifdef ENABLE_SSPI @@ -3455,8 +3457,10 @@ keep_going: /* We will come back to here until there is * Just make sure that any data sent by pg_fe_sendauth is * flushed out. Although this theoretically could block, it * really shouldn't since we don't send large auth responses. + * Write failure needs to be checked explicitly. See + * pqSendSome. */ -if (pqFlush(conn)) +if (pqFlush(conn) || conn->write_failed) goto error_return; if (areq == AUTH_REQ_OK) @@ -4546,8 +4550,11 @@ pqPacketSend(PGconn *conn, char pack_type, if (pqPutMsgEnd(conn)) return STATUS_ERROR; - /* Flush to ensure backend gets it. */ - if (pqFlush(conn)) + /* + * Flush to ensure backend gets it. Write failure needs to be checked + * explicitly. See pqSendSome. + */ + if (pqFlush(conn) || conn->write_failed) return STATUS_ERROR; return STATUS_OK; -- 2.18.4