On Tue, Mar 5, 2019 at 10:08 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
> > Thomas Munro <thomas.mu...@gmail.com> writes:
> >> That suggests that we could perhaps handle ECONNRESET both at startup
> >> packet send time (for certificate rejection, eelpout's case) and at
> >> initial query send (for idle timeout, bug #15598's case) by attempting
> >> to read. Does that make sense?
> >
> > Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> > failure means there is nothing left to *read*.
>
> After staring at the code for awhile, I am thinking that there may be
> a bug of that ilk, but if so it's down inside OpenSSL. Perhaps it's
> specific to the OpenSSL version you're using on eelpout? There is
> not anything we could do differently in libpq, AFAICS, because it's
> OpenSSL's responsibility to read any data that might be available.
>
> I also looked into the idea that we're doing something wrong on the
> server side, allowing the final error message to not get flushed out.
> A plausible theory there is that SSL_shutdown is returning a WANT_READ
> or WANT_WRITE error and we should retry it ... but that doesn't square
> with your observation upthread that it's returning SSL_ERROR_SSL.
>
> It's all very confusing, but I think there's a nontrivial chance
> that this is an OpenSSL bug, especially since we haven't been able
> to replicate it elsewhere.

Hmm. Yes, it is strange that we haven't seen it elsewhere, but
remember that very few animals are running the ssl tests; also it
requires particular timing to hit.

OK, here's something. I can reproduce it quite easily on this
machine, and I can fix it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index f29202db5f..e9c137f1bd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2705,6 +2705,7 @@ keep_going: /* We will come back to here until there is
                               libpq_gettext("could not send startup packet: %s\n"),
                               SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
             free(startpacket);
+            pqHandleSendFailure(conn);
             goto error_return;
         }

If I add some printf debugging in there, I can see that block being
reached every hundred or so times I try to connect with a revoked
certificate, and with that extra call to pqHandleSendFailure() in
there the error comes out as it should:

psql: SSL error: sslv3 alert certificate revoked

Now I'm confused because we already have handling like that in
PQsendQuery(), so I can't explain bug #15598.

-- 
Thomas Munro
https://enterprisedb.com
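
PS: To illustrate the general point that a failed *write* does not mean
there is nothing left to *read*, here is a minimal standalone sketch.
It is not libpq or OpenSSL code: it uses a plain AF_UNIX socketpair
rather than TLS over TCP, and the names send_or_drain() and
demo_peer_rejects() are hypothetical, made up for this example. The
"server" end queues a final message and closes; the "client" end then
fails to send, and only learns why if it tries to read afterwards.

```c
/* Sketch only: not libpq code. All function names are hypothetical. */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to send msg on fd. If the send fails, read whatever the peer
 * queued before going away into errbuf (always NUL-terminated) and
 * return -1. Return 0 if the send succeeded.
 */
int
send_or_drain(int fd, const char *msg, char *errbuf, size_t errbuf_len)
{
    ssize_t n;

    assert(errbuf_len > 0);
    errbuf[0] = '\0';

    if (send(fd, msg, strlen(msg), 0) >= 0)
        return 0;

    /* The write failed, but the peer's last words may still be queued. */
    n = recv(fd, errbuf, errbuf_len - 1, MSG_DONTWAIT);
    errbuf[n > 0 ? n : 0] = '\0';
    return -1;
}

/*
 * Simulate a server that reports an error and hangs up before the
 * client has written anything. Returns the result of send_or_drain(),
 * or -2 if the simulation itself could not be set up.
 */
int
demo_peer_rejects(char *errbuf, size_t errbuf_len)
{
    static const char farewell[] = "certificate revoked";
    int sv[2];
    int rc;

    signal(SIGPIPE, SIG_IGN);   /* report EPIPE instead of killing us */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -2;

    /* "Server" side: queue the final message, then close. */
    if (write(sv[1], farewell, strlen(farewell)) != (ssize_t) strlen(farewell))
    {
        close(sv[0]);
        close(sv[1]);
        return -2;
    }
    close(sv[1]);

    /* "Client" side: the send fails, the read still finds the message. */
    rc = send_or_drain(sv[0], "startup packet", errbuf, errbuf_len);
    close(sv[0]);
    return rc;
}
```

Calling demo_peer_rejects(buf, sizeof(buf)) should, on the Unix-like
systems I'd expect this to build on, return -1 and leave the peer's
message in buf. A real TCP connection that fails with ECONNRESET may
behave differently, so treat this as an analogy for the idea only.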