On Mon, 2015-07-13 at 16:19 +0200, Tor Houghton wrote: 
> On Mon, Jul 13, 2015 at 10:52:46PM +0930, Jack Burton wrote:
> > > 
> > > I don't pretend to know httpd (at all), but I'm wondering, what should
> > > fstat(1) say, over time, for the httpd processes?
> > 
> > Thanks Tor -- that was exactly the clue I needed to isolate the
> > problem.
> > 
> > [snip]
> >
> > admin talks to a custom FastCGI daemon, which is most likely the culprit
> > -- I'll debug it tomorrow.
<...> 
> 
> I am not sure you should conclude yet. I don't use FastCGI. ;-}
> 
> Now, as I write, I have 218 open fd's, compared to the 206 or whatever I had
> in my previous post. I've got a few "dangling" :443 streams (the :80 ones
> seem to disappear like they should), and then a bunch of these:

You're absolutely right -- I spoke too soon.

After double-checking that every possible path a request could take
through the custom FastCGI daemon used by admin ends by sending an
FCGI_END_REQUEST record back to httpd (it does), I turned my attention
back to the httpd logs & debug messages gathered.

This time I had my little script check the remote IP addresses of those
socket against all the httpd access logs (not just the current ones) and
where nothing matched there, finally check the httpd debug output too.

Again, only the admin server (the only one here that's Internet-facing)
had stale sockets (all open sockets for redir & portal matched log
entries) -- out of 26 open sockets, 4 matched log entries for current
HTTPS sessions, 2 matched "buffer event error" debug messages and the
other 20 didn't match in either the logs or debug messages.

I still don't know what's causing the buffer event error messages, but
as they accounted for only 2 of the 22 stale sockets, I figured it was
more important to focus on the other 20 first.

So, what sort of HTTPS event doesn't make it into the logs and doesn't
cause any debug messages containing the remote IP address to be emitted
either?

The only thing I could think of was a TCP connection to port 443 where
the remote end doesn't initiate a TLS handshake (that's nowhere near as
improbable as it sounds: think a simple port scan, or a network outage
commencing directly after the first ACK).

So, as a test I tried just that: establishing a TCP session from a
remote host then closing it without sending anything at all at layer 5.

Naturally, doing that where httpd expects plain HTTP causes only a
single debug message to be emitted ("...done"), and the socket gets
closed as expected.

But doing it where httpd expects HTTPS and the local side of the socket
remains open, nothing appears in the regular logs, and nothing
identifiable by remote IP address appears in debug messages either.

Trying to match log/debug entries that aren't identified by the remote
IP address on a host with even a modest amount of traffic struck me as
an exercise in futility, so I tried the same experiment on another host
(also running 5.7-stable) with no other load on httpd at all.

Result was the same: httpd did not close the socket or log anything in
the regular logs. However, one debug message was emitted, our old friend
"server_accept_tls: TLS accept failed - (null)"...

...which brings us right back to where this thread started.

Looking at the source, server_accept_tls() handles two types of
non-recoverable error condition: timeout after retry and outright
failure. In the first case (EV_TIMEOUT), server_accept_tls() calls
server_close() (which in turn calls server_close_http(), which closes
the socket) before returning; in the second case it does not.

I believe this is the bug we've been looking for.

The fix is trivial -- see attached patch (against 5.7-stable -- sorry,
I don't have any hosts running -current at present).

That works for me (tested here on two hosts: sparc64 with test load
only; and amd64 with modest production load).

Not sure if that's the best approach or not, but now that we've at
least established root cause, if there's a better way I'm sure someone
else on the list will point it out.

[demime 1.01d removed an attachment of type text/x-patch which had a name of 
httpd_server_accept_tls.patch"; charset="UTF-8]

Reply via email to