Sudheer Vinukonda created TS-3226:
-------------------------------------
Summary: SSL data not read from the socket sometimes causing
transactions to timeout
Key: TS-3226
URL: https://issues.apache.org/jira/browse/TS-3226
Project: Traffic Server
Issue Type: Bug
Components: SSL
Reporter: Sudheer Vinukonda
We have had a problem where some of our origins were complaining of receiving
POST requests with a non-zero Content-Length header but no body (or, sometimes,
a partial body). Because of the way our network is set up, the problem was hard
to isolate: there are multiple hops along the way, and the POST body could be
lost anywhere along the path (e.g. client, DNS, routers/VIPs, edge, data
center). After a lot of debugging, and with the help of some custom-built wire
traces for SSL, we managed to isolate the problem to the ATS hosts running on
our edge layer. The wire traces showed that the POST body is arriving fine, but
is just sitting in the socket and not being read by the POST UA tunnel
producer.
After further investigation, it seems that the producer is issuing the correct
do_io_read for the required number of bytes, but there seems to be a bug in
{{SSLNetVConnection::net_read_io}}, where ntodo is calculated before the mutex
on the read VIO is acquired.
https://github.com/apache/trafficserver/blob/master/iocore/net/SSLNetVConnection.cc#L391
Instrumenting the code with further debug traces showed that, in the failed
transactions, ntodo is "0" when determined before the mutex is acquired,
whereas (s->vio.nbytes - s->vio.ndone) is non-zero after the mutex is acquired.
I do not fully understand how nbytes on the read VIO object can differ before
and after acquiring the mutex, but moving the ntodo calculation after the mutex
seems to have resolved the problem. Note that this is how it is done in the
corresponding function {{read_from_net}} in {{UnixNetVConnection}}.
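
To illustrate the ordering issue described above, here is a minimal,
single-threaded sketch. It is not the ATS code: ToyVIO, toy_do_io_read and the
two read functions are hypothetical stand-ins, and the "interleave" callback is
an assumption used to simulate another thread issuing do_io_read between the
early ntodo calculation and the lock acquisition.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical, simplified stand-in for a read VIO (not the real ATS type).
struct ToyVIO {
  std::mutex mutex;
  int64_t nbytes = 0; // total bytes requested by the consumer
  int64_t ndone  = 0; // bytes already transferred

  int64_t ntodo() const { return nbytes - ndone; }
};

// Stand-in for a producer issuing do_io_read: updates the VIO under its mutex.
void toy_do_io_read(ToyVIO &vio, int64_t nbytes)
{
  std::lock_guard<std::mutex> lock(vio.mutex);
  vio.nbytes = nbytes;
  vio.ndone  = 0;
}

// Ordering described in the report: ntodo is computed BEFORE the lock is
// taken, so an update that lands in between is missed.
int64_t ntodo_computed_before_lock(ToyVIO &vio, const std::function<void()> &interleave)
{
  int64_t ntodo = vio.ntodo(); // snapshot taken without holding the mutex
  interleave();                // simulated concurrent do_io_read
  std::lock_guard<std::mutex> lock(vio.mutex);
  return ntodo;                // stale value
}

// Proposed ordering: take the lock first, then compute ntodo.
int64_t ntodo_computed_after_lock(ToyVIO &vio, const std::function<void()> &interleave)
{
  interleave();                // simulated concurrent do_io_read
  std::lock_guard<std::mutex> lock(vio.mutex);
  return vio.ntodo();          // reflects the latest nbytes/ndone
}
```

With an interleaved toy_do_io_read(vio, 1024) on a fresh ToyVIO, the first
function still reports 0 bytes to do, so a caller would skip the read, while
the second reports the full 1024. This only models the symptom seen in the
debug traces; it does not explain why nbytes changes in the real code path.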
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)