[ https://issues.apache.org/jira/browse/TS-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sudheer Vinukonda reassigned TS-3226:
-------------------------------------

    Assignee: Sudheer Vinukonda

> SSL data not read from the socket sometimes causing transactions to timeout
> ---------------------------------------------------------------------------
>
>                 Key: TS-3226
>                 URL: https://issues.apache.org/jira/browse/TS-3226
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: SSL
>    Affects Versions: 5.1.1
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 5.2.0
>
>
> We have had a long-standing problem where some of our origins complained of
> receiving POST requests with a non-zero Content-Length header but no body
> (or sometimes only a partial body). Because of the way our network is set
> up, with multiple hops along the way, this problem was not easy to isolate.
> The POST body could be lost anywhere along the path (e.g. client, DNS,
> routers/VIPs, edge, data center, etc.). After a lot of debugging, and with
> the help of some custom-built wire traces for SSL, we managed to isolate the
> problem to the ATS hosts running on our edge layer. The wire traces show
> that the POST body arrives correctly, but just sits in the socket and is
> never read by the POST UA tunnel producer.
> After further investigation, it seems that the producer issues the correct
> do_io_read for the required number of bytes, but there appears to be a bug
> in {{SSLNetVConnection::net_read_io}}, where ntodo is calculated before
> acquiring the mutex on the read vio.
> https://github.com/apache/trafficserver/blob/master/iocore/net/SSLNetVConnection.cc#L391
> Instrumenting the code with further debug traces showed that, in the failed
> transactions, ntodo is "0" when determined before the mutex is acquired,
> whereas (s->vio.nbytes - s->vio.ndone) is non-zero after the mutex is
> acquired.
> I do not fully understand how nbytes on the read vio object can differ
> before and after acquiring the mutex, but moving the ntodo calculation to
> after the mutex acquisition seems to have resolved the problem. Note that
> this is how it is done in the corresponding function {{read_from_net}} in
> {{UnixNetVConnection}}.
> Talking to [~amc] on IRC, it seems that the mutex is needed because
> {{SSLNetVConnection::net_read_io}} could also be triggered by incoming
> socket data before {{UnixNetVConnection::do_io_read}} triggers it, and
> that could mess up the nbytes/ndone in the read vio.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
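
The ordering problem described in the report can be illustrated with a small, self-contained sketch. This is not the Traffic Server code: ReadVio, do_io_read, ntodo_before_lock and ntodo_after_lock are hypothetical stand-ins, std::mutex replaces the real vio mutex, and the `interleave` callback is an assumed way of simulating, deterministically, another thread updating the vio between the unlocked computation and the lock acquisition. It only shows why a value computed before taking the lock can be stale.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical stand-in for a read VIO; not the actual ATS class.
struct ReadVio {
  std::mutex mutex;
  int64_t nbytes = 0; // total bytes the consumer asked for
  int64_t ndone  = 0; // bytes already delivered
};

// Models a producer arming the read: updates the VIO under its mutex.
inline void do_io_read(ReadVio &vio, int64_t nbytes)
{
  assert(nbytes >= 0);
  std::lock_guard<std::mutex> lock(vio.mutex);
  vio.nbytes = nbytes;
  vio.ndone  = 0;
}

// Problematic ordering: ntodo is computed from an unlocked read.
// `interleave` simulates work done by another thread before we get the lock.
inline int64_t ntodo_before_lock(ReadVio &vio, const std::function<void()> &interleave)
{
  int64_t ntodo = vio.nbytes - vio.ndone; // may already be stale
  interleave();
  std::lock_guard<std::mutex> lock(vio.mutex);
  return ntodo; // the stale value is used even though the VIO has changed
}

// Ordering described as the fix: take the lock first, then compute ntodo.
inline int64_t ntodo_after_lock(ReadVio &vio, const std::function<void()> &interleave)
{
  interleave();
  std::lock_guard<std::mutex> lock(vio.mutex);
  return vio.nbytes - vio.ndone; // reflects the state protected by the lock
}
```

With a fresh ReadVio and an `interleave` callback that calls do_io_read(vio, 1024), the first function returns 0 (so a reader would conclude there is nothing to read), while the second returns 1024. A real multi-threaded reproduction would depend on timing; the callback is used here only to make the interleaving repeatable.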