[ https://issues.apache.org/jira/browse/TS-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606710#comment-14606710 ]
ASF GitHub Bot commented on TS-3714:
------------------------------------

GitHub user sudheerv opened a pull request:

    https://github.com/apache/trafficserver/pull/237

[TS-3714]: Summary of changes below:
a) Issue an SSL_read right after the SSL handshake, to ensure data already in the SSL buffers is not lost.
b) Add the vc to the net thread's read_ready_list immediately after accept, to ensure data already in the socket buffers is not lost.
c) Add a timer for the SSL handshake, defaulting to no timer (as today).
d) Fix a number of error cases to correctly release resources.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/trafficserver ts3714

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafficserver/pull/237.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #237

----

commit 8525ab9b69fd41072b3107d2512cc43eb4917738
Author: Sudheer Vinukonda <sudhe...@yahoo-inc.com>
Date:   2015-06-30T00:07:24Z

    [TS-3714]: Summary of changes below:
    a) Issue an SSL_read right after the SSL handshake, to ensure data already in the SSL buffers is not lost.
    b) Add the vc to the net thread's read_ready_list immediately after accept, to ensure data already in the socket buffers is not lost.
    c) Add a timer for the SSL handshake, defaulting to no timer (as today).
    d) Fix a number of error cases to correctly release resources.

----

> TS seems to stall between ua_begin and ua_first_read on some transactions
> resulting in high response times.
> -------------------------------------------------------------------------
>
>                 Key: TS-3714
>                 URL: https://issues.apache.org/jira/browse/TS-3714
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 5.3.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>
> An example slow log showing very high *ua_first_read*:
> {code}
> ERROR: [8624075] Slow Request: client_ip: xx.xx.xx.xxx url: http://xxxxxxxxxxxxxxxxxx.... status: 200 unique id: bytes: 86 fd: 0
> client state: 0 server state: 9
> ua_begin: 0.000 ua_first_read: 42.224 ua_read_header_done: 42.224
> cache_open_read_begin: 42.224 cache_open_read_end: 42.224
> dns_lookup_begin: 42.224 dns_lookup_end: 42.224
> server_connect: 42.224 server_first_read: 42.229 server_read_header_done: 42.229
> server_close: 42.229 ua_begin_write: 42.229 ua_close: 42.229 sm_finish: 42.229
> {code}
>
> Initially, I suspected that it might be caused by browsers connecting early, before sending any bytes to TS. However, this seems to be happening too frequently, and with an unrealistically high delay between *ua_begin* and *ua_first_read*.
>
> I suspect it's caused by the code that disables the read temporarily before calling the *TXN_START* hook and re-enables it after the API call out. The read disable is done via a 0-byte *do_io_read* on the client vc, but the problem is that a valid *mbuf* is still passed. Based on what I am seeing, it's possible this results in actually enabling the *read_vio* all the way up to *ssl_read_from_net*, for instance (if there's a race condition and there were already bytes from the client, resulting in an epoll read event), which would finally disable the read since *ntodo* (nbytes) is 0. However, this may result in the epoll event being lost until new i/o happens from the client. I'm running further experiments to confirm the theory.
>
> In most cases, the read buffer already has some bytes by the time the http session and http sm are created, which makes it just work. But if there's a slight delay in receiving bytes after making a connection, the epoll mechanism should read them; yet, due to the way the temporary read disable is done, the event may be lost (this is because ATS uses epoll's edge-triggered mode).
> Some history on this issue:
> This issue has been a problem for a long time and affects both http and https requests. When it was first reported, our router operations team eventually closed it, indicating that disabling *accept* threads resolved it ([~yzlai] also reported similar observations and conclusions). It's possible that the race condition window is slightly reduced by disabling accept threads, but accept threads are very important to overall performance and throughput. I now notice that the issue still exists (regardless of whether accept threads are enabled or disabled) and am testing further to confirm.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)