[ 
https://issues.apache.org/jira/browse/TS-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542796#comment-14542796
 ] 

Susan Hinrichs commented on TS-3597:
------------------------------------

Ok, I spoke too soon about being able to reproduce the problem.  In my dev 
environment, I get no TCP handshake completion if I turn off the accept_thread.
In reverse proxy mode, I get an assert in UnixNetVConnection::set_enabled 
called from do_io_read because the lock is not held.

This may be what you are seeing in production. Because this isn't a release 
assert, a production build skips the check and carries on without the lock, so 
you would see timing issues during the accept processing rather than an abort. 
Or this is something completely different and unique to my environment.  For 
the record, the commit identified has nothing to do with the TCP accept 
processing.
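
To make the debug-versus-release point concrete, here is a minimal sketch. It 
is not the Traffic Server code: OwnedMutex, Vio, set_enabled_debug_checked and 
set_enabled_checked are hypothetical names made up for illustration. It only 
assumes standard C++ behavior, namely that assert() is compiled out when NDEBUG 
is defined, so a lock-ownership check written that way vanishes in a release 
build.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper that remembers which thread currently holds it.
struct OwnedMutex {
  std::mutex m;
  std::atomic<std::thread::id> holder{std::thread::id{}}; // default id = nobody

  void lock() {
    m.lock();
    holder.store(std::this_thread::get_id());
  }
  void unlock() {
    holder.store(std::thread::id{});
    m.unlock();
  }
  bool held_by_caller() const {
    return holder.load() == std::this_thread::get_id();
  }
};

// Hypothetical stand-in for an I/O descriptor guarded by a mutex.
struct Vio {
  OwnedMutex mutex;
  bool enabled = false;
};

// Debug-style check: aborts in a debug build if the caller lacks the lock.
// With NDEBUG defined the assert disappears and the write happens unguarded.
inline void set_enabled_debug_checked(Vio &vio) {
  assert(vio.mutex.held_by_caller());
  vio.enabled = true;
}

// Always-on variant: refuses to touch the state unless the caller holds the lock.
inline bool set_enabled_checked(Vio &vio) {
  if (!vio.mutex.held_by_caller()) {
    return false;
  }
  vio.enabled = true;
  return true;
}
```

Calling set_enabled_checked without first taking vio.mutex returns false and 
leaves vio.enabled untouched. Calling set_enabled_debug_checked in the same 
situation aborts in a debug build and silently proceeds in an NDEBUG build, 
which is the kind of behavior that could surface as a timing problem.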

For your reading pleasure, here is my stack. Will dig more tomorrow.
{code}
#0  0x000000351e4328a5 in raise () from /lib64/libc.so.6
#1  0x000000351e434085 in abort () from /lib64/libc.so.6
#2  0x00007ffff7dd9c51 in ink_die_die_die () at ink_error.cc:43
#3  0x00007ffff7dd9d08 in ink_fatal_va(const char *, typedef __va_list_tag 
__va_list_tag *) (fmt=0x7ffff7deaa58 "%s:%d: failed assert `%s`", 
ap=0x7fffffffdca0)
    at ink_error.cc:65
#4  0x00007ffff7dd9dd9 in ink_fatal (
    message_format=0x7ffff7deaa58 "%s:%d: failed assert `%s`")
    at ink_error.cc:73
#5  0x00007ffff7dd7876 in _ink_assert (
    expression=0x83a988 "vio->mutex->thread_holding == this_ethread() && 
thread", file=0x83a6be "UnixNetVConnection.cc", line=859) at ink_assert.cc:37
#6  0x000000000078c4bd in UnixNetVConnection::set_enabled (this=0x3c0ed20, 
    vio=0x3c0ee40) at UnixNetVConnection.cc:859
#7  0x000000000078bbb4 in UnixNetVConnection::reenable (this=0x3c0ed20, 
    vio=0x3c0ee40) at UnixNetVConnection.cc:753
#8  0x000000000050d229 in VIO::reenable (this=0x3c0ee40)
    at ../iocore/eventsystem/P_VIO.h:112
#9  0x000000000078b25c in UnixNetVConnection::do_io_read (this=0x3c0ed20, 
    c=0x24a1180, nbytes=4096, buf=0x3357620) at UnixNetVConnection.cc:598
#10 0x00000000005594bd in ProtocolProbeSessionAccept::mainEvent (
    this=0x24a92c0, event=202, data=0x3c0ed20)
    at ProtocolProbeSessionAccept.cc:148
#11 0x000000000050d1d6 in Continuation::handleEvent (this=0x24a92c0, 
    event=202, data=0x3c0ed20) at ../iocore/eventsystem/I_Continuation.h:145
#12 0x00000000007863e8 in NetAccept::acceptFastEvent (this=0x2480960, event=5, 
    ep=0x1ee5160) at UnixNetAccept.cc:465
#13 0x000000000050d1d6 in Continuation::handleEvent (this=0x2480960, event=5, 
    data=0x1ee5160) at ../iocore/eventsystem/I_Continuation.h:145
#14 0x00000000007abcb2 in EThread::process_event (this=0x1bb0000, e=0x1ee5160, 
    calling_code=5) at UnixEThread.cc:128
#15 0x00000000007ac2d3 in EThread::execute (this=0x1bb0000)
    at UnixEThread.cc:252
#16 0x000000000054097e in main (argv=0x7fffffffe398) at Main.cc:1840
{code}

> TLS can fail accept / handshake since commit 2a8bb593fd
> -------------------------------------------------------
>
>                 Key: TS-3597
>                 URL: https://issues.apache.org/jira/browse/TS-3597
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: SSL
>            Reporter: Leif Hedstrom
>            Assignee: Susan Hinrichs
>            Priority: Critical
>             Fix For: 6.0.0
>
>
> At least under certain conditions (slightly unclear, but possibly a race with 
> multiple NUMA nodes), we fail to accept / TLS handshake. I've tracked this 
> down to the commit from 2a8bb593fdd7ca9125efad76e27f3f17f5bca794.
> The commit prior to this does not expose the problem. [~gancho] also 
> discovered that this problem is only triggered when accept thread is off (0).
> Also from [~gancho], when this reproduces, a command like e.g. this will fail 
> the handshake completely (no ciphers):
> {code}
> openssl s_client -connect 10.1.2.3:443 -tls1 -servername some.host.com
> {code}
> Also, since this only happens with accept thread off (0), which implies 
> accept on every ET_NET thread, maybe there's some sort of race condition 
> going on here? That's just wild speculation though.



