[ 
https://issues.apache.org/jira/browse/TS-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175829#comment-14175829
 ] 

Susan Hinrichs commented on TS-3105:
------------------------------------

Got direct access to gdb on a machine experiencing these problems.  It seems 
there were multiple issues at play.  The stack trace above (assert or 
segfault in tunnel_handler_ua) appeared to be caused by doing IO_SHUTDOWN_READ 
on the user agent VC while client_info indicated HTTP_KEEP_ALIVE.  Adding a 
check of client_info.keep_alive before every IO_SHUTDOWN_READ seemed to solve 
that problem.

Then there was the issue of producer_run being called multiple times on the 
same producer.  producer_run performs cleanup at the end and appears to be 
intended to run only once per producer.  I reviewed all the calls to 
tunnel_run and changed many of them from the no-argument form (which runs all 
producers in the tunnel) to the single-argument form (which runs only the 
specified producer).

I also ran into cases of multiple HT_STATIC producers being created on the 
same tunnel (one for the 100 Continue response and then one for the 
inactivity-timeout buffer).  This caused problems because both consumers 
would get associated with the first producer: since HT_STATIC producers don't 
have a real VC, they pass the same constant value when add_consumer is 
called.  We may want to rework add_consumer to take the matching producer as 
the argument rather than the producer's VC (from which it currently looks up 
the producer).  I fixed this by killing the tunnel to clear out the 100 
Continue producer/consumer pair before adding the other HT_STATIC producer.  
Not the loveliest solution; something better should probably be done here.

Finally, there was a case where HttpSM::kill_this calls 
HttpTunnel::deallocate_buffers.  deallocate_buffers trips an assert because 
the HttpTunnel's active flag is still true; this was being triggered from an 
inactivity timeout.  I changed the deallocate_buffers call to 
HttpTunnel::kill_tunnel.  Since we are shutting everything down anyway, 
shutting down the tunnel if it is not already shut down seems perfectly 
reasonable.

I'm going to attach a patch against master.  Will make another patch against 
5.1.0 in the morning.

> Combination of fixes for TS-3084 and TS-3073 causing asserts and segfaults on 
> 5.1 and beyond
> --------------------------------------------------------------------------------------------
>
>                 Key: TS-3105
>                 URL: https://issues.apache.org/jira/browse/TS-3105
>             Project: Traffic Server
>          Issue Type: Bug
>            Reporter: Susan Hinrichs
>            Assignee: Susan Hinrichs
>
> These two patches were run in a production environment on top of 5.0.1 
> without problem for several weeks.  Now running with these patches on top of 
> 5.1 causes either an assert or a segfault.  Another person has reported the 
> same segfault when running master in a production environment.
> In the assert, the handler_state of the producers is 0 (UNKNOWN) rather than 
> the expected terminal state.  I'm assuming that either we are being 
> directed into the terminal state from a connection that terminates too 
> quickly, or an event has hung around too long and is being executed 
> against the state machine after it has been recycled.
> The event is HTTP_TUNNEL_EVENT_DONE
> The assert stack trace is
> FATAL: HttpSM.cc:2632: failed assert `0`
> /z/bin/traffic_server - STACK TRACE:
> /z/lib/libtsutil.so.5(+0x25197)[0x2b8bd08dc197]
> /z/lib/libtsutil.so.5(+0x23def)[0x2b8bd08dadef]
> /z/bin/traffic_server(HttpSM::tunnel_handler_post_or_put(HttpTunnelProducer*)+0xcd)[0x5982ad]
> /z/bin/traffic_server(HttpSM::tunnel_handler_post(int, void*)+0x86)[0x5a32d6]
> /z/bin/traffic_server(HttpSM::main_handler(int, void*)+0xd8)[0x5a1e18]
> /z/bin/traffic_server(HttpTunnel::main_handler(int, void*)+0xee)[0x5dd6ae]
> /z/bin/traffic_server(write_to_net_io(NetHandler*, UnixNetVConnection*, 
> EThread*)+0x136e)[0x721d1e]
> /z/bin/traffic_server(NetHandler::mainNetEvent(int, Event*)+0x28c)[0x7162fc]
> /z/bin/traffic_server(EThread::process_event(Event*, int)+0x91)[0x744df1]
> /z/bin/traffic_server(EThread::execute()+0x4fc)[0x7458ac]
> /z/bin/traffic_server[0x7440ca]
> /lib64/libpthread.so.0(+0x7034)[0x2b8bd1ee4034]
> /lib64/libc.so.6(clone+0x6d)[0x2b8bd2c2875d]
> The segfault stack trace is 
> /z/bin/traffic_server - STACK TRACE: 
> /lib64/libpthread.so.0(+0xf280)[0x2abccd0d8280]
> /z/bin/traffic_server(HttpSM::tunnel_handler_ua(int, 
> HttpTunnelConsumer*)+0x122)[0x591462]
> /z/bin/traffic_server(HttpTunnel::consumer_handler(int, 
> HttpTunnelConsumer*)+0x9e)[0x5dd15e]
> /z/bin/traffic_server(HttpTunnel::main_handler(int, void*)+0x117)[0x5dd6d7]
> /z/bin/traffic_server(UnixNetVConnection::mainEvent(int, 
> Event*)+0x3f0)[0x725190]
> /z/bin/traffic_server(InactivityCop::check_inactivity(int, 
> Event*)+0x275)[0x716b75]
> /z/bin/traffic_server(EThread::process_event(Event*, int)+0x91)[0x744df1]
> /z/bin/traffic_server(EThread::execute()+0x2fb)[0x7456ab]
> /z/bin/traffic_server[0x7440ca]
> /lib64/libpthread.so.0(+0x7034)[0x2abccd0d0034]
> /lib64/libc.so.6(clone+0x6d)[0x2abccde1475d]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)