[ 
https://issues.apache.org/jira/browse/TS-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583311#comment-15583311
 ] 

Thomas Jackson commented on TS-4970:
------------------------------------

After spending some time looking at it-- [~shinrich] is correct, I was misusing 
the m_deleted field. Taking a step back, my primary issue with this bug is that 
we are double-freeing something. From looking at the diff of TS-4590, I see that 
the mechanism for guarding against this is the `m_free_magic` member, but that 
is a bit broken in 5.2 and 6.2. So, instead of my current patch, I am changing 
to use the mutex pointer as the "are we deleted" flag to avoid the double free. 
I've updated the PRs, and will ping back once I've finished rolling out the 
change on our infra.
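
Roughly, the shape of what I mean is below (a sketch only-- the exact clear/free 
calls are elided, and the real change is in the PRs):
{code}
// Sketch: treat "mutex.m_ptr == NULL" as the "already destroyed" marker, so
// the free path can only run once.
void
INKVConnInternal::destroy()
{
  if (mutex.m_ptr == NULL) {
    // The mutex reference has already been dropped, i.e. destroy() already
    // ran; bail out instead of freeing the object a second time.
    return;
  }

  m_deleted = true;
  if (m_deletable) {
    // ... clear the read/write VIOs, drop the mutex reference (which is what
    // NULLs m_ptr), and hand the object back to its allocator, exactly once.
  }
}
{code}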

> Crash in INKVConnInternal when handle_event is called after destroy()
> ---------------------------------------------------------------------
>
>                 Key: TS-4970
>                 URL: https://issues.apache.org/jira/browse/TS-4970
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: HTTP
>            Reporter: Thomas Jackson
>            Assignee: Thomas Jackson
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
> where the downstream origin is down, with a backtrace that looks something 
> like:
> {code}
> (gdb) bt
> #0  0x0000000000000000 in ?? ()
> #1  0x00000000004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
>     edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
> #2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
>     edata=0x2afe6399fc40) at InkAPI.cc:1060
> #3  0x00000000006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
>     calling_code=1) at I_Continuation.h:146
> #4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
>     calling_code=1) at UnixEThread.cc:144
> #5  0x00000000006f993b in EThread::execute (this=0x2afe3dd07000)
>     at UnixEThread.cc:195
> #6  0x00000000006f832a in spawn_thread_internal (a=0x2afe3badf400)
>     at Thread.cc:88
> #7  0x0000003861c079d1 in start_thread () from /lib64/libpthread.so.0
> #8  0x00000038614e8b5d in clone () from /lib64/libc.so.6
> {code}
> This looks a bit odd-- frame 0 is missing. From digging into it a bit more 
> (with the help of [~amc]), we found that the VC we were calling into was an 
> INKContInternal (meaning an INKVConnInternal):
> {code}
> (gdb) p (INKVConnInternal) *vc_server
> $5 = {<INKContInternal> = {<DummyVConnection> = {<VConnection> = 
> {<Continuation> = {<force_VFPT_to_top> = {_vptr.force_VFPT_to_top = 
> 0x2afe63a93170}, 
>           handler = (int (Continuation::*)(Continuation *, int, 
>     void *)) 0x4cfd90 <INKVConnInternal::handle_event(int, void*)>, mutex = {
>             m_ptr = 0x0}, link = {<SLink<Continuation>> = {next = 0x0}, 
>             prev = 0x0}}, lerrno = 20600}, <No data fields>}, 
>     mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
>      <(anonymous namespace)::handleTransformationPluginEvents(TSCont, 
> TSEvent, void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, 
> m_deleted = 1, 
>     m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
>     nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
>     vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
>     nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
>     vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
>   m_output_vc = 0x2afe63091a88}
> {code}
> From looking at the debug logs leading up to the crash, I'm seeing that some 
> events (namely timeout events) are being delivered after the VConn has been 
> destroy()'d. This led me to find that INKVConnInternal::handle_event actually 
> checks whether that is the case-- and then re-destroys everything, which 
> makes no sense.
> So while the ideal would be to never call handle_event on a destroyed VConn, 
> crashing is definitely not acceptable. My solution is to continue to only 
> call the event handler if the VConn hasn't been deleted-- but instead of 
> attempting to re-destroy the connection, we'll leave it be (unless we are in 
> debug mode, where I'll throw in an assert).
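> Roughly, that guard would look something like this (an illustrative sketch 
> only, not the exact patch):
> {code}
> int
> INKVConnInternal::handle_event(int event, void *edata)
> {
>   // ... event accounting elided ...
>   if (m_deleted) {
>     // handle_event fired after destroy(); in a debug build complain loudly,
>     // otherwise just drop the event instead of re-destroying the object.
>     ink_assert(!"INKVConnInternal::handle_event called on a deleted VConn");
>     return EVENT_DONE;
>   }
>   return m_event_func((TSCont)this, (TSEvent)event, edata);
> }
> {code}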
> I did some looking at this on ATS7, and it looks like this is all fixed there 
> by the cleanup of the VConn freeing logic 
> (https://github.com/apache/trafficserver/pull/752/files).



