[ https://issues.apache.org/jira/browse/TS-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548770#comment-15548770 ]
Gancho Tenev commented on TS-4916:
----------------------------------

Looked into this more and here are my findings / hypothesis.

*Studied the list code (lib/ts/List.h)* and found a couple of problems (filed [TS-4935|https://issues.apache.org/jira/browse/TS-4935]). If the list is used improperly, its internal structure gets damaged silently. If the same element is added twice in a row, the element’s “next” pointer ends up pointing at the element itself and all of the pre-existing list content is lost. All further additions will appear to succeed, but the next list traversal will be infinite.

*How would we add the same element twice?* Since a memory pool is used to instantiate the streams, it is entirely possible for the pool to return exactly the same memory chunk for two different streams.

*How could adding the same chunk happen?*
# A stream N is created and used; no new streams are created in the meantime.
# Stream N is closed and its memory chunk is released back to the pool when the stream is destroyed.
# We fail to remove stream N from the list of active streams (the bug!).
# A new stream N+1 is created right after stream N is destroyed, and it gets exactly the same memory chunk from the memory pool that stream N used.
# The new stream N+1 is added to the list of active streams, which adds the same memory chunk to the list for a second time in a row and damages the list’s internal structure.
# New streams can still be added and deleted after this point, but the next iteration over the active stream list will be infinite.

*Hypothesis validation*
By the time we identify the infinite loop (which is pretty straightforward), all of the useful info about how we got into this state is gone, so in order to validate this hypothesis I had to collect some more data.
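The double-add failure mode described above can be sketched with a minimal intrusive list. This is *not* the actual lib/ts/List.h code; the {{Stream}}/{{StreamList}} names and the push/remove operations are simplified assumptions that mimic an intrusive doubly-linked list with no double-insertion guard:

{code:cpp}
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for an HTTP/2 stream; intrusive list links,
// in the spirit of lib/ts/List.h (simplified, not the real code).
struct Stream {
  int id;
  Stream *next = nullptr;
  Stream *prev = nullptr;
};

struct StreamList {
  Stream *head = nullptr;

  // push() does not check whether the element is already enqueued,
  // mirroring the silent-corruption behavior described above.
  void push(Stream *s) {
    s->next = head;
    s->prev = nullptr;
    if (head) head->prev = s;
    head = s;
  }

  // The "catch all" safety net amounts to calling this unconditionally
  // before destroying a stream, so a re-pooled chunk is never pushed twice.
  void remove(Stream *s) {
    if (s->prev) s->prev->next = s->next;
    if (s->next) s->next->prev = s->prev;
    if (head == s) head = s->next;
    s->next = s->prev = nullptr;
  }

  // Count elements, capped so the demo terminates even when the list
  // has been corrupted into a self-loop.
  int count(int limit = 100) const {
    int n = 0;
    for (Stream *s = head; s && n < limit; s = s->next) ++n;
    return n;
  }
};

int main() {
  Stream a{19}, b{27};
  StreamList list;
  list.push(&a);
  list.push(&b);
  // Bug scenario: the same chunk is pushed twice in a row. After the
  // second push, b.next points at b itself and a is unreachable, so any
  // traversal spins forever (here the cap stops it).
  list.push(&b);
  std::printf("b.next == &b: %d\n", b.next == &b);
  std::printf("capped traversal count: %d\n", list.count());
  return 0;
}
{code}

After the duplicate push the traversal hits the cap instead of terminating, which is exactly the shape of the {{restart_streams}} hang; an assert (or remove-before-push) in {{push()}} would catch it at the point of corruption rather than at the next traversal.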
Instrumented the code: added a check for failures to remove the stream from the list of active streams right before destroying it, and in case of failure removed it there (as a “catch all” safety net just before destruction). It usually took 1-3 days to reach the infinite loop / deadlock after a restart. Ran the experimental code for 3+ days without getting into the infinite loop / deadlock state, and the collected data indicates that, without the safety net, it would have failed to remove the stream from the active stream list (which would trigger the infinite-loop state) 4 times during that period. I believe this validates the hypothesis.

*Next steps*
Identified an execution path which can fail to remove the element from the list before destroying the stream. Implemented a patch which I just started testing in prod and will provide an update as soon as I validate it.

Please let me know if more info is needed or something does not make sense and I will be happy to look into it!

Cheers,
--Gancho


> Http2ConnectionState::restart_streams infinite loop causes deadlock
> --------------------------------------------------------------------
>
>                 Key: TS-4916
>                 URL: https://issues.apache.org/jira/browse/TS-4916
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, HTTP/2
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>            Priority: Blocker
>             Fix For: 7.1.0
>
>
> Http2ConnectionState::restart_streams falls into an infinite loop while
> holding a lock, which leads to cache updates starting to fail.
> The infinite loop is caused by traversing a list whose last element’s “next”
> points to the element itself, so the traversal never finishes.
> {code}
> Thread 51 (Thread 0x2aaab3d04700 (LWP 34270)):
> #0  0x00002aaaaacf3fee in Http2ConnectionState::restart_streams (this=0x2ae6ba5284c8) at Http2ConnectionState.cc:913
> #1  rcv_window_update_frame (cstate=..., frame=...) at Http2ConnectionState.cc:627
> #2  0x00002aaaaacf9738 in Http2ConnectionState::main_event_handler (this=0x2ae6ba5284c8, event=<optimized out>, edata=<optimized out>) at Http2ConnectionState.cc:823
> #3  0x00002aaaaacef1c3 in Continuation::handleEvent (data=0x2aaab3d039a0, event=2253, this=0x2ae6ba5284c8) at ../../iocore/eventsystem/I_Continuation.h:153
> #4  send_connection_event (cont=cont@entry=0x2ae6ba5284c8, event=event@entry=2253, edata=edata@entry=0x2aaab3d039a0) at Http2ClientSession.cc:58
> #5  0x00002aaaaacef462 in Http2ClientSession::state_complete_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:426
> #6  0x00002aaaaacf0982 in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=0x2ae6ba528290) at ../../iocore/eventsystem/I_Continuation.h:153
> #7  Http2ClientSession::state_start_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:399
> #8  0x00002aaaaacef5a3 in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=0x2ae6ba528290) at ../../iocore/eventsystem/I_Continuation.h:153
> #9  Http2ClientSession::state_complete_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:431
> #10 0x00002aaaaacf0982 in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=0x2ae6ba528290) at ../../iocore/eventsystem/I_Continuation.h:153
> #11 Http2ClientSession::state_start_frame_read (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:399
> #12 0x00002aaaaae67e2b in Continuation::handleEvent (data=0x2aab7b237f18, event=100, this=<optimized out>) at ../../iocore/eventsystem/I_Continuation.h:153
> #13 read_signal_and_update (vc=0x2aab7b237e00, vc@entry=0x1, event=event@entry=100) at UnixNetVConnection.cc:153
> #14 UnixNetVConnection::readSignalAndUpdate (this=this@entry=0x2aab7b237e00, event=event@entry=100) at UnixNetVConnection.cc:1036
> #15 0x00002aaaaae47653 in SSLNetVConnection::net_read_io (this=0x2aab7b237e00, nh=0x2aaab2409cc0, lthread=0x2aaab2406000) at SSLNetVConnection.cc:595
> #16 0x00002aaaaae5558c in NetHandler::mainNetEvent (this=0x2aaab2409cc0, event=<optimized out>, e=<optimized out>) at UnixNet.cc:513
> #17 0x00002aaaaae8d2e6 in Continuation::handleEvent (data=0x2aaab0bfa700, event=5, this=<optimized out>) at I_Continuation.h:153
> #18 EThread::process_event (calling_code=5, e=0x2aaab0bfa700, this=0x2aaab2406000) at UnixEThread.cc:148
> #19 EThread::execute (this=0x2aaab2406000) at UnixEThread.cc:275
> #20 0x00002aaaaae8c0e6 in spawn_thread_internal (a=0x2aaab0b25bb0) at Thread.cc:86
> #21 0x00002aaaad6b3aa1 in start_thread (arg=0x2aaab3d04700) at pthread_create.c:301
> #22 0x00002aaaae8bc93d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
> {code}
> Here is the stream_list trace.
> {code}
> (gdb) thread 51
> [Switching to thread 51 (Thread 0x2aaab3d04700 (LWP 34270))]
> #0  0x00002aaaaacf3fee in Http2ConnectionState::restart_streams (this=0x2ae6ba5284c8) at Http2ConnectionState.cc:913
> (gdb) trace_list stream_list
> ------- count=0 -------
> id=29
> this=0x2ae673f0c840
> next=0x2aaac05d8900
> prev=(nil)
> ------- count=1 -------
> id=27
> this=0x2aaac05d8900
> next=0x2ae5b6bbec00
> prev=0x2ae673f0c840
> ------- count=2 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> ------- count=3 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> . . .
> ------- count=5560 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> . . .
> {code}
> Currently I am working on finding out why the list in question got into this
> “impossible” (broken) state and eventually coming up with a fix.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)