[ https://issues.apache.org/jira/browse/TS-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723962#comment-13723962 ]
Yunkai Zhang commented on TS-1330: ---------------------------------- I have found the root cause of this issue, here is my analysis: 1. One thread(A) got an LogBuffer object in LogObject::_checkout_write() and ran into LogBuffer::checkout_write(), before it updated the LogBuffer::m_state, it was scheduled out by the kernel. 2. Other threads got the same LogBuffer object, but they *all* finished _checkout_write() codes and made the LogBuffer object become *full*, finally one of these thread flushed the LogBuffer object into write_list queue. 3. The LogBuffer object's m_references == 0 at this time. 4. The flushing thread would call LogBufferManager::flush_buffers() to loop write_list. As the LogBuffer's m_references == 0, so it would be flushed and *deleted*. 5. Now, thread(A) would be scheduled in by the kernel, but the LogBuffer object has been *deleted*, so the m_buffer == NULL, that is why entry_header is 0x0. Here is a coredump generated by the my testing patch - "0001-A-testing-patch-to-verify-my-analysis.patch": {code} (gdb) bt #0 0x0000003c75232885 in raise () from /lib64/libc.so.6 #1 0x0000003c75234065 in abort () from /lib64/libc.so.6 #2 0x00002b02d24bdaf8 in ink_die_die_die (retval=1) at ink_error.cc:43 #3 0x00002b02d24bdbca in ink_fatal_va(int, const char *, typedef __va_list_tag __va_list_tag *) (return_code=1, message_format=0x2b02d24dbe40 "%s:%d: failed assert `%s`", ap=0x2b0304211a90) at ink_error.cc:65 #4 0x00002b02d24bdc95 in ink_fatal (return_code=1, message_format=0x2b02d24dbe40 "%s:%d: failed assert `%s`") at ink_error.cc:73 #5 0x00002b02d24bcbc4 in _ink_assert (expression=0x74e021 "current_ref == 0", file=0x74e014 "LogObject.cc", line=60) at ink_assert.cc:38 #6 0x000000000060a9ab in LogBufferManager::flush_buffers (this=0x3219a08, sink=0x32199a0) at LogObject.cc:60 #7 0x00000000005f755e in LogObject::flush_buffers (this=0x3219970) at LogObject.h:126 #8 0x000000000060de55 in LogObjectManager::flush_buffers (this=0x31e84f0) at LogObject.cc:1185 #9 0x00000000005f6627 in Log::flush_thread_main (args=0x0) at Log.cc:1193 #10 0x00000000005f779b in LoggingFlushContinuation::mainEvent (this=0x31e8fb0, event=1, data=0x2b02d8e13c60) at Log.cc:297 #11 0x00000000004f1f1e in Continuation::handleEvent (this=0x31e8fb0, event=1, data=0x2b02d8e13c60) at ../iocore/eventsystem/I_Continuation.h:146 #12 0x000000000070fb18 in EThread::execute (this=0x2b0304013010) at UnixEThread.cc:291 #13 0x000000000070e6df in spawn_thread_internal (a=0x31e83d0) at Thread.cc:88 #14 0x0000003c756077f1 in start_thread () from /lib64/libpthread.so.0 #15 0x0000003c752e570d in clone () from /lib64/libc.so.6 (gdb) f 6 #6 0x000000000060a9ab in LogBufferManager::flush_buffers (this=0x3219a08, sink=0x32199a0) at LogObject.cc:60 60 ink_release_assert(current_ref == 0); (gdb) l 55 * m_ref_debug could be changed by other threads 56 * currently, let's store it into local variable 57 * which is helpful for debugging. 58 */ 59 int current_ref = b->m_ref_debug; 60 ink_release_assert(current_ref == 0); 61 new_q.push(b); 62 } 63 } 64 (gdb) p current_ref $2 = 2 (gdb) p b->m_references $3 = 0 (gdb) {code} Once we found out the root cause, fix is simple. We should refine the LogBuffer::m_references variaable, so that it can *really* prevent outstanding references. > Logging related segfault in 3.2.0 > --------------------------------- > > Key: TS-1330 > URL: https://issues.apache.org/jira/browse/TS-1330 > Project: Traffic Server > Issue Type: Bug > Components: Logging > Affects Versions: 3.2.0 > Environment: ATS 3.2.0 on RHEL 6.2 64-bit > Reporter: David Carlin > Assignee: Bryan Call > Priority: Critical > Labels: A, crash > Fix For: 3.5.0 > > > I observed the following crash once on one of our ATS boxes - possibly > related to TS-1240 > Jul 2 13:56:56 l2 traffic_server[25853]: {0x2b0a391e1700} ERROR: > [SSL_NetVConnection::ssl_read_from_net] SSL_ERROR_SYSCALL, underlying IO > error: Connection reset by peer > Jul 2 13:59:56 l2 kernel: [ET_NET 1][25855]: segfault at c ip > 000000000058e083 sp 00002b0a2982b740 error 6 > Jul 2 13:59:56 l2 kernel: [ET_NET 3][25857]: segfault at 84 ip > 000000000058e083 sp 00002b0a29a31740 error 6 in traffic_server[400000+340000] > Jul 2 13:59:56 l2 kernel: in traffic_server[400000+340000] > Jul 2 14:02:59 l2 traffic_cop[25901]: (test) read timeout [180000 ] > Jul 2 14:02:59 l2 traffic_cop[25901]: server heartbeat failed [1] > Jul 2 14:03:08 l2 traffic_manager[25826]: {0x7f3f088607e0} FATAL: > [LocalManager::pollMgmtProcessServer] Error in read (errno: 104) > Jul 2 14:03:09 l2 traffic_manager[25826]: {0x7f3f088607e0} FATAL: (last > system error 104: Connection reset by peer) > Jul 2 14:03:09 l2 traffic_cop[25901]: cannot find traffic_server [1] > Jul 2 14:03:09 l2 traffic_manager[25826]: {0x7f3f088607e0} ERROR: > [LocalManager::sendMgmtMsgToProcesses] Error writing message > Jul 2 14:03:09 l2 traffic_manager[25826]: {0x7f3f088607e0} ERROR: (last > system error 32: Broken pipe) > Jul 2 14:03:12 l2 traffic_cop[25901]: cop received child status signal > [25826 35584] > Jul 2 14:03:12 l2 traffic_cop[25901]: traffic_manager not running, making > sure traffic_server is dead > Jul 2 14:03:12 l2 traffic_cop[25901]: spawning traffic_manager > Jul 2 14:03:13 l2 traffic_manager[18267]: NOTE: --- Manager Starting --- > Jul 2 14:03:13 l2 traffic_manager[18267]: NOTE: Manager Version: Apache > Traffic Server - traffic_manager - 3.2.0 - (build # 52518 on Jun 25 2012 at > 18:22:12) > Jul 2 14:03:13 l2 traffic_manager[18267]: {0x7fe63de3f7e0} STATUS: opened > /home/y/logs/trafficserver/manager.log > Jul 2 14:03:15 l2 traffic_server[18322]: NOTE: --- Server Starting --- > Jul 2 14:03:15 l2 traffic_server[18322]: NOTE: Server Version: Apache > Traffic Server - traffic_server - 3.2.0 - (build # 52518 on Jun 25 2012 at > 18:22:31) > Jul 2 14:03:15 l2 traffic_server[18322]: {0x2b77573ab860} STATUS: opened > /home/y/logs/trafficserver/diags.log > Jul 2 14:03:15 l2 traffic_server[18322]: {0x2b77573ab860} ERROR: Cannot > insert duplicate! > Jul 2 14:03:22 l2 traffic_cop[25901]: server heartbeat succeeded > [Jul 2 13:56:56.304] Server {0x2b0a391e1700} ERROR: > [SSL_NetVConnection::ssl_read_from_net] SSL_ERROR_SYSCALL, underlying IO > error: Connection reset by peer > NOTE: Traffic Server received Sig 11: Segmentation fault > NOTE: Traffic Server received Sig 11: Segmentation fault > /home/y/bin/traffic_server - STACK TRACE: > /home/y/bin/traffic_server - STACK TRACE: > /lib64/libpthread.so.0[0x3b54e0f4a0] > /lib64/libpthread.so.0[0x3b54e0f4a0] > /home/y/bin/traffic_server(_ZN9LogBuffer14checkout_writeEPmm+0x153)[0x58e083] > /home/y/bin/traffic_server(_ZN9LogObject15_checkout_writeEPmm+0xa8)[0x5a64c8] > /home/y/bin/traffic_server(_ZN9LogObject3logEP9LogAccessPc+0x2f0)[0x5a7e30] > /home/y/bin/traffic_server(_ZN9LogBuffer14checkout_writeEPmm+0x153)[0x58e083] > /home/y/bin/traffic_server(_ZN9LogObject15_checkout_writeEPmm+0xa8)[0x5a64c8] > /home/y/bin/traffic_server(_ZN9LogObject3logEP9LogAccessPc+0x2f0)[0x5a7e30] > /home/y/bin/traffic_server(_ZN3Log6accessEP9LogAccess+0x146)[0x58f506] > /home/y/bin/traffic_server(_ZN6HttpSM12update_statsEv+0x630)[0x526c50] > /home/y/bin/traffic_server(_ZN3Log6accessEP9LogAccess+0x146)[0x58f506] > /home/y/bin/traffic_server(_ZN6HttpSM9kill_thisEv+0x928)[0x52b548] > /home/y/bin/traffic_server(_ZN6HttpSM12update_statsEv+0x630)[0x526c50] > /home/y/bin/traffic_server(_ZN6HttpSM9kill_thisEv+0x928)[0x52b548] > /home/y/bin/traffic_server(_ZN6HttpSM12main_handlerEiPv+0x198)[0x52b868] > /home/y/bin/traffic_server(_ZN10HttpTunnel12main_handlerEiPv+0xde)[0x56c3ee] > /home/y/bin/traffic_server[0x6736a1] > /home/y/bin/traffic_server(_ZN6HttpSM12main_handlerEiPv+0x198)[0x52b868] > /home/y/bin/traffic_server(_Z15write_to_net_ioP10NetHandlerP18UnixNetVConnectionP7EThread+0x847)[0x675517] > /home/y/bin/traffic_server[0x672f81] > /home/y/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x286)[0x66df96] > /home/y/bin/traffic_server[0x676ad5] > /home/y/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0xb4)[0x696b14] > /home/y/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x1f2)[0x66df02] > /home/y/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0xb4)[0x696b14] > /home/y/bin/traffic_server(_ZN7EThread7executeEv+0x4c3)[0x6974a3] > /home/y/bin/traffic_server[0x695ae2] > /lib64/libpthread.so.0[0x3b54e077f1] > /lib64/libc.so.6(clone+0x6d)[0x3b54ae5ccd] > /home/y/bin/traffic_server(_ZN7EThread7executeEv+0x4c3)[0x6974a3] > /home/y/bin/traffic_server[0x695ae2] > /lib64/libpthread.so.0[0x3b54e077f1] > /lib64/libc.so.6(clone+0x6d)[0x3b54ae5ccd] > [Jul 2 14:03:08.684] Manager {0x7f3f088607e0} FATAL: > [LocalManager::pollMgmtProcessServer] Error in read (errno: 104) > [Jul 2 14:03:09.213] Manager {0x7f3f088607e0} FATAL: (last system error > 104: Connection reset by peer) > [Jul 2 14:03:09.213] Manager {0x7f3f088607e0} NOTE: > [LocalManager::mgmtShutdown] Executing shutdown request. > [Jul 2 14:03:09.213] Manager {0x7f3f088607e0} NOTE: > [LocalManager::processShutdown] Executing process shutdown request. > [Jul 2 14:03:09.496] Manager {0x7f3f088607e0} ERROR: > [LocalManager::sendMgmtMsgToProcesses] Error writing message > [Jul 2 14:03:09.496] Manager {0x7f3f088607e0} ERROR: (last system error 32: > Broken pipe) > [example_alarm_bin.sh] sent alarm: l2.ycs.uls.yahoo.com [Mon Jul 2 14:03:09 > 2012] The TS-TM connection is broken for some reason. Either restart TS and > TM or correct this error for TM to display TS statistics correctly -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira