[jira] [Commented] (TS-1545) possible crash in records stat snap
[ https://issues.apache.org/jira/browse/TS-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874896#comment-13874896 ]

Leif Hedstrom commented on TS-1545:
-----------------------------------

What do we think here? Would it be worthwhile trying to detect this, and nuke the .snap files?

possible crash in records stat snap
-----------------------------------

Key: TS-1545
URL: https://issues.apache.org/jira/browse/TS-1545
Project: Traffic Server
Issue Type: Bug
Components: Core, Stats
Affects Versions: 3.3.0
Environment: debug build, long time regression testing
Reporter: Zhao Yongming
Labels: Crash
Fix For: 4.2.0
Attachments: records.snap, stats.snap

When running regression testing in a loop script, the server eventually fails to start. The code may handle the problem when built without --enable-debug, but it looks like something can break the records snap. Opening this issue in case someone needs it.

{code}
[TrafficServer] using root directory '/opt/ats'
FATAL: RecMessage.cc:426: failed assert `eh->magic == REC_MESSAGE_ELE_MAGIC`
/opt/ats/bin/traffic_server - STACK TRACE:
/opt/ats/lib/libtsutil.so.3(ink_fatal_die+0x0)[0x77baeca1]
/opt/ats/lib/libtsutil.so.3(_Z12ink_get_randv+0x0)[0x77badbb8]
/opt/ats/bin/traffic_server(_Z23RecMessageUnmarshalNextP13RecMessageHdrP13RecMessageItrPP9RecRecord+0xbf)[0x6ed860]
/opt/ats/bin/traffic_server(_Z16RecReadStatsFilev+0xc1)[0x6e5e36]
/opt/ats/bin/traffic_server(_Z11RecCoreInit8RecModeTP5Diags+0xec)[0x6e254b]
/opt/ats/bin/traffic_server(_Z14RecProcessInit8RecModeTP5Diags+0x3b)[0x6e7769]
/opt/ats/bin/traffic_server[0x51d4a5]
/opt/ats/bin/traffic_server(main+0x1df)[0x51ee39]
/lib64/libc.so.6(__libc_start_main+0xed)[0x7515b60d]
/opt/ats/bin/traffic_server[0x4d8f99]

Program received signal SIGABRT, Aborted.
0x7516ec15 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x7516ec15 in raise () from /lib64/libc.so.6
#1 0x7517008b in abort () from /lib64/libc.so.6
#2 0x77baeb2c in ink_die_die_die (retval=1) at ink_error.cc:43
#3 0x77baebfe in ink_fatal_va(int, const char *, typedef __va_list_tag __va_list_tag *) (return_code=1, message_format=0x77bca3e0 "%s:%d: failed assert `%s`", ap=0x7fffc8b8) at ink_error.cc:65
#4 0x77baeca1 in ink_fatal (return_code=1, message_format=0x77bca3e0 "%s:%d: failed assert `%s`") at ink_error.cc:73
#5 0x77badbb8 in _ink_assert (expression=0x76ffa0 "eh->magic == REC_MESSAGE_ELE_MAGIC", file=0x76fe40 "RecMessage.cc", line=426) at ink_assert.cc:38
#6 0x006ed860 in RecMessageUnmarshalNext (msg=0xfe6110, itr=0x7fffca00, record=0x7fffca10) at RecMessage.cc:426
#7 0x006e5e36 in RecReadStatsFile () at P_RecCore.i:569
#8 0x006e254b in RecCoreInit (mode_type=RECM_STAND_ALONE, _diags=0xfe5f70) at RecCore.cc:209
#9 0x006e7769 in RecProcessInit (mode_type=RECM_STAND_ALONE, _diags=0xfe5f70) at RecProcess.cc:313
#10 0x0051d4a5 in initialize_process_manager () at Main.cc:413
#11 0x0051ee39 in main (argc=1, argv=0x7fffdd08) at Main.cc:1409
(gdb) f 6
#6 0x006ed860 in RecMessageUnmarshalNext (msg=0xfe6110, itr=0x7fffca00, record=0x7fffca10) at RecMessage.cc:426
426 ink_debug_assert(eh->magic == REC_MESSAGE_ELE_MAGIC);
(gdb) l
421 itr->ele_hdr = (RecMessageEleHdr *) ((char *) (msg) + itr->ele_hdr->o_next);
422 itr->next += 1;
423 eh = itr->ele_hdr;
424 }
425
426 ink_debug_assert(eh->magic == REC_MESSAGE_ELE_MAGIC);
427
428 // If the file is corrupt, ignore the the rest of the file.
429 if (eh->magic != REC_MESSAGE_ELE_MAGIC) {
430 Warning("Persistent statistics file records.stat is corrupted. Ignoring the rest of the file\n");
(gdb) p eh->magic
$1 = 0
(gdb) p REC_MESSAGE_ELE_MAGIC
$2 = 4027445261
(gdb) p eh
$3 = (RecMessageEleHdr *) 0xff4138
(gdb) p *eh
$4 = {magic = 0, o_next = 0}
(gdb) f 7
#7 0x006e5e36 in RecReadStatsFile () at P_RecCore.i:569
569 } while (RecMessageUnmarshalNext(m, itr, r) != REC_ERR_FAIL);
(gdb) l
564 if (RecMessageUnmarshalFirst(m, itr, r) != REC_ERR_FAIL) {
565 do {
566 if ((r->name == NULL) || (!strlen(r->name)))
567 continue;
568 RecSetRecord(r->rec_type, r->name, r->data_type, &(r->data), &(r->stat_meta.data_raw), false);
569 } while (RecMessageUnmarshalNext(m, itr, r) != REC_ERR_FAIL);
570 }
571 }
572
573 ink_rwlock_unlock(g_records_rwlock);
(gdb) p r
$5 = (RecRecord *) 0xff4070
(gdb) p *r
$6 = {rec_type = RECT_PROCESS, name = 0xff4118 , data_type = RECD_INT, data = {rec_int = 0, rec_float = 0, rec_string = 0x0, rec_counter = 0},
[jira] [Commented] (TS-1545) possible crash in records stat snap
[ https://issues.apache.org/jira/browse/TS-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874904#comment-13874904 ]

William Bardwell commented on TS-1545:
--------------------------------------

I think so; it can leave a machine in a loop where it won't work at all without deleting those files. The current code would need a lot more validation of the data that is being read in. Maybe it needs fancier marshalling out so you can just tell that a file was partially written. (Magic number that is written last, or something similar.)

possible crash in records stat snap
-----------------------------------

Key: TS-1545
URL: https://issues.apache.org/jira/browse/TS-1545
Project: Traffic Server
Issue Type: Bug
Components: Core, Stats
Affects Versions: 3.3.0
Environment: debug build, long time regression testing
Reporter: Zhao Yongming
Labels: Crash
Fix For: 4.2.0
Attachments: records.snap, stats.snap

When running regression testing in a loop script, the server eventually fails to start. The code may handle the problem when built without --enable-debug, but it looks like something can break the records snap. Opening this issue in case someone needs it.
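
As a rough illustration of the "magic number that is written last" idea from the comment above, here is a minimal C++ sketch. It is not the real RecMessage format or any existing Traffic Server API: the layout, the names SnapHeader, snap_marshal and snap_is_complete, and both magic values are hypothetical, chosen only to show how a reader could tell a partially written file from a complete one before unmarshalling any records.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout (an assumption, not the actual snap file format):
//   [SnapHeader][payload bytes][kCommitMagic]
// The trailing magic is appended last, so an interrupted write leaves a
// buffer that is too short or has the wrong trailer.
constexpr uint32_t kHeaderMagic = 0x534E4150; // illustrative value only
constexpr uint32_t kCommitMagic = 0x444F4E45; // illustrative value only

struct SnapHeader {
  uint32_t magic;
  uint32_t payload_len;
};

// Serialize a payload; the commit marker is the last thing copied in.
std::vector<uint8_t>
snap_marshal(const std::vector<uint8_t> &payload)
{
  SnapHeader hdr{kHeaderMagic, static_cast<uint32_t>(payload.size())};
  std::vector<uint8_t> out(sizeof(hdr) + payload.size() + sizeof(kCommitMagic));
  std::memcpy(out.data(), &hdr, sizeof(hdr));
  if (!payload.empty()) {
    std::memcpy(out.data() + sizeof(hdr), payload.data(), payload.size());
  }
  std::memcpy(out.data() + sizeof(hdr) + payload.size(), &kCommitMagic, sizeof(kCommitMagic));
  return out;
}

// Return true only if the buffer holds a complete snapshot. Every length is
// checked before the corresponding bytes are read.
bool
snap_is_complete(const std::vector<uint8_t> &buf)
{
  SnapHeader hdr;
  if (buf.size() < sizeof(hdr) + sizeof(kCommitMagic)) {
    return false; // too short to hold even an empty snapshot
  }
  std::memcpy(&hdr, buf.data(), sizeof(hdr));
  if (hdr.magic != kHeaderMagic) {
    return false; // not a snapshot, or the header was never written
  }
  if (hdr.payload_len != buf.size() - sizeof(hdr) - sizeof(kCommitMagic)) {
    return false; // truncated or padded file
  }
  uint32_t trailer;
  std::memcpy(&trailer, buf.data() + sizeof(hdr) + hdr.payload_len, sizeof(trailer));
  return trailer == kCommitMagic;
}
```

A startup path built along these lines could call snap_is_complete on the file contents first and, if it returns false, log a warning, discard the file, and continue with fresh statistics instead of asserting. Whether that fits the existing RecReadStatsFile flow is an open question for the actual fix.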