Bugs item #3390207, was opened at 2011-08-11 22:25 Message generated for change (Settings changed) made by bogdan_iancu You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3390207&group_id=232389
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: modules Group: trunk Status: Open Resolution: None Priority: 5 Private: No Submitted By: Muhammad Shahzad (shari_786pk) >Assigned to: Vladut-Stefan Paiu (vladut-paiu) Summary: Segfault while freeing dialog Initial Comment: Hi, OpenSIPs crashes randomly while processing Bye reply, when destroying / freeing the associated dialog. So far, this has happened 4 times in last 2 days on one of my production servers. There are two core dumps generated upon each crash, BT on first core dump gives following, Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 -u root -g root'. Program terminated with signal 11, Segmentation fault. #0 0x97761500 in ?? () (gdb) bt #0 0x97761500 in ?? () #1 0xb723abd3 in free_dlg_dlg (dlg=<value optimized out>) at dlg_hash.c:163 #2 0xb7243f5f in destroy_dlg (dlg=0x979d04a0, cnt=1) at dlg_hash.c:225 #3 unref_dlg (dlg=0x979d04a0, cnt=1) at dlg_hash.c:742 #4 0xb7247a42 in dual_bye_event (dlg=0x979d04a0, req=<value optimized out>, extra_unref=1) at dlg_req_within.c:274 #5 0xb72491c0 in bye_reply_cb (t=0x976af43c, type=256, ps=0xbfddce54) at dlg_req_within.c:294 #6 0xb728e657 in run_trans_callbacks (type=256, trans=0x976af43c, req=0x0, rpl=0x82d241c, code=200) at t_hooks.c:212 #7 0xb72a6d93 in local_reply (t=0x976af43c, p_msg=0x82d241c, branch=0, msg_status=200, cancel_bitmap=0xbfddcf88) at t_reply.c:1358 #8 0xb72a82c2 in reply_received (p_msg=0x82d241c) at t_reply.c:1503 #9 0x0806c913 in forward_reply (msg=0x82d241c) at forward.c:568 #10 0x080a4d32 in receive_msg ( buf=0x81d7480 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 77.66.2.136;branch=z9hG4bK2aa.63be8597.0;received=77.66.2.136\r\nFrom: <sip:+642102908458@77.66.2.136>;tag=as2bde889d\r\nTo: <sip:01100165022863605@77.66.2.136:5060;user=p"..., len=428, rcv_info=<value optimized out>) at receive.c:203 #11 0x080fbd5e in udp_rcv_loop () at udp_server.c:419 #12 0x08074d28 in main_loop (argc=9, argv=0xbfddd2f4) at main.c:879 #13 main (argc=9, argv=0xbfddd2f4) at main.c:1497 BT on second core dump gives following, Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 -u root -g root'. Program terminated with signal 6, Aborted. #0 0xb78ab424 in __kernel_vsyscall () (gdb) bt #0 0xb78ab424 in __kernel_vsyscall () #1 0xb75c3751 in raise () from /lib/i686/cmov/libc.so.6 #2 0xb75c6b82 in abort () from /lib/i686/cmov/libc.so.6 #3 0x08070d64 in sig_alarm_abort (signo=14) at main.c:435 #4 <signal handler called> #5 get_lock (ticks=0, param=0x0) at ../../evi/../fastlock.h:178 #6 dialog_update_db (ticks=0, param=0x0) at dlg_db_handler.c:972 #7 0xb72249e1 in mod_destroy () at dialog.c:755 #8 0x080ce108 in destroy_modules () at sr_module.c:371 #9 0x08070858 in cleanup (show_status=1) at main.c:344 #10 0x080714a9 in handle_sigs () at main.c:545 #11 0x08075245 in main_loop (argc=9, argv=0xbfddd2f4) at main.c:990 #12 main (argc=9, argv=0xbfddd2f4) at main.c:1497 The most disturbing part of these crashes for me is that it causes all running calls to hang up. Previously in case of OpenSIPs crash running calls use to remain unaffected. I am using SVN trunk revision 8215, and have create_dialog("BPp") on very first INVITE in my dial plan. Feel free to ask for more info as needed. Thank you. ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-18 09:44 Message: Since, my production machine is stable now so, i won't risk it to crash again. However, i do have a production clone, with exact same configuration as production. Let me put my team on it and see if they could reproduce it there while debug logs and sip traces are enabled. Thank you. ---------------------------------------------------------------------- Comment By: Vladut-Stefan Paiu (vladut-paiu) Date: 2011-08-17 19:55 Message: Removing the "PpB" parameters "solved" your problem because that's where the bug originated. OpenSIPS encounters a double free while processing the reply to a 200 Ok to a BYE generated internally by OpenSIPS. Which can be the case either when pinging is enabled and one endpoint does not respond in time, or when the overall dialog time limit is exceeded. I have failed to reproduce your scenario, so It would help a lot if you could either provide a full debug log, or if that isn't possible, maybe a full SIP trace for the call that makes OpenSIPS crash. And yes, the load_balancer issue seems like a different issue. Please open a separate bug report for this. ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-17 10:14 Message: I have removed all parameters from create_dialog function and system seems bit stable now. Though i am still watching for crash. However, this has created a new problem, that is, i am also using load_balancer module which is relaying calls to my media servers array. Now the problem is that whenever i execute lb_list command via any management interface i get high number of calls on each media server, while checking the media servers i get much less number of running calls. Which implies something is wrong with load_balancer module stats. I wonder what will happen if this wrong call count reaches to call capacity mentioned in load_balancer db table? Should i open a separate bug for this problem? Thank you. ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-15 20:37 Message: couldn't found any thing matching "bogus ref" but do find a critical warning matching "bogus event", just before opensips crash. CRITICAL:dialog:log_next_state_dlg: bogus event 7 in state 2 for dlg 0x970b8874 [138:1657090481] with clid '584da4f976b14ce053d3c9345a09391e@77.66.2.136' and tags 'as6fd68670' 'SDopvfa99-9302873325129137220' CRITICAL:core:receive_fd: EOF on 37 NOTICE:presence:destroy: destroy module ... CRITICAL:core:sig_alarm_abort: BUG - shutdown timeout triggered, dying... ---------------------------------------------------------------------- Comment By: Vladut-Stefan Paiu (vladut-paiu) Date: 2011-08-15 18:26 Message: Hello, Could you please check the logs for critical messages saying "bogus ref" on the calls that you experience the crashes ? ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-13 16:48 Message: It crashed again. This time with revision 8201. BT indicates same problem. Here is BT full. #0 0x97879f00 in ?? () No symbol table info available. #1 0xb71ec703 in free_dlg_dlg (dlg=<value optimized out>) at dlg_hash.c:163 i = 0 __FUNCTION__ = "free_dlg_dlg" #2 0xb71f5a7f in destroy_dlg (dlg=0x9784ae04, cnt=1) at dlg_hash.c:225 ret = <value optimized out> #3 unref_dlg (dlg=0x9784ae04, cnt=1) at dlg_hash.c:742 d_entry = 0x970a8c50 __FUNCTION__ = "unref_dlg" #4 0xb71f9562 in dual_bye_event (dlg=0x9784ae04, req=<value optimized out>, extra_unref=1) at dlg_req_within.c:274 old_state = 5 new_state = 5 unref = 1 ret = <value optimized out> __FUNCTION__ = "dual_bye_event" #5 0xb71face0 in bye_reply_cb (t=0x97760f68, type=256, ps=0xbfd4f9f4) at dlg_req_within.c:294 __FUNCTION__ = "bye_reply_cb" #6 0xb7240657 in run_trans_callbacks (type=256, trans=0x97760f68, req=0x0, rpl=0x82d23b4, code=200) at t_hooks.c:212 params = {req = 0x0, rpl = 0x82d23b4, code = 200, param = 0x97451d2c, extra1 = 0x0, extra2 = 0x0} cbp = 0x97451d20 backup = 0x81e7488 trans_backup = 0x97760f68 __FUNCTION__ = "run_trans_callbacks" #7 0xb7258d93 in local_reply (t=0x97760f68, p_msg=0x82d23b4, branch=0, msg_status=200, cancel_bitmap=0xbfd4fb28) at t_reply.c:1358 local_store = 0 local_winner = 0 reply_status = RPS_COMPLETED winning_msg = 0x974278c0 winning_code = 200 totag_retr = 0 __FUNCTION__ = "local_reply" #8 0xb725a2c2 in reply_received (p_msg=0x82d23b4) at t_reply.c:1503 last_uac_status = <value optimized out> branch = 0 reply_status = <value optimized out> timer = <value optimized out> cancel_bitmap = 0 t = 0x97760f68 backup_list = 0x0 has_reply_route = <value optimized out> __FUNCTION__ = "reply_received" ---Type <return> to continue, or q <return> to quit--- #9 0x0806c913 in forward_reply (msg=0x82d23b4) at forward.c:568 new_buf = <value optimized out> to = <value optimized out> new_len = <value optimized out> mod = 0x822ed98 proto = 136148112 id = <value optimized out> send_sock = 0x97879eff len = 137175988 __FUNCTION__ = "forward_reply" #10 0x080a4d32 in receive_msg ( buf=0x81d7480 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 77.66.2.136;branch=z9hG4bK6a98.fe2e7586.0;received=77.66.2.136\r\nFrom: <sip:+61430278886@77.66.2.136>;tag=as0939115b\r\nTo: <sip:011008801715609898@77.66.2.136:5060;user="..., len=429, rcv_info=<value optimized out>) at receive.c:203 msg = 0x82d23b4 start = {tv_sec = -1219149052, tv_usec = 137175912} __FUNCTION__ = "receive_msg" #11 0x080fbd5e in udp_rcv_loop () at udp_server.c:419 len = 429 buf = "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 77.66.2.136;branch=z9hG4bK6a98.fe2e7586.0;received=77.66.2.136\r\nFrom: <sip:+61430278886@77.66.2.136>;tag=as0939115b\r\nTo: <sip:011008801715609898@77.66.2.136:5060;user="... tmp = <value optimized out> from = 0x82d2368 fromlen = 16 ri = {src_ip = {af = 2, len = 4, u = {addrl = {2231517773, 3079082512, 3218406760, 1}, addr32 = {2231517773, 3079082512, 3218406760, 1}, addr16 = {16973, 34050, 4624, 46983, 64872, 49108, 1, 0}, addr = "MB\002\205\020\022\207\267h\375Ô¿\001\000\000"}}, dst_ip = {af = 2, len = 4, u = {addrl = { 2281849421, 0, 0, 0}, addr32 = {2281849421, 0, 0, 0}, addr16 = {16973, 34818, 0, 0, 0, 0, 0, 0}, addr = "MB\002\210", '\000' <repeats 11 times>}}, src_port = 5060, dst_port = 5060, proto = 1, proto_reserved1 = 0, proto_reserved2 = 0, src_su = {s = {sa_family = 2, sa_data = "\023\304MB\002\205\000\000\000\000\000\000\000"}, sin = {sin_family = 2, sin_port = 50195, sin_addr = {s_addr = 2231517773}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 50195, sin6_flowinfo = 2231517773, sin6_addr = {__in6_u = { __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}}, bind_address = 0x822dae0} p = <value optimized out> __FUNCTION__ = "udp_rcv_loop" #12 0x08074d28 in main_loop (argc=9, argv=0xbfd4fe94) at main.c:879 i = <value optimized out> pid = <value optimized out> si = <value optimized out> startup_done = 0x0 chd_rank = 4 #13 main (argc=9, argv=0xbfd4fe94) at main.c:1497 cfg_log_stderr = 136501984 cfg_stream = 0x3 ---Type <return> to continue, or q <return> to quit--- c = <value optimized out> r = <value optimized out> tmp = 0xbfd51970 "" tmp_len = <value optimized out> port = <value optimized out> proto = <value optimized out> ret = <value optimized out> seed = 734867912 rfd = <value optimized out> __FUNCTION__ = "main" ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-12 13:54 Message: well, as far as i remember 8201 never crashed on its own, however, i did observed crash a couple of time while trying to stop the service (for maintenance reasons), but ignored them. The back trace of those crashes showed mysql update failed while sync dialog run time info to db. I didn't saved those core dumps to can't give much details. Anyways, my production server is now on 8201 and let see if it crashes, i will send you its BT. ---------------------------------------------------------------------- Comment By: Vladut-Stefan Paiu (vladut-paiu) Date: 2011-08-12 13:37 Message: Hello again, This second backtrace seems somehow corrupt. You are saying the 8201 is ok, no crashes as the reported ones ? While 8215 shows these problems ? ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-12 12:35 Message: i updated the trunk to revision 8242, and now it crashes due to some memory leak, may be its a different bug. Anyway, here is BT. Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 -u root -g root'. Program terminated with signal 6, Aborted. #0 0xb7706424 in __kernel_vsyscall () (gdb) bt #0 0xb7706424 in __kernel_vsyscall () #1 0xb741e751 in raise () from /lib/i686/cmov/libc.so.6 #2 0xb7421b82 in abort () from /lib/i686/cmov/libc.so.6 #3 0x0810344e in qm_insert_free (qm=0xb70b0337, p=0x970dac6c, file=0xb70b0337 "L:dialog:%s: extract_ftc_hdrs ok but no from extracted : [%.*s]\n", func=0xb70b0dbb "g:%s: no more shm\n", line=171) at mem/q_malloc.c:169 #4 qm_free (qm=0xb70b0337, p=0x970dac6c, file=0xb70b0337 "L:dialog:%s: extract_ftc_hdrs ok but no from extracted : [%.*s]\n", func=0xb70b0dbb "g:%s: no more shm\n", line=171) at mem/q_malloc.c:491 #5 0xb7095cf6 in dlg_onreply (dialog=0x800) at dlg_handlers.c:477 #6 unreference_dialog_create (dialog=0x800) at dlg_handlers.c:776 #7 0xb709ef5f in unref_dlg (dlg=0x970da634, cnt=1) at dlg_hash.c:739 #8 0xb70a2a42 in new_dlg_profile (profiles=0x1 <Address 0x1 out of bounds>, has_value=3070347656) at dlg_profile.c:278 #9 add_profile_definitions (profiles=0x1 <Address 0x1 out of bounds>, has_value=3070347656) at dlg_profile.c:168 #10 0xb70a41c0 in send_leg_msg (dlg=0x979541a8, method=0x100, src_leg=-1078038796, dst_leg=0, hdrs=0x97369734, body=0x979ba664, func=0x81a9e68 <debug>, param=0x972ac048, release=0) at dlg_req_within.c:494 #11 0xb70e9657 in run_trans_callbacks (type=256, trans=0x979541a8, req=0x0, rpl=0xffffffff, code=408) at t_hooks.c:212 #12 0xb7101d93 in local_reply (t=0x979541a8, p_msg=0xffffffff, branch=0, msg_status=408, cancel_bitmap=0xbfbe702c) at t_reply.c:1358 #13 0xb70ec0a3 in fake_reply (ticks=70, attr=0x0) at timer.c:253 #14 final_response_handler (ticks=70, attr=0x0) at timer.c:364 #15 timer_routine (ticks=70, attr=0x0) at timer.c:1005 #16 0x080e42fa in utimer_ticker () at timer.c:369 #17 run_timer_process () at timer.c:407 #18 start_timer_processes () at timer.c:522 #19 0x080751bd in main_loop (argc=9, argv=0xbfbe71e4) at main.c:936 #20 main (argc=9, argv=0xbfbe71e4) at main.c:1497 It has crashes 3 times in last six hours, so i am think of reverting it to last stable revision 8201. :( ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-12 12:23 Message: I am afraid i won't be able to do that since crashes are random and the server where its happening have a lot of production traffic (1500+ sip registrations, 300+ calls etc.) on it. I do have the core dump though, but their sizes are around 500+MB. ---------------------------------------------------------------------- Comment By: Vladut-Stefan Paiu (vladut-paiu) Date: 2011-08-12 11:27 Message: Hello Muhammad, Is it possible that you also attach an OpenSIPS log in full debug for a call where this crash happens ? I have an idea about the crash, but I would need a log in debug=4 to be sure. Thanks and Regards, Vlad ---------------------------------------------------------------------- Comment By: Muhammad Shahzad (shari_786pk) Date: 2011-08-12 02:50 Message: Another crash, this time i was watching opensips logs and got this additional info, Aug 11 23:46:05 nghw-2674 osip-service[1803]: CRITICAL:core:qm_free: freeing already freed pointer, first free: dlg_hash.c: free_dlg_dlg(171) - aborting Aug 11 23:46:21 nghw-2674 osip-service[1838]: CRITICAL:core:receive_fd: EOF on 37 Aug 11 23:46:21 nghw-2674 osip-service[1779]: NOTICE:presence:destroy: destroy module ... ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3390207&group_id=232389 _______________________________________________ Devel mailing list Devel@lists.opensips.org http://lists.opensips.org/cgi-bin/mailman/listinfo/devel