Bugs item #3390207, was opened at 2011-08-11 22:25
Message generated for change (Comment added) made by vladut-paiu
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3390207&group_id=232389

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: modules
Group: trunk
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Muhammad Shahzad (shari_786pk)
Assigned to: Vladut-Stefan Paiu (vladut-paiu)
Summary: Segfault while freeing dialog

Initial Comment:
Hi,

OpenSIPs crashes randomly while processing Bye reply, when destroying / freeing 
the associated dialog. So far, this has happened 4 times in last 2 days on one 
of my production servers. There are two core dumps generated upon each crash,

BT on first core dump gives following,

Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 
-u root -g root'.
Program terminated with signal 11, Segmentation fault.
#0  0x97761500 in ?? ()
(gdb) bt
#0  0x97761500 in ?? ()
#1  0xb723abd3 in free_dlg_dlg (dlg=<value optimized out>) at dlg_hash.c:163
#2  0xb7243f5f in destroy_dlg (dlg=0x979d04a0, cnt=1) at dlg_hash.c:225
#3  unref_dlg (dlg=0x979d04a0, cnt=1) at dlg_hash.c:742
#4  0xb7247a42 in dual_bye_event (dlg=0x979d04a0, req=<value optimized out>, 
extra_unref=1) at dlg_req_within.c:274
#5  0xb72491c0 in bye_reply_cb (t=0x976af43c, type=256, ps=0xbfddce54) at 
dlg_req_within.c:294
#6  0xb728e657 in run_trans_callbacks (type=256, trans=0x976af43c, req=0x0, 
rpl=0x82d241c, code=200) at t_hooks.c:212
#7  0xb72a6d93 in local_reply (t=0x976af43c, p_msg=0x82d241c, branch=0, 
msg_status=200, cancel_bitmap=0xbfddcf88) at t_reply.c:1358
#8  0xb72a82c2 in reply_received (p_msg=0x82d241c) at t_reply.c:1503
#9  0x0806c913 in forward_reply (msg=0x82d241c) at forward.c:568
#10 0x080a4d32 in receive_msg (
    buf=0x81d7480 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 
77.66.2.136;branch=z9hG4bK2aa.63be8597.0;received=77.66.2.136\r\nFrom: 
<sip:+642102908458@77.66.2.136>;tag=as2bde889d\r\nTo: 
<sip:01100165022863605@77.66.2.136:5060;user=p"..., len=428, rcv_info=<value 
optimized out>) at receive.c:203
#11 0x080fbd5e in udp_rcv_loop () at udp_server.c:419
#12 0x08074d28 in main_loop (argc=9, argv=0xbfddd2f4) at main.c:879
#13 main (argc=9, argv=0xbfddd2f4) at main.c:1497


BT on second core dump gives following,

Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid -m 512 
-u root -g root'.
Program terminated with signal 6, Aborted.
#0  0xb78ab424 in __kernel_vsyscall ()
(gdb) bt
#0  0xb78ab424 in __kernel_vsyscall ()
#1  0xb75c3751 in raise () from /lib/i686/cmov/libc.so.6
#2  0xb75c6b82 in abort () from /lib/i686/cmov/libc.so.6
#3  0x08070d64 in sig_alarm_abort (signo=14) at main.c:435
#4  <signal handler called>
#5  get_lock (ticks=0, param=0x0) at ../../evi/../fastlock.h:178
#6  dialog_update_db (ticks=0, param=0x0) at dlg_db_handler.c:972
#7  0xb72249e1 in mod_destroy () at dialog.c:755
#8  0x080ce108 in destroy_modules () at sr_module.c:371
#9  0x08070858 in cleanup (show_status=1) at main.c:344
#10 0x080714a9 in handle_sigs () at main.c:545
#11 0x08075245 in main_loop (argc=9, argv=0xbfddd2f4) at main.c:990
#12 main (argc=9, argv=0xbfddd2f4) at main.c:1497

The most disturbing part of these crashes for me is that it causes all running 
calls to hang up. Previously in case of OpenSIPs crash running calls use to 
remain unaffected.

I am using SVN trunk revision 8215, and have create_dialog("BPp") on very first 
INVITE in my dial plan. 

Feel free to ask for more info as needed.

Thank you.


----------------------------------------------------------------------

>Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2011-08-19 15:05

Message:
Hello again,

Any success in reproducing the scenario ?

Also, another thing that you can do, both on your production machine and
on your close, would be to enable memory debugging in your script. First of
all, you would have to edit Makefile.defs, uncomment the line with
    -DDBG_QM_MALLOC \
and comment the line with
    -DF_MALLOC \

and then recompile OpenSIPS. 

Last, you would have to edit your .cfg and add the following two lines :
    memlog=6
    memdump=1

This will not do excessive logging of any kind, like turning up the
general log level, but it would give us extra information about the memory
in case of the crash.

Thanks and Regards,
Vlad

----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-18 09:44

Message:
Since, my production machine is stable now so, i won't risk it to crash
again. However, i do have a production clone, with exact same configuration
as production. Let me put my team on it and see if they could reproduce it
there while debug logs and sip traces are enabled.

Thank you.



----------------------------------------------------------------------

Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2011-08-17 19:55

Message:
Removing the "PpB" parameters "solved" your problem because that's where
the bug originated. OpenSIPS encounters a double free while processing the
reply to a 200 Ok to a BYE generated internally by OpenSIPS. Which can be
the case either when pinging is enabled and one endpoint does not respond
in time, or when the overall dialog time limit is exceeded. 

I have failed to reproduce your scenario, so It would help a lot if you
could either provide a full debug log, or if that isn't possible, maybe a
full SIP trace for the call that makes OpenSIPS crash.

And yes, the load_balancer issue seems like a different issue. Please open
a separate bug report for this.

----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-17 10:14

Message:
I have removed all parameters from create_dialog function and system seems
bit stable now. Though i am still watching for crash.

However, this has created a new problem, that is, i am also using
load_balancer module which is relaying calls to my media servers array. Now
the problem is that whenever i execute lb_list command via any management
interface i get high number of calls on each media server, while checking
the media servers i get much less number of running calls. Which implies
something is wrong with load_balancer module stats. I wonder what will
happen if this wrong call count reaches to call capacity mentioned in
load_balancer db table?

Should i open a separate bug for this problem?

Thank you.



----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-15 20:37

Message:
couldn't found any thing matching "bogus ref" but do find a critical
warning matching "bogus event", just before opensips crash.

CRITICAL:dialog:log_next_state_dlg: bogus event 7 in state 2 for dlg
0x970b8874 [138:1657090481] with clid
'584da4f976b14ce053d3c9345a09391e@77.66.2.136' and tags 'as6fd68670'
'SDopvfa99-9302873325129137220'
CRITICAL:core:receive_fd: EOF on 37
NOTICE:presence:destroy: destroy module ...
CRITICAL:core:sig_alarm_abort: BUG - shutdown timeout triggered, dying...



----------------------------------------------------------------------

Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2011-08-15 18:26

Message:
Hello,

Could you please check the logs for critical messages saying "bogus ref"
on the calls that you experience the crashes ?

----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-13 16:48

Message:
It crashed again. This time with revision 8201. BT indicates same problem.
Here is BT full.


#0  0x97879f00 in ?? ()
No symbol table info available.
#1  0xb71ec703 in free_dlg_dlg (dlg=<value optimized out>) at
dlg_hash.c:163
        i = 0
        __FUNCTION__ = "free_dlg_dlg"
#2  0xb71f5a7f in destroy_dlg (dlg=0x9784ae04, cnt=1) at dlg_hash.c:225
        ret = <value optimized out>
#3  unref_dlg (dlg=0x9784ae04, cnt=1) at dlg_hash.c:742
        d_entry = 0x970a8c50
        __FUNCTION__ = "unref_dlg"
#4  0xb71f9562 in dual_bye_event (dlg=0x9784ae04, req=<value optimized
out>, extra_unref=1) at dlg_req_within.c:274
        old_state = 5
        new_state = 5
        unref = 1
        ret = <value optimized out>
        __FUNCTION__ = "dual_bye_event"
#5  0xb71face0 in bye_reply_cb (t=0x97760f68, type=256, ps=0xbfd4f9f4) at
dlg_req_within.c:294
        __FUNCTION__ = "bye_reply_cb"
#6  0xb7240657 in run_trans_callbacks (type=256, trans=0x97760f68,
req=0x0, rpl=0x82d23b4, code=200) at t_hooks.c:212
        params = {req = 0x0, rpl = 0x82d23b4, code = 200, param =
0x97451d2c, extra1 = 0x0, extra2 = 0x0}
        cbp = 0x97451d20
        backup = 0x81e7488
        trans_backup = 0x97760f68
        __FUNCTION__ = "run_trans_callbacks"
#7  0xb7258d93 in local_reply (t=0x97760f68, p_msg=0x82d23b4, branch=0,
msg_status=200, cancel_bitmap=0xbfd4fb28) at t_reply.c:1358
        local_store = 0
        local_winner = 0
        reply_status = RPS_COMPLETED
        winning_msg = 0x974278c0
        winning_code = 200
        totag_retr = 0
        __FUNCTION__ = "local_reply"
#8  0xb725a2c2 in reply_received (p_msg=0x82d23b4) at t_reply.c:1503
        last_uac_status = <value optimized out>
        branch = 0
        reply_status = <value optimized out>
        timer = <value optimized out>
        cancel_bitmap = 0
        t = 0x97760f68
        backup_list = 0x0
        has_reply_route = <value optimized out>
        __FUNCTION__ = "reply_received"
---Type <return> to continue, or q <return> to quit---
#9  0x0806c913 in forward_reply (msg=0x82d23b4) at forward.c:568
        new_buf = <value optimized out>
        to = <value optimized out>
        new_len = <value optimized out>
        mod = 0x822ed98
        proto = 136148112
        id = <value optimized out>
        send_sock = 0x97879eff
        len = 137175988
        __FUNCTION__ = "forward_reply"
#10 0x080a4d32 in receive_msg (
    buf=0x81d7480 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP
77.66.2.136;branch=z9hG4bK6a98.fe2e7586.0;received=77.66.2.136\r\nFrom:
<sip:+61430278886@77.66.2.136>;tag=as0939115b\r\nTo:
<sip:011008801715609898@77.66.2.136:5060;user="..., len=429,
rcv_info=<value optimized out>) at receive.c:203
        msg = 0x82d23b4
        start = {tv_sec = -1219149052, tv_usec = 137175912}
        __FUNCTION__ = "receive_msg"
#11 0x080fbd5e in udp_rcv_loop () at udp_server.c:419
        len = 429
        buf = "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP
77.66.2.136;branch=z9hG4bK6a98.fe2e7586.0;received=77.66.2.136\r\nFrom:
<sip:+61430278886@77.66.2.136>;tag=as0939115b\r\nTo:
<sip:011008801715609898@77.66.2.136:5060;user="...
        tmp = <value optimized out>
        from = 0x82d2368
        fromlen = 16
        ri = {src_ip = {af = 2, len = 4, u = {addrl = {2231517773,
3079082512, 3218406760, 1}, addr32 = {2231517773, 3079082512, 3218406760,
1}, addr16 = {16973,
                34050, 4624, 46983, 64872, 49108, 1, 0}, addr =
"MB\002\205\020\022\207\267h\375Ô¿\001\000\000"}}, dst_ip = {af = 2, len
= 4, u = {addrl = {
                2281849421, 0, 0, 0}, addr32 = {2281849421, 0, 0, 0},
addr16 = {16973, 34818, 0, 0, 0, 0, 0, 0}, addr = "MB\002\210", '\000'
<repeats 11 times>}},
          src_port = 5060, dst_port = 5060, proto = 1, proto_reserved1 =
0, proto_reserved2 = 0, src_su = {s = {sa_family = 2,
              sa_data = "\023\304MB\002\205\000\000\000\000\000\000\000"},
sin = {sin_family = 2, sin_port = 50195, sin_addr = {s_addr = 2231517773},
              sin_zero = "\000\000\000\000\000\000\000"}, sin6 =
{sin6_family = 2, sin6_port = 50195, sin6_flowinfo = 2231517773, sin6_addr
= {__in6_u = {
                  __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 =
{0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id =
0}},
          bind_address = 0x822dae0}
        p = <value optimized out>
        __FUNCTION__ = "udp_rcv_loop"
#12 0x08074d28 in main_loop (argc=9, argv=0xbfd4fe94) at main.c:879
        i = <value optimized out>
        pid = <value optimized out>
        si = <value optimized out>
        startup_done = 0x0
        chd_rank = 4
#13 main (argc=9, argv=0xbfd4fe94) at main.c:1497
        cfg_log_stderr = 136501984
        cfg_stream = 0x3
---Type <return> to continue, or q <return> to quit---
        c = <value optimized out>
        r = <value optimized out>
        tmp = 0xbfd51970 ""
        tmp_len = <value optimized out>
        port = <value optimized out>
        proto = <value optimized out>
        ret = <value optimized out>
        seed = 734867912
        rfd = <value optimized out>
        __FUNCTION__ = "main"


----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-12 13:54

Message:
well, as far as i remember 8201 never crashed on its own, however, i did
observed crash a couple of time while trying to stop the service (for
maintenance reasons), but ignored them. The back trace of those crashes
showed mysql update failed while sync dialog run time info to db. I didn't
saved those core dumps to can't give much details.

Anyways, my production server is now on 8201 and let see if it crashes, i
will send you its BT.

----------------------------------------------------------------------

Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2011-08-12 13:37

Message:
Hello again,

This second backtrace seems somehow corrupt. 

You are saying the 8201 is ok, no crashes as the reported ones ? While
8215 shows these problems ?

----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-12 12:35

Message:
i updated the trunk to revision 8242, and now it crashes due to some memory
leak, may be its a different bug. Anyway, here is BT.

Core was generated by `/usr/local/sbin/opensips -P /var/run/opensips.pid
-m 512 -u root -g root'.
Program terminated with signal 6, Aborted.
#0  0xb7706424 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7706424 in __kernel_vsyscall ()
#1  0xb741e751 in raise () from /lib/i686/cmov/libc.so.6
#2  0xb7421b82 in abort () from /lib/i686/cmov/libc.so.6
#3  0x0810344e in qm_insert_free (qm=0xb70b0337, p=0x970dac6c,
file=0xb70b0337 "L:dialog:%s: extract_ftc_hdrs ok but no from extracted :
[%.*s]\n",
    func=0xb70b0dbb "g:%s: no more shm\n", line=171) at
mem/q_malloc.c:169
#4  qm_free (qm=0xb70b0337, p=0x970dac6c, file=0xb70b0337 "L:dialog:%s:
extract_ftc_hdrs ok but no from extracted : [%.*s]\n", func=0xb70b0dbb
"g:%s: no more shm\n",
    line=171) at mem/q_malloc.c:491
#5  0xb7095cf6 in dlg_onreply (dialog=0x800) at dlg_handlers.c:477
#6  unreference_dialog_create (dialog=0x800) at dlg_handlers.c:776
#7  0xb709ef5f in unref_dlg (dlg=0x970da634, cnt=1) at dlg_hash.c:739
#8  0xb70a2a42 in new_dlg_profile (profiles=0x1 <Address 0x1 out of
bounds>, has_value=3070347656) at dlg_profile.c:278
#9  add_profile_definitions (profiles=0x1 <Address 0x1 out of bounds>,
has_value=3070347656) at dlg_profile.c:168
#10 0xb70a41c0 in send_leg_msg (dlg=0x979541a8, method=0x100,
src_leg=-1078038796, dst_leg=0, hdrs=0x97369734, body=0x979ba664,
func=0x81a9e68 <debug>,
    param=0x972ac048, release=0) at dlg_req_within.c:494
#11 0xb70e9657 in run_trans_callbacks (type=256, trans=0x979541a8,
req=0x0, rpl=0xffffffff, code=408) at t_hooks.c:212
#12 0xb7101d93 in local_reply (t=0x979541a8, p_msg=0xffffffff, branch=0,
msg_status=408, cancel_bitmap=0xbfbe702c) at t_reply.c:1358
#13 0xb70ec0a3 in fake_reply (ticks=70, attr=0x0) at timer.c:253
#14 final_response_handler (ticks=70, attr=0x0) at timer.c:364
#15 timer_routine (ticks=70, attr=0x0) at timer.c:1005
#16 0x080e42fa in utimer_ticker () at timer.c:369
#17 run_timer_process () at timer.c:407
#18 start_timer_processes () at timer.c:522
#19 0x080751bd in main_loop (argc=9, argv=0xbfbe71e4) at main.c:936
#20 main (argc=9, argv=0xbfbe71e4) at main.c:1497

It has crashes 3 times in last six hours, so i am think of reverting it to
last stable revision 8201. :(


----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-12 12:23

Message:
I am afraid i won't be able to do that since crashes are random and the
server where its happening have a lot of production traffic (1500+ sip
registrations, 300+ calls etc.) on it.

I do have the core dump though, but their sizes are around 500+MB.


----------------------------------------------------------------------

Comment By: Vladut-Stefan Paiu (vladut-paiu)
Date: 2011-08-12 11:27

Message:
Hello Muhammad,

Is it possible that you also attach an OpenSIPS log in full debug for a
call where this crash happens ? I have an idea about the crash, but I would
need a log in debug=4 to be sure.

Thanks and Regards,
Vlad

----------------------------------------------------------------------

Comment By: Muhammad Shahzad (shari_786pk)
Date: 2011-08-12 02:50

Message:
Another crash, this time i was watching opensips logs and got this
additional info,


Aug 11 23:46:05 nghw-2674 osip-service[1803]: CRITICAL:core:qm_free:
freeing already freed pointer, first free: dlg_hash.c: free_dlg_dlg(171) -
aborting 
Aug 11 23:46:21 nghw-2674 osip-service[1838]: CRITICAL:core:receive_fd:
EOF on 37 
Aug 11 23:46:21 nghw-2674 osip-service[1779]: NOTICE:presence:destroy:
destroy module ... 



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=1086410&aid=3390207&group_id=232389

_______________________________________________
Devel mailing list
Devel@lists.opensips.org
http://lists.opensips.org/cgi-bin/mailman/listinfo/devel

Reply via email to