Hi all,

For the sake of completeness, here is the commit that fixes the issue:
https://github.com/OpenSIPS/opensips/commit/058cc22cb55dce9b890308b9f83a42a88691f2c8

Thank you Yuval for the report and for investigating this!

Best regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 07/12/2018 04:07 PM, Yuval Dinari via Users wrote:
Hi,
I am hitting a case in which opensips gets into an unrecoverable bad state: some of the tcp child processes are stuck waiting to acquire a lock which they never get.
The issue occurs in the following load test scenario:

 1. About 25K clients register over TCP (it also happens with fewer)
 2. All the TCP connections become unresponsive (by blocking outgoing
    traffic on the test clients' machine)
 3. INVITEs are sent for each of those clients, putting their
    connections in retransmit mode
 4. After a few minutes opensips gets into a bad state - some tcp
    children run at 90-100% cpu, no traffic is being sent from the
    machine (including OPTIONS pings)
 5. After all the tcp connections die due to timeouts, opensips does
    not recover and the symptoms above persist
 6. After all the registered users are removed from the internal table
    there is still no change

When attaching a debugger to the problematic processes (the ones with high cpu usage), we see that they are all stuck trying to get a lock which they never seem to get. Stack traces:

#0 0x00007fd6b72d1bb7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000549e65 in get_lock (lock=<optimized out>) at net/proto_tcp/../../net/../fastlock.h:221
#2 _tcp_write_on_socket (len=<optimized out>, buf=<optimized out>, fd=<optimized out>, c=<optimized out>) at net/proto_tcp/proto_tcp.c:724
#3 proto_tcp_send (send_sock=0x7ffd8e12c140, buf=0x0, len=399, to=0x7fd5c7ccdcc0, id=1) at net/proto_tcp/proto_tcp.c:922
#4 0x00007fd5a5cb7b30 in msg_send (msg=<optimized out>, len=<optimized out>, buf=<optimized out>, id=<optimized out>, to=<optimized out>, proto=<optimized out>, send_sock=0x7fd6a7208168) at ../../forward.h:123
#5 send_pr_buffer (rb=0x7fd5c7ccdca0, buf=0x7fd6a76b4a50, len=0, ctx=0xffffffffffffffff) at t_funcs.c:66

And:

#0 0x00007fd6b72d1bb7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1 0x00000000005349b8 in get_lock (lock=<optimized out>) at net/../fastlock.h:221
#2 handle_io (event_type=<optimized out>, idx=<optimized out>, fm=<optimized out>) at net/net_tcp_proc.c:210
#3 io_wait_loop_epoll (repeat=287, t=<optimized out>, h=<optimized out>) at net/../io_wait_loop.h:280

These traces look the same every time we attach.
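
To illustrate why a lock that is never released shows up as 90-100% cpu rather than as sleeping processes, here is a minimal sketch of a spin lock that yields between attempts, which is the pattern the top two frames (sched_yield called from get_lock) suggest. This is not the actual fastlock.h code: the demo_* names and the max_spins bound are hypothetical and exist only so the example terminates. A lock of this style without the bound would simply loop forever.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type for illustration; not the OpenSIPS implementation. */
typedef struct {
    atomic_flag held;
} demo_lock_t;

/*
 * Try to take the lock, yielding the CPU after each failed attempt.
 * Returns the number of failed attempts before success, or -1 once
 * max_spins failed attempts have been made. The bound is an assumption
 * added for the demo; an unbounded version never returns if the holder
 * never releases, and a debugger would keep finding it in sched_yield().
 */
static long demo_get_lock(demo_lock_t *l, long max_spins)
{
    long spins = 0;

    assert(l != NULL && max_spins > 0);
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        if (++spins >= max_spins)
            return -1;      /* give up: the holder never released */
        sched_yield();      /* let other processes run, then retry */
    }
    return spins;
}

/* Release the lock so that the next demo_get_lock() can succeed. */
static void demo_release_lock(demo_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

With a lock initialised as `demo_lock_t lock = { ATOMIC_FLAG_INIT };`, the first demo_get_lock(&lock, 1000) returns 0, a second call without a release in between returns -1 after spinning, and a call made after demo_release_lock(&lock) returns 0 again.
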
The machine opensips runs on has 4 cpus.
Thanks





_______________________________________________
Users mailing list
Users@lists.opensips.org
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
