Hi,

I have a scenario in which opensips gets into an unrecoverable bad state: some of the TCP child processes are stuck waiting to acquire a lock which they never get. The issue occurs in the following load test scenario:

1. About 25K clients register over TCP (it also happens with fewer clients)
2. All the TCP connections become unresponsive (by blocking outgoing traffic on the test clients' machine)
3. INVITEs are sent for each of those clients, putting their connections in retransmit mode
4. After a few minutes opensips gets into a bad state: some TCP children run at 90-100% CPU, and no traffic is sent from the machine (including OPTIONS pings)
5. After all the TCP connections die due to timeouts, opensips does not recover; the symptoms above persist
6. After all the registered users are removed from the internal table, there is still no change

When attaching a debugger to the problematic processes (the ones with high CPU usage), we see that they are all stuck trying to get a lock which they never seem to get.

Stack traces:

#0 0x00007fd6b72d1bb7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000549e65 in get_lock (lock=<optimized out>) at net/proto_tcp/../../net/../fastlock.h:221
#2 _tcp_write_on_socket (len=<optimized out>, buf=<optimized out>, fd=<optimized out>, c=<optimized out>) at net/proto_tcp/proto_tcp.c:724
#3 proto_tcp_send (send_sock=0x7ffd8e12c140, buf=0x0, len=399, to=0x7fd5c7ccdcc0, id=1) at net/proto_tcp/proto_tcp.c:922
#4 0x00007fd5a5cb7b30 in msg_send (msg=<optimized out>, len=<optimized out>, buf=<optimized out>, id=<optimized out>, to=<optimized out>, proto=<optimized out>, send_sock=0x7fd6a7208168) at ../../forward.h:123
#5 send_pr_buffer (rb=0x7fd5c7ccdca0, buf=0x7fd6a76b4a50, len=0, ctx=0xffffffffffffffff) at t_funcs.c:66

And:

#0 0x00007fd6b72d1bb7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1 0x00000000005349b8 in get_lock (lock=<optimized out>) at net/../fastlock.h:221
#2 handle_io (event_type=<optimized out>, idx=<optimized out>, fm=<optimized out>) at net/net_tcp_proc.c:210
#3 io_wait_loop_epoll (repeat=287, t=<optimized out>, h=<optimized out>) at net/../io_wait_loop.h:280

These traces look the same every time we attach.
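
To make the symptom concrete, here is how I read the get_lock / sched_yield frames: the waiter seems to retry the lock in a loop and yield the CPU between attempts, so if the holder never releases, the waiter keeps spinning and shows up as high CPU. Below is a minimal sketch of that pattern. It is only an illustration under that assumption, not the actual code from fastlock.h; the type and function names (sketch_lock_t, sketch_try_lock, sketch_lock_bounded, sketch_unlock) are hypothetical, and the spin limit is added here only so the loop can terminate.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; NOT the OpenSIPS lock type. */
typedef struct {
    atomic_flag held;
} sketch_lock_t;

/* One acquisition attempt; returns true if the lock was taken. */
static bool sketch_try_lock(sketch_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Release the lock so that a waiter can take it. */
static void sketch_unlock(sketch_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Spin-and-yield acquire with an upper bound on retries.
 * An unbounded version of this loop never returns if the holder
 * never calls sketch_unlock(), which is what the traces look like.
 * Returns true if acquired, false if max_spins retries were used up.
 */
static bool sketch_lock_bounded(sketch_lock_t *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i <= max_spins; i++) {
        if (sketch_try_lock(l))
            return true;
        sched_yield();   /* give up the CPU, then try again */
    }
    return false;
}
```

With a lock like this, a second sketch_lock_bounded() call on an already held lock returns false after the retries are exhausted, while an unbounded loop would stay in sched_yield() forever, as in frame #0 above.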

The machine opensips runs on has 4 CPUs.

Thanks