Hello again,

I'm sorry for spamming the mailing list, but we are continuing our testing and have come across new crashes. This time we took an all-in-one VM and limited it to 1 CPU core (to avoid Bono crashes). We found that the crashes begin when we hit ~15k open live sockets. We expected that Clearwater IMS would stop processing new registration/call requests but preserve ongoing calls. Instead, as we saw during the sipp test, most connections are forcefully closed and a dump is generated. I am adding some more dumps (Homer this time) and SIP logs. Could you please tell me whether this kind of behavior during testing is normal and expected?
Dumps: https://www.dropbox.com/sh/bl5ghgwrpum6pq9/AAA_UQV7v9NfG3y8q0lOP0Xra?dl=0

BR,
Stanislav Khalup

From: Stanislav Khalup
Sent: Monday, September 5, 2016 5:29 PM
To: 'Richard Whitehouse' <richard.whiteho...@metaswitch.com>; 'clearwater@lists.projectclearwater.org' <clearwater@lists.projectclearwater.org>
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Hello all,

We are continuing our stress tests to understand what the crashes depend on, but there seems to be no clear dependency. This time we installed the latest all-in-one node and limited it to 1 CPU. Then we ran many sipp stress tests. We believed that in this case we would get application crashes only when the load reached some considerable level, but the crashes seem to be unrelated, or only poorly related, to the actual number of ongoing registration attempts/calls. The latest dumps are placed here: https://www.dropbox.com/sh/ckjr8oi3rll5y78/AAArJDu7WItDxJjipdDMx7TVa?dl=0

Are recurring crashes of this kind expected from an all-in-one node?

BR,
Stanislav Khalup

From: Stanislav Khalup
Sent: Friday, September 2, 2016 7:21 PM
To: 'Richard Whitehouse' <richard.whiteho...@metaswitch.com>; clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Richard,

Thank you very much for your response. Let me add some details. Initially we tested all components in VMs with 2 vCPUs each, but after reading the list we changed the bono config to 1 vCPU; for sprout we left 2 vCPUs. For bono the crashes were somehow resolved, but for sprout the situation is the same. We experience the same kind of crashes with the all-in-one image (with 8 vCPUs).
As for SNMP statistics, I don't know whether this is related or not, but we couldn't get bono/sprout functional statistics such as "The number of incoming requests, indexed by time period" or "The number of requests rejected due to overload, indexed by time period"; those metrics were always zero. I've added a pair of chronos dumps to the Dropbox folder. Maybe they can shed some more light on the problem: https://www.dropbox.com/sh/qjdja9eowgvo1zc/AADm25_pwKNs3gWwBmb0Pzhpa?dl=0

BR,
Stanislav Khalup

From: Richard Whitehouse [mailto:richard.whiteho...@metaswitch.com]
Sent: Friday, September 2, 2016 5:19 PM
To: clearwater@lists.projectclearwater.org; Stanislav Khalup <skha...@virtuozzo.com>
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Stanislav,

I've taken a look at the Sprout crash. It looks like you are hitting a crash in the Net-SNMP library we use for alarms and statistics. I've raised an issue to track this - https://github.com/Metaswitch/sprout/issues/1527

We've seen similar looking stacks for Bono before on multi-core VMs - e.g. under http://lists.projectclearwater.org/pipermail/clearwater_lists.projectclearwater.org/2015-January/001986.html

Historically we've scaled up Sprout and Bono by running many single or dual-core instances rather than running fewer larger instances - this is because

- we've seen virtualization environments impose per-VM limits on TCP connection counts, and obviously Bono has large numbers of TCP connections in a real-world scenario
- we only support a single transport thread and, since Bono performs relatively little processing per message, and Sprout needs to perform some processing per message, it is this that ends up being the bottleneck quite quickly.

Generally we've run single core Bono nodes, and dual core Sprout nodes.
Having said that, we should look into why it's crashing when you're running more cores. Can you give us some description of the scenario you are running when you see this?

You might also find it useful to subscribe to the mailing list so that you receive updates when we push out updated releases.

Thanks,
Richard

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On Behalf Of Stanislav Khalup
Sent: 01 September 2016 10:49
To: clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: [Project Clearwater] Sprout/Bono/Chronos crashes under stress test

Hello all,

We've been trying to perform IMS stress testing for some time now, but it seems that we are really unlucky. When we run a SIP test we experience constant bono/sprout crashes, which affect the results of our performance evaluation. We do know that our deployment is generally working (we managed to make calls and run tests). At first we manually deployed an IMS cluster, but after the crashes we decided to try the all-in-one VM; we still experience sprout crashes (bono crashes are mostly fixed after setting 1CPU/1Worker). Could you please look at the dumps: https://www.dropbox.com/sh/qjdja9eowgvo1zc/AADm25_pwKNs3gWwBmb0Pzhpa?dl=0 because for now we have no clue about what is happening.

BR,
Stanislav
_______________________________________________
Clearwater mailing list
Clearwater@lists.projectclearwater.org
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org