Stanislav,

Sorry to hear you are having these problems. We run stress regularly against Project Clearwater deployments, and we don't see such problems, so it's down to a difference in your setup.
We wouldn't ever expect to run stress against an all-in-one node. It's not designed for any particular capacity; it's designed for trying Clearwater out manually when you first start with it. Instead, as per the Stress Testing instructions, we'd expect stress to be run against a deployment done either using Chef or using a Manual Install, with Clearwater deployed on at least six separate boxes (Sprout, Homer, Homestead, Ralf, Bono and Ellis all installed on separate servers), with separate servers for the SIP stress node. We'd expect all of the VMs to have around 1 vCPU and 2 GB of RAM, as documented in the Manual Install instructions. If greater performance is required, we'd expect this to be achieved by increasing the number of deployed VMs, based on where the system was stressed, rather than by increasing the resources assigned to each VM.

The latest dumps you have sent don't represent crashes. They represent processes being killed because they became unresponsive to the polling mechanism: given the amount of load on the VM, the process was unable to serve the monitoring request in time.

Can you provide some background on what you are trying to find out from the testing you are doing?

Thanks,
Richard

From: Stanislav Khalup [mailto:skha...@virtuozzo.com]
Sent: 06 September 2016 16:14
To: Richard Whitehouse <richard.whiteho...@metaswitch.com>; clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Hello again,

I'm sorry for spamming the mailing list, but we are continuing our testing and have come across new crashes. This time we took an all-in-one VM and limited the CPU to 1 core (to avoid Bono crashes). We found that crashes begin when we hit ~15k open live sockets.
The thing is, we expected that Clearwater IMS would stop processing registration/call requests but preserve ongoing calls. Instead, as we saw during the sipp test, most connections are forcefully closed and a dump is generated. I am adding some more dumps (Homer this time) and SIP logs. Could you please tell me whether this kind of behavior during testing is normal and expected?

Dumps: https://www.dropbox.com/sh/bl5ghgwrpum6pq9/AAA_UQV7v9NfG3y8q0lOP0Xra?dl=0

BR,
Stanislav Khalup

From: Stanislav Khalup
Sent: Monday, September 5, 2016 5:29 PM
To: 'Richard Whitehouse' <richard.whiteho...@metaswitch.com>; 'clearwater@lists.projectclearwater.org' <clearwater@lists.projectclearwater.org>
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Hello all,

We are continuing our stress tests to understand what the crashes depend on, but there seems to be no clear dependency. This time we installed the latest all-in-one node and limited it to 1 CPU. Then we ran many sipp stress tests. We expected that in this case we would get application crashes only when the load reached some considerable level, but it seems that the crashes and the actual number of ongoing registration attempts/calls are unrelated or only poorly related. The latest dumps are placed here: https://www.dropbox.com/sh/ckjr8oi3rll5y78/AAArJDu7WItDxJjipdDMx7TVa?dl=0

Are recurring crashes of this kind expected from an all-in-one node?
BR,
Stanislav Khalup

From: Stanislav Khalup
Sent: Friday, September 2, 2016 7:21 PM
To: 'Richard Whitehouse' <richard.whiteho...@metaswitch.com>; clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Richard,

Thank you very much for your response. Let me add some details. Initially we tested all components in VMs with 2 vCPUs each, but after reading the list we changed the Bono config to 1 vCPU; for Sprout we left 2 vCPUs. For Bono the crashes were somehow resolved, but for Sprout the situation is the same. We experience the same kind of crashes with the all-in-one image (with 8 vCPUs).

As for SNMP statistics, I don't know whether this is related or not, but we couldn't get Bono/Sprout functional statistics such as "The number of incoming requests, indexed by time period" or "The number of requests rejected due to overload, indexed by time period"; those metrics were always zero.

I've added a pair of Chronos dumps to the Dropbox folder. Maybe they can shed some more light on the problem: https://www.dropbox.com/sh/qjdja9eowgvo1zc/AADm25_pwKNs3gWwBmb0Pzhpa?dl=0

BR,
Stanislav Khalup

From: Richard Whitehouse [mailto:richard.whiteho...@metaswitch.com]
Sent: Friday, September 2, 2016 5:19 PM
To: clearwater@lists.projectclearwater.org; Stanislav Khalup <skha...@virtuozzo.com>
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Stanislav,

I've taken a look at the Sprout crash. It looks like you are hitting a crash in the Net-SNMP library we use for alarms and statistics.
I've raised an issue to track this - https://github.com/Metaswitch/sprout/issues/1527

We've seen similar-looking stacks for Bono before on multi-core VMs - e.g. under http://lists.projectclearwater.org/pipermail/clearwater_lists.projectclearwater.org/2015-January/001986.html

Historically we've scaled up Sprout and Bono by running many single or dual-core instances rather than running fewer larger instances. This is because:
- we've seen virtualization environments impose per-VM limits on TCP connection counts, and obviously Bono has large numbers of TCP connections in a real-world scenario
- we only support a single transport thread and, since Bono performs relatively little processing per message, and Sprout needs to perform some processing per message, it is this that ends up being the bottleneck quite quickly.

Generally we've run single-core Bono nodes and dual-core Sprout nodes.

Having said that, we should look into why it's crashing when you're running more cores. Can you give us some description of the scenario you are running when you see this?

You might also find it useful to subscribe to the mailing list so that you receive updates when we push out updated releases.

Thanks,
Richard

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On Behalf Of Stanislav Khalup
Sent: 01 September 2016 10:49
To: clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: [Project Clearwater] Sprout/Bono/Chronos crashes under stress test

Hello all,

We've been trying to perform IMS stress testing for some time now, but it seems that we are really unlucky. When we run SIP tests we experience constant Bono/Sprout crashes, which affect the results of our performance evaluation. The thing is, we do know that our deployment is generally working (we managed to make calls and run tests).
At first we manually deployed an IMS cluster, but after the crashes we decided to try an all-in-one VM. We still experience Sprout crashes (Bono crashes are mostly fixed after setting 1 CPU/1 worker). Could you please look at the dumps: https://www.dropbox.com/sh/qjdja9eowgvo1zc/AADm25_pwKNs3gWwBmb0Pzhpa?dl=0 because for now we have no clue what is happening.

BR,
Stanislav
_______________________________________________ Clearwater mailing list Clearwater@lists.projectclearwater.org http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
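
The most recent reply in this thread says the dumps come from processes being killed after failing to answer a monitoring poll in time, not from crashes. The thread does not describe the actual monitoring mechanism or its thresholds, so the following is only a rough sketch of that general idea in Python. The function name `first_restart`, the 2-second deadline, and the latency figures are all hypothetical and are not taken from Clearwater.

```python
from typing import Optional, Sequence


def first_restart(poll_latencies_s: Sequence[float],
                  timeout_s: float,
                  max_consecutive_misses: int = 1) -> Optional[int]:
    """Return the index of the poll that would trigger a kill, or None.

    A poll counts as missed when the process needs longer than timeout_s
    to answer it. The process is killed once max_consecutive_misses polls
    in a row have been missed. All thresholds here are illustrative.
    """
    misses = 0
    for i, latency in enumerate(poll_latencies_s):
        if latency > timeout_s:
            misses += 1
            if misses >= max_consecutive_misses:
                return i
        else:
            # An answered poll resets the run of misses.
            misses = 0
    return None


# Hypothetical poll response times in seconds.
# Lightly loaded process: every poll is answered well inside the deadline.
idle = [0.05, 0.08, 0.06, 0.07]
# Overloaded process: answers get slower as load on the VM builds up.
overloaded = [0.05, 0.4, 1.5, 3.0, 6.0]

print(first_restart(idle, timeout_s=2.0))        # None
print(first_restart(overloaded, timeout_s=2.0))  # 3
```

In this toy model the overloaded process is killed at the fourth poll even though it has not faulted. That matches the distinction the reply draws between a crash and a kill caused by load on the VM.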