Chronos crashes under stress test

Richard Whitehouse (projectclearwater.org) Tue, 06 Sep 2016 09:33:11 -0700

Stanislav,

Sorry to hear you are having these problems. We run stress regularly against 
Project Clearwater deployments, and we don't see such problems, so it's down to 
a difference in your setup.


We wouldn't ever expect to run stress against an all-in-one node - it's not 
designed for any particular capacity; it's designed to be used for manually 
trying out Clearwater performance, and for trying it out initially.

Instead, as per the Stress Testing instructions, we'd expect stress to be run 
against a deployment done either using Chef or using a Manual Install, with 
Clearwater deployed on at least six separate boxes, with Sprout, Homer, 
Homestead, Ralf, Bono and Ellis all installed on separate servers, and with 
separate servers for the SIP Stress node. We'd expect all of the VMs to have 
around 1 vCPU and 2 GB of RAM, as documented in the Manual Install 
instructions. If greater performance is required, we'd expect this to be 
achieved by increasing the number of deployed VMs, based on where the system 
was stressed, rather than increasing the resources assigned to each VM.
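
As a rough illustration of sizing by adding VMs rather than enlarging them, 
the arithmetic can be sketched as below. This is only a sketch: the function 
name and all load and capacity figures are hypothetical and are not measured 
Clearwater numbers.

```python
import math

def vms_needed(offered_load, per_vm_capacity):
    """Smallest number of identical VMs whose combined capacity covers the load."""
    if per_vm_capacity <= 0:
        raise ValueError("per_vm_capacity must be positive")
    return max(1, math.ceil(offered_load / per_vm_capacity))

# Hypothetical figures, purely for illustration (requests per second).
offered = {"bono": 900, "sprout": 2500}    # load seen at each stressed component
capacity = {"bono": 500, "sprout": 1000}   # what one small VM of that type sustains

plan = {name: vms_needed(offered[name], capacity[name]) for name in offered}
print(plan)  # {'bono': 2, 'sprout': 3}
```

Each component is sized independently, so only the node types that were 
actually stressed gain extra VMs.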

The latest dumps you have sent don't represent crashes, they represent 
processes being killed because they become unresponsive to the polling 
mechanism, due to the amount of load on the VM, and the process being unable to 
serve the monitoring request in time.
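
To make the distinction concrete, here is a generic sketch of a poll-based 
watchdog. It is not Clearwater's actual monitoring code; the function names, 
timeout and retry count are assumptions chosen for illustration. It shows how 
a healthy but overloaded process can end up killed purely because it answers 
its health check too slowly.

```python
import concurrent.futures
import time

def responds_in_time(probe, timeout_s):
    """Return True if probe() completes without error within timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        return False          # too slow: counts as unresponsive
    except Exception:
        return False          # probe failed outright
    finally:
        pool.shutdown(wait=False)  # do not block on a slow probe

def poll_and_maybe_kill(probe, kill, timeout_s=0.1, attempts=3):
    """Poll up to `attempts` times; call kill() only if every poll fails."""
    for _ in range(attempts):
        if responds_in_time(probe, timeout_s):
            return "alive"
    kill()                    # hypothetical hook, e.g. signal the process
    return "killed"
```

With a probe that takes longer than `timeout_s` on every attempt, the result 
is "killed" even though the probed code never crashed; it was simply too busy 
to answer in time.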

Can you provide some background on what you are trying to find out from the 
testing you are doing?


Thanks,

Richard

From: Stanislav Khalup [mailto:skha...@virtuozzo.com]
Sent: 06 September 2016 16:14
To: Richard Whitehouse <richard.whiteho...@metaswitch.com>; 
clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Hello again,

I'm sorry for spamming the mailing list, but we are continuing our testing and 
have come across new crashes. This time we took an all-in-one VM and limited 
its CPU to 1 core (to avoid Bono crashes). We found that crashes begin when we 
hit ~15k open live sockets. We expected that Clearwater IMS would stop 
processing new registration/call requests but preserve ongoing calls; instead, 
as we saw during the sipp test, most connections are forcefully closed and a 
dump is generated. I am adding some more dumps (Homer this time) and SIP logs. 
Could you please tell me whether this kind of behavior during testing is 
normal and expected?

Dumps: https://www.dropbox.com/sh/bl5ghgwrpum6pq9/AAA_UQV7v9NfG3y8q0lOP0Xra?dl=0

BR,
Stanislav Khalup

From: Stanislav Khalup
Sent: Monday, September 5, 2016 5:29 PM
To: 'Richard Whitehouse' <richard.whiteho...@metaswitch.com>; 
clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Hello all,

We are continuing our stress tests to understand what the crashes depend on, 
but there seems to be no clear pattern. This time we installed the latest 
all-in-one node and limited it to 1 CPU. Then we ran many sipp stress tests. 
We expected that in this case we would get application crashes only when the 
load reached some considerable level, but the crashes seem to be unrelated, or 
only weakly related, to the actual number of ongoing registration 
attempts/calls. The latest dumps are placed here: 
https://www.dropbox.com/sh/ckjr8oi3rll5y78/AAArJDu7WItDxJjipdDMx7TVa?dl=0 
Is this kind of recurring crash expected from an all-in-one node?

BR,
Stanislav Khalup

From: Stanislav Khalup
Sent: Friday, September 2, 2016 7:21 PM
To: 'Richard Whitehouse' <richard.whiteho...@metaswitch.com>; 
clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Richard,

Thank you very much for your response. Let me add some details. Initially we 
tested all components in VMs with 2 vCPUs each, but after reading the list we 
changed the Bono config to 1 vCPU; for Sprout we left 2 vCPUs. For Bono the 
crashes were somehow resolved, but for Sprout the situation is the same. We 
experience the same kind of crashes with the all-in-one image (with 8 vCPUs).
As for SNMP statistics, I don't know whether this is related, but we couldn't 
get Bono/Sprout functional statistics such as "The number of incoming 
requests, indexed by time period" or "The number of requests rejected due to 
overload, indexed by time period" - those metrics were always zero.
I've added a pair of Chronos dumps to the Dropbox folder. Maybe they can shed 
some more light on the problem: 
https://www.dropbox.com/sh/qjdja9eowgvo1zc/AADm25_pwKNs3gWwBmb0Pzhpa?dl=0

BR,
Stanislav Khalup

From: Richard Whitehouse [mailto:richard.whiteho...@metaswitch.com]
Sent: Friday, September 2, 2016 5:19 PM
To: clearwater@lists.projectclearwater.org; 
Stanislav Khalup <skha...@virtuozzo.com>
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: RE: Sprout/Bono/Chronos crashes under stress test

Stanislav,

I've taken a look at the Sprout crash. It looks like you are hitting a crash 
in the Net-SNMP library we use for alarms and statistics. I've raised an 
issue to track this - https://github.com/Metaswitch/sprout/issues/1527

We've seen similar looking stacks for Bono before on multi-core VMs - e.g. 
under 
http://lists.projectclearwater.org/pipermail/clearwater_lists.projectclearwater.org/2015-January/001986.html

Historically we've scaled up Sprout and Bono by running many single- or 
dual-core instances rather than running fewer larger instances. This is because:
- we've seen virtualization environments impose per-VM limits on TCP connection 
counts, and obviously Bono has large numbers of TCP connections in a real-world 
scenario
- we only support a single transport thread and, since Bono performs relatively 
little processing per message, and Sprout needs to perform some processing per 
message, it is this thread that ends up being the bottleneck quite quickly.

Generally we've run single core Bono nodes, and dual core Sprout nodes.

Having said that, we should look into why it's crashing when you're running 
more cores.

Can you give us some description of the scenario you are running under when you 
see this?

You might also find it useful to subscribe to the mailing list so that you 
receive updates when we push out updated releases.

Thanks,

Richard

From: Clearwater [mailto:clearwater-boun...@lists.projectclearwater.org] On 
Behalf Of Stanislav Khalup
Sent: 01 September 2016 10:49
To: clearwater@lists.projectclearwater.org
Cc: Denis Plotnikov <dplotni...@virtuozzo.com>
Subject: [Project Clearwater] Sprout/Bono/Chronos crashes under stress test

Hello all,

We've been trying to perform IMS stress testing for some time now, but it 
seems that we are really unlucky. When we run a SIP test we experience 
constant Bono/Sprout crashes, which affect the results of our performance 
evaluation. We do know that our deployment generally works (we managed to 
make calls and run tests). At first we manually deployed an IMS cluster, but 
after the crashes we decided to try an all-in-one VM, and we still experience 
Sprout crashes (Bono crashes are mostly fixed after setting 1 CPU/1 worker). 
Could you please look at the dumps: 
https://www.dropbox.com/sh/qjdja9eowgvo1zc/AADm25_pwKNs3gWwBmb0Pzhpa?dl=0 
because for now we have no clue what is happening.

BR,
Stanislav

_______________________________________________
Clearwater mailing list
Clearwater@lists.projectclearwater.org
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
