Re: [openstack-dev] [Tempest][Production] Tempest / the gate / real world load
I'm afraid I missed this topic the first time around, and I think it bears revisiting.

tl;dr: I think we should consider ensuring gate stability in the face of resource-starved services through some combination of more intelligent test design and better handling of resource starvation (for example, rate-limiting). Stress-testing would be more effective if it were explicitly focused on real-world usage scenarios and run separately from the gate. I think stress-testing is about the 'when' of failure, whereas the gate is about the 'if'.

I don't think it can be argued that OpenStack services (especially Neutron) can't do better to ensure reliability under load. Running things in parallel in the gate shone a bright light on many problem areas, and that was inarguably a good thing. Now that we have a better sense of the problem, though, it may be time to think about evolving our approach.

From the perspective of gating commits, I think it makes sense to (a) minimize gate execution time and (b) provide some guarantees of reliability under reasonable load. I don't think either of these requires continuing to evaluate unrealistic usage scenarios against services running in a severely resource-starved environment. Every service eventually falls over when too much is asked of it. These kinds of failures are not likely to be particularly deterministic, so wouldn't it make sense to avoid triggering them in the gate as much as possible?

In the specific case of Neutron, the current approach to test isolation involves creating and tearing down networks at a tremendous rate. I'm not sure anyone can argue that this constitutes a usage scenario likely to appear in production, but because it causes problems in the gate, we've had to prioritize working on it over initiatives that might prove more useful to operators.
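(For what it's worth, the "better handling of resource starvation (for example, rate-limiting)" mentioned above could be as simple as a token bucket in front of the API. The sketch below is purely illustrative -- a generic technique, not anything Neutron actually implements; the numbers are made up:)

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: permits short bursts up to `capacity`
    while enforcing a long-run average of `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens replenished per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Add tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller sheds load (e.g. returns 429) instead of timing out

# A service under starvation rejects the excess burst rather than falling over:
bucket = TokenBucket(rate=10, capacity=5)
accepted = sum(bucket.allow() for _ in range(20))
print(accepted)  # roughly the burst capacity; the rest is rejected
```

The point being that a deterministic "no" under overload is far friendlier to the gate than a nondeterministic timeout.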
While this may have been a necessary stop on the road to Neutron stability, I think it may be worth considering whether we want the gate to continue having an outsized role in defining optimization priorities.

Thoughts?


m.

On Dec 12, 2013, at 11:23 AM, Robert Collins wrote:

> A few times now we've run into patches for devstack-gate / devstack
> that change default configuration to handle 'tempest load'.
>
> For instance - https://review.openstack.org/61137 (Sorry Salvatore I'm
> not picking on you really!)
>
> So there appears to be a meme that the gate is particularly stressful
> - a bad environment - and that real world situations have less load.
>
> This could happen a few ways: (a) deployers might separate out
> components more; (b) they might have faster machines; (c) they might
> have less concurrent activity.
>
> (a) - unlikely! Deployers will cram stuff together as much as they can
> to save overheads. Big clouds will have components split out - yes,
> but they will also have correspondingly more load to drive that split
> out.
>
> (b) Perhaps, but not orders of magnitude faster; the clouds we run on
> are running on fairly recent hardware, and by using big instances we
> don't get crammed in with that many other tenants.
>
> (c) Almost certainly not. Tempest currently does a maximum of four
> concurrent requests. A small business cloud could easily have 5 or 6
> people making concurrent requests from time to time, and bigger but
> not huge clouds will certainly have that. Their /average/ rate of API
> requests may be much lower, but when they point service orchestration
> tools at it -- particularly tools that walk their dependencies in
> parallel -- load is going to be much much higher than what we generate
> with Tempest.
> tl;dr: if we need to change a config file setting in devstack-gate or
> devstack *other than* setting up the specific scenario, think thrice --
> should it be a production default and set in the relevant project's
> default config setting?
>
> Cheers,
> Rob
> --
> Robert Collins
> Distinguished Technologist
> HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Tempest][Production] Tempest / the gate / real world load
Robert,

As you've deliberately picked on me I feel compelled to reply! Jokes apart, I am going to retire that patch and push the new default into Neutron.

Regardless of considerations on real loads vs gate loads, I think it is correct to assume that the default configuration should be one that allows the gate tests to pass. A sort of maximum common denominator, if you want. I think, however, that the discussion on whether our gate tests are representative of real-world deployments is outside the scope of this thread, even if very interesting.

On the specific matter of this patch, we've been noticing the CPU on the gate tests with Neutron easily reaching 100%; this is not because of (b). I can indeed replicate the same behaviour on any other VM, even with twice as many vCPUs. I never tried bare metal, though. However, the fact that 'just' the gate tests send the CPU on a single host to 100% should make us think that deployers might easily end up facing the same problem in a real environment (your (a) point), regardless of how the components are split.

Thankfully, Armando found a related issue with the DHCP agent which was causing it to use a lot of CPU as well as terribly stressing ovsdb-server, and fixed it. Since then we're seeing a lot fewer timeout errors on the gate.

Salvatore

On 12 December 2013 20:23, Robert Collins wrote:

> A few times now we've run into patches for devstack-gate / devstack
> that change default configuration to handle 'tempest load'.
>
> For instance - https://review.openstack.org/61137 (Sorry Salvatore I'm
> not picking on you really!)
>
> So there appears to be a meme that the gate is particularly stressful
> - a bad environment - and that real world situations have less load.
>
> This could happen a few ways: (a) deployers might separate out
> components more; (b) they might have faster machines; (c) they might
> have less concurrent activity.
>
> (a) - unlikely! Deployers will cram stuff together as much as they can
> to save overheads.
> Big clouds will have components split out - yes,
> but they will also have correspondingly more load to drive that split
> out.
>
> (b) Perhaps, but not orders of magnitude faster; the clouds we run on
> are running on fairly recent hardware, and by using big instances we
> don't get crammed in with that many other tenants.
>
> (c) Almost certainly not. Tempest currently does a maximum of four
> concurrent requests. A small business cloud could easily have 5 or 6
> people making concurrent requests from time to time, and bigger but
> not huge clouds will certainly have that. Their /average/ rate of API
> requests may be much lower, but when they point service orchestration
> tools at it -- particularly tools that walk their dependencies in
> parallel -- load is going to be much much higher than what we generate
> with Tempest.
>
> tl;dr: if we need to change a config file setting in devstack-gate or
> devstack *other than* setting up the specific scenario, think thrice --
> should it be a production default and set in the relevant project's
> default config setting?
>
> Cheers,
> Rob
> --
> Robert Collins
> Distinguished Technologist
> HP Converged Cloud
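(The principle Salvatore accepts here -- ship the gate-safe value as the project's own default, and let deployers override it, rather than carrying a special case in devstack-gate -- can be sketched generically. The section and option names below are made up purely for illustration; real OpenStack projects do this through oslo.config rather than stdlib configparser:)

```python
import configparser

# Illustrative only: the project ships a sane, gate-safe default in code,
# and each deployment optionally overrides it in its own config file.
# "agent" / "report_interval" are hypothetical names, not real Neutron options.
DEFAULTS = {"agent": {"report_interval": "30"}}

def load_config(override_text=""):
    cfg = configparser.ConfigParser()
    cfg.read_dict(DEFAULTS)              # project default: the gate passes with this
    if override_text:
        cfg.read_string(override_text)   # deployer override, per environment
    return cfg

# With no override, everyone (including the gate) gets the shipped default:
cfg = load_config()
print(cfg.get("agent", "report_interval"))
```

The gate then exercises exactly what a fresh production deployment would get, which is Robert's point.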
[openstack-dev] [Tempest][Production] Tempest / the gate / real world load
A few times now we've run into patches for devstack-gate / devstack that change default configuration to handle 'tempest load'.

For instance - https://review.openstack.org/61137 (Sorry Salvatore I'm not picking on you really!)

So there appears to be a meme that the gate is particularly stressful - a bad environment - and that real world situations have less load. This could happen a few ways: (a) deployers might separate out components more; (b) they might have faster machines; (c) they might have less concurrent activity.

(a) - unlikely! Deployers will cram stuff together as much as they can to save overheads. Big clouds will have components split out - yes, but they will also have correspondingly more load to drive that split out.

(b) Perhaps, but not orders of magnitude faster; the clouds we run on are running on fairly recent hardware, and by using big instances we don't get crammed in with that many other tenants.

(c) Almost certainly not. Tempest currently does a maximum of four concurrent requests. A small business cloud could easily have 5 or 6 people making concurrent requests from time to time, and bigger but not huge clouds will certainly have that. Their /average/ rate of API requests may be much lower, but when they point service orchestration tools at it -- particularly tools that walk their dependencies in parallel -- load is going to be much much higher than what we generate with Tempest.

tl;dr: if we need to change a config file setting in devstack-gate or devstack *other than* setting up the specific scenario, think thrice -- should it be a production default and set in the relevant project's default config setting?

Cheers,
Rob
--
Robert Collins
Distinguished Technologist
HP Converged Cloud
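(To make point (c) concrete: a dependency-parallel orchestration run behaves like a burst of simultaneous API calls. Even a toy model -- hypothetical resources, simulated latency, no real API involved -- easily exceeds Tempest's four-worker ceiling:)

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class ConcurrencyMeter:
    """Tracks the peak number of simultaneously in-flight calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self._in_flight += 1
            self.peak = max(self.peak, self._in_flight)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._in_flight -= 1
        return False

def create_resource(name, meter):
    # Stand-in for one API call made by an orchestration tool.
    with meter:
        time.sleep(0.05)  # simulated server-side latency
    return name

# Ten independent resources at the same dependency depth are launched
# at once -- a burst well above Tempest's four concurrent workers.
meter = ConcurrencyMeter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda n: create_resource(n, meter), range(10)))

print(meter.peak)
```

A real tool walking a wide dependency graph produces exactly this shape of load, which is why the gate's four workers are, if anything, on the gentle side.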