[openstack-dev] [tripleo] Pike deployment times VS Ocata
This week on "science confirms the obvious": containerized deployments are faster, more reliable, and scale better. Really great work everyone; TripleO deployment at 50 nodes is starting to get boring!

Here's the report, comment if you find anything confusing or inaccessible. https://docs.google.com/document/d/1Jy0q_DnFXL27Ftkr7W-5wtLFPlXHl3EU0B2Kgs5up2Q/edit#

In late January I'm going to take another crack at this using OVB and see what breaks at a few hundred nodes.

- Justin

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata
On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur wrote:
> This number of "300", does it come from your testing or from other sources?
> If the former, which driver were you using? What exactly problems have you
> seen approaching this number?

I haven't encountered this issue personally, but talking to Joe Talerico and some operators at summit, around this number a single conductor begins to fall behind polling all of the out-of-band interfaces for the machines it's responsible for. You start to see what you would expect from polling running behind: incorrect power states listed for machines and a general inability to perform machine operations in a timely manner.

Having spent some time at the Ironic operators forum, this is pretty normal, and the correct response is just to scale out conductors. It's a problem for TripleO because we don't really have a scale-out option with a single-machine design. Fortunately, just increasing the time between interface polling acts as a pretty good stopgap and lets Ironic catch up.

I may get some time on a cloud of that scale in the future, at which point I will have hard numbers to give you. One of the reasons I made YODA was the frustrating prevalence of anecdotes instead of hard data when it came to one of the most important parts of the user experience. If it doesn't deploy, people don't use it, full stop.

> Could you please elaborate? (a bug could also help). What exactly were you
> doing?

https://bugs.launchpad.net/ironic/+bug/1680725 describes exactly what I'm experiencing. Essentially the problem is that nodes can and do fail to PXE boot, then cleaning fails and you just lose the nodes. Users have to spend time going back and babysitting these nodes, and there are no good instructions on what to do with failed nodes anyway. The answer is to move them to "manageable" and then to "available", at which point they go back into cleaning until it finally works.
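The polling-falls-behind failure mode above can be sketched with back-of-envelope arithmetic. This is my own illustration, not a measurement from the thread: the per-BMC query time and worker count are assumed numbers, and the interval corresponds to the `sync_power_state_interval` option in ironic.conf, which is the usual knob for the stopgap described.

```python
# Back-of-envelope estimate of when a single Ironic conductor falls
# behind on power-state polling. All concrete numbers here are
# illustrative assumptions, not measurements from the thread.

def max_nodes_per_conductor(poll_interval_s, seconds_per_bmc_query, workers=8):
    """Nodes one conductor can poll before the next sync cycle begins."""
    # Each worker can complete interval / query_time polls per cycle.
    return int(workers * poll_interval_s / seconds_per_bmc_query)

# With a 60 s sync interval and an assumed 2 s IPMI round trip per node,
# 8 workers keep up with about 240 nodes -- the same ballpark as the
# ~300-node ceiling reported above.
print(max_nodes_per_conductor(60, 2.0))    # 240

# Doubling the interval doubles the headroom: the stopgap described above.
print(max_nodes_per_conductor(120, 2.0))   # 480
```

The real fix, as noted, is more conductors; raising the interval only trades power-state freshness for capacity.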
Like introspection was a year ago, this is a cavalcade of documentation problems and software issues. I mean, really, everything *works* technically, but the documentation acts like cleaning will work all the time, and so does the software, leaving the user to figure out how to accommodate the realities of the situation without so much as a warning that it might happen. This comes out as more of a UX issue than a software one, but we can't just ignore these.

- Justin
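The manageable-then-available recycling described above can be sketched as a small loop. This is my own illustration of the workaround, not an official procedure; `set_provision_state` is a hypothetical stand-in for the `openstack baremetal node manage` / `openstack baremetal node provide` CLI calls.

```python
# Sketch of the "recycle failed nodes" workaround described above: move
# each node stuck in "clean failed" to manageable, then back toward
# available so cleaning runs again. set_provision_state() is a
# hypothetical callback standing in for the real baremetal CLI/API.

def recover_failed_nodes(nodes, set_provision_state):
    """Return the nodes that were pushed back into cleaning."""
    recycled = []
    for node, state in nodes.items():
        if state == "clean failed":
            set_provision_state(node, "manage")    # -> manageable
            set_provision_state(node, "provide")   # -> available (re-cleans)
            recycled.append(node)
    return recycled

# Toy usage: record the calls instead of talking to a real Ironic.
calls = []
nodes = {"node-1": "available", "node-2": "clean failed"}
recover_failed_nodes(nodes, lambda n, verb: calls.append((n, verb)))
print(calls)  # [('node-2', 'manage'), ('node-2', 'provide')]
```

In practice this may need to repeat, since (as the message says) cleaning can fail again until it finally works.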
Re: [openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata
Hi Emilien, I'll try to get a list of the Perf team's TripleO deployment bugs and bring them to the deployment hackfest. I look forward to participating!

- Justin

On Thu, Jun 8, 2017 at 11:10 AM, Emilien Macchi <emil...@redhat.com> wrote:
> On Thu, Jun 8, 2017 at 2:21 PM, Justin Kilpatrick <jkilp...@redhat.com> wrote:
>> Morning everyone,
>>
>> I've been working on a performance testing tool for TripleO hardware
>> provisioning operations off and on for about a year now, and I've been
>> using it to try to collect more detailed data about how TripleO
>> performs in scale and production use cases. Perhaps more importantly,
>> YODA (Yet Openstack Deployment Tool, Another) automates the task
>> enough that days of deployment testing is a set-it-and-forget-it
>> operation.
>>
>> You can find my testing tool here [0], and the test report [1] has
>> links to raw data and visualization. Just scroll down, click the
>> captcha and click "go to kibana". I still need to port that machine
>> from my own solution over to Search Guard.
>>
>> If you have too much email to consider clicking links, I'll copy the
>> results summary here.
>>
>> TripleO inspection workflows have seen massive improvements since
>> Newton, with the failure rate for 50 nodes with the default workflow
>> falling from 100% to <15%. Using patches slated for Pike, that
>> spurious failure rate reaches zero.
>>
>> Overcloud deployments show a significant improvement in deployment
>> speed in HA and stack update tests.
>>
>> Ironic deployments in the overcloud allow the use of Ironic for bare
>> metal scale-out alongside more traditional VM compute. Considering a
>> single conductor starts to struggle around 300 nodes, it will be
>> difficult to push a multi-conductor setup to its limits.
>>
>> Finally, Ironic node cleaning shows a similar failure rate to
>> inspection and will require similar attention in TripleO workflows to
>> become painless.
>>
>> [0] https://review.openstack.org/#/c/384530/
>> [1] https://docs.google.com/document/d/194ww0Pi2J-dRG3-X75mphzwUZVPC2S1Gsy1V0K0PqBo/
>>
>> Thanks for your time!
>
> Hey Justin,
>
> All of this is really cool. I was wondering if you had a list of bugs
> that you've faced or reported yourself regarding performance
> issues in TripleO.
> As you might have seen in a separate thread on openstack-dev, we're
> planning a sprint on June 21/22 to improve performance in TripleO.
> We would love your participation, or someone from your team's, and if
> you have time before, please add the deployment-time tag to the
> Launchpad bugs that you know are related to performance.
>
> Thanks a lot,
>
>> - Justin
>
> --
> Emilien Macchi
[openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata
Morning everyone,

I've been working on a performance testing tool for TripleO hardware provisioning operations off and on for about a year now, and I've been using it to try to collect more detailed data about how TripleO performs in scale and production use cases. Perhaps more importantly, YODA (Yet Openstack Deployment Tool, Another) automates the task enough that days of deployment testing is a set-it-and-forget-it operation.

You can find my testing tool here [0], and the test report [1] has links to raw data and visualization. Just scroll down, click the captcha and click "go to kibana". I still need to port that machine from my own solution over to Search Guard.

If you have too much email to consider clicking links, I'll copy the results summary here.

TripleO inspection workflows have seen massive improvements since Newton, with the failure rate for 50 nodes with the default workflow falling from 100% to <15%. Using patches slated for Pike, that spurious failure rate reaches zero.

Overcloud deployments show a significant improvement in deployment speed in HA and stack update tests.

Ironic deployments in the overcloud allow the use of Ironic for bare metal scale-out alongside more traditional VM compute. Considering a single conductor starts to struggle around 300 nodes, it will be difficult to push a multi-conductor setup to its limits.

Finally, Ironic node cleaning shows a similar failure rate to inspection and will require similar attention in TripleO workflows to become painless.

[0] https://review.openstack.org/#/c/384530/
[1] https://docs.google.com/document/d/194ww0Pi2J-dRG3-X75mphzwUZVPC2S1Gsy1V0K0PqBo/

Thanks for your time!

- Justin
Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs
More nodes is always better, but I don't think we need to push the host cloud to its absolute limits right away. I have a list of several pain points I expect to find with just 30-ish nodes that should keep us busy for a while.

I think the optimizations are a good idea though, especially if we want to pave the way for the next level of this sort of effort: devs being able to ask for a 'scale ci' run on Gerrit and schedule a decent-sized job for whenever it's convenient. The closer we can get devs to large environments on demand, the faster and easier these issues can be solved. But for now, baby steps.

On Wed, Apr 19, 2017 at 12:30 PM, Ben Nemec <openst...@nemebean.com> wrote:
> TLDR: We have the capacity to do this. One scale job can be absorbed into
> our existing test infrastructure with minimal impact.
>
> On 04/19/2017 07:50 AM, Flavio Percoco wrote:
>> On 18/04/17 14:28 -0400, Emilien Macchi wrote:
>>> On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick
>>> <jkilp...@redhat.com> wrote:
>>>> Because CI jobs tend to max out at about 5 nodes, there's a whole
>>>> class of minor bugs that make it into releases.
>>>>
>>>> What happens is that they never show up in small clouds; then, when
>>>> they do show up in larger testing clouds, the people deploying those
>>>> simply work around the issue and get on with what they were supposed
>>>> to be testing. These workarounds do get documented/BZ'd, but since
>>>> they don't block anyone and only show up in large environments, they
>>>> become hard for developers to fix.
>>>>
>>>> So the issue gets stuck in limbo, with nowhere to test a patchset and
>>>> no one owning the issue.
>>>>
>>>> These issues pile up, and pretty soon there is a significant
>>>> difference between the default documented workflow and the 'scale'
>>>> workflow, which is filled with workarounds that may or may not be
>>>> documented upstream.
>>>>
>>>> I'd like to propose giving these issues more visibility by having a
>>>> periodic upstream job that uses 20-30 OVB instances to do a larger
>>>> deployment. Maybe at 3am on a Sunday, or some other time where
>>>> there's idle execution capacity to exploit. The goal being to make
>>>> these sorts of issues more visible and hopefully get better at
>>>> fixing them.
>>>
>>> Wait no, I know some folks at 3am on a Saturday night who use TripleO
>>> CI (ok that was a joke).
>>
>> Jokes apart, it really depends on the TZ and when you schedule it.
>> 3:00 UTC on a Sunday is Monday 13:00 in Sydney :) Saturdays might work
>> better, but remember that some countries work on Sundays.
>
> With the exception of the brief period where the OVB jobs were running at
> full capacity 24 hours a day, there has always been a lull in activity
> during early morning UTC. Yes, there are people working during that time,
> but generally far fewer, and the load on TripleO CI is at its lowest
> point. Honestly, I'd be okay running this scale job every night, not just
> on the weekend. A week of changes is a lot to sift through if a scaling
> issue creeps into one of the many, many projects that affect such things
> in TripleO.
>
> Also, I should note that we're not currently being constrained by
> absolute hardware limits in rh1. The reason I haven't scaled our
> concurrent jobs higher is that there is already performance degradation
> when we have a full 70 jobs running at once. This type of scale job would
> require a lot of theoretical resources, but those 30 compute nodes are
> mostly going to be sitting there idle while the controller(s) get
> deployed, so in reality their impact on the infrastructure is going to be
> less than if we just added more concurrent jobs that used 30 additional
> nodes. And we do have the memory/cpu/disk to spare in rh1 to spin up
> more VMs.
>
> We could also take advantage of heterogeneous OVB environments now, so
> that the compute nodes are only 3 GB VMs instead of 8 as they are now.
> That would further reduce the impact of this sort of job. It would
> require some tweaks to how the testenvs are created, but that shouldn't
> be a problem.
>
>>>> To be honest I'm not sure this is the best solution, but I'm seeing
>>>> this anti-pattern across several issues and I think we should try to
>>>> come up with a solution.
>>>
>>> Yes this proposal is really cool. There is an alternative to run this
>>> periodic scenario outside
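Ben's heterogeneous-OVB sizing point can be sanity-checked with quick arithmetic. The node count and VM sizes are the ones quoted in the thread; everything else is trivial:

```python
# Quick arithmetic for the heterogeneous-OVB suggestion above: shrinking
# the 30 mostly-idle compute-node VMs from 8 GB to 3 GB frees a large
# chunk of host-cloud memory per scale job. Numbers are from the thread.

compute_nodes = 30
full_size_gb = 8   # current testenv compute VM size
slim_size_gb = 3   # proposed heterogeneous-OVB compute VM size

saved = compute_nodes * (full_size_gb - slim_size_gb)
print(f"{saved} GB of RAM freed per scale job")  # 150 GB
```

That saving is roughly the memory footprint of several ordinary CI jobs, which is why the scale job's real impact is smaller than its nominal node count suggests.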
[openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs
Because CI jobs tend to max out at about 5 nodes, there's a whole class of minor bugs that make it into releases.

What happens is that they never show up in small clouds; then, when they do show up in larger testing clouds, the people deploying those simply work around the issue and get on with what they were supposed to be testing. These workarounds do get documented/BZ'd, but since they don't block anyone and only show up in large environments, they become hard for developers to fix.

So the issue gets stuck in limbo, with nowhere to test a patchset and no one owning the issue.

These issues pile up, and pretty soon there is a significant difference between the default documented workflow and the 'scale' workflow, which is filled with workarounds that may or may not be documented upstream.

I'd like to propose giving these issues more visibility by having a periodic upstream job that uses 20-30 OVB instances to do a larger deployment. Maybe at 3am on a Sunday, or some other time where there's idle execution capacity to exploit. The goal being to make these sorts of issues more visible and hopefully get better at fixing them.

To be honest I'm not sure this is the best solution, but I'm seeing this anti-pattern across several issues and I think we should try to come up with a solution.
Re: [openstack-dev] [tripleo] pingtest vs tempest
>>>> serve our needs, because we run TripleO CI on a virtualized environment
>>>> with very limited resources. Actually, we are pretty close to fully
>>>> utilizing these resources when deploying openstack, so very little is
>>>> available for tests. It's not a problem to run tempest API tests
>>>> because they are cheap -- they take little time and few resources, but
>>>> they also give little coverage. Scenario tests are more interesting and
>>>> give us more coverage, but they also take a lot of resources (which we
>>>> don't have sometimes).
>>>
>>> Sagi,
>>> In my original message I mentioned a "targeted" test; I should have
>>> explained that more. We could configure the specific scenario so that
>>> the load on the virt overcloud would be minimal. Justin Kilpatrick
>>> already has Browbeat integrated with TripleO Quickstart [1], so there
>>> shouldn't be much work to try this proposed solution.
>>>
>>>> It may be useful to run a "limited edition" of API tests that
>>>> maximizes coverage and doesn't duplicate -- for example, just checking
>>>> that each service basically works, without covering all of its
>>>> functionality. It would take very little time (i.e. 5 tests for each
>>>> service) and would give a general picture of deployment success. It
>>>> would cover fields that are not covered by pingtest as well.
>>>>
>>>> I think it could be an option to develop special scenario tempest
>>>> tests for TripleO which would fit our needs.
>>>
>>> I haven't looked at Tempest in a long time, so maybe its functionality
>>> has improved. I just saw the opportunity to integrate Browbeat/Rally
>>> into CI to test the functionality of OpenStack services, while also
>>> capturing performance metrics.
>>>
>>> Joe
>>>
>>> [1] https://github.com/openstack/browbeat/tree/master/ci-scripts
>>>
>>>> Thanks
>>>>
>>>> On Wed, Apr 5, 2017 at 11:49 PM, Emilien Macchi <emil...@redhat.com>
>>>> wrote:
>>>>>
>>>>> Greetings dear owls,
>>>>>
>>>>> I would like to bring back an old topic: running tempest in the gate.
>>>>>
>>>>> == Context
>>>>>
>>>>> Right now, the TripleO gate is running something called pingtest to
>>>>> validate that the OpenStack cloud is working. It's a Heat stack that
>>>>> deploys a Nova server, some volumes, a Glance image, a Neutron
>>>>> network, and sometimes a little bit more.
>>>>> To deploy the pingtest, you obviously need Heat deployed in your
>>>>> overcloud.
>>>>>
>>>>> == Problems
>>>>>
>>>>> Although pingtest has been very helpful over the last years:
>>>>> - easy to understand: it's a Heat template, like an OpenStack user
>>>>> would write to deploy their apps.
>>>>> - fast: the stack takes a few minutes to be created and validated.
>>>>>
>>>>> It has some limitations:
>>>>> - limited to what Heat resources support (example: some OpenStack
>>>>> resources can't be managed from Heat).
>>>>> - impossible to run a dynamic workflow (test a live migration, for
>>>>> example).
>>>>>
>>>>> == Solutions
>>>>>
>>>>> 1) Switch pingtest to a Tempest run of some specific tests, with
>>>>> feature parity with what we had with pingtest.
>>>>> For example, we could imagine running the scenarios that deploy a VM
>>>>> and boot from volume. It would test the same thing as pingtest
>>>>> (details can be discussed here).
>>>>> Each scenario would run more tests depending on the services that
>>>>> they run (scenario001 is telemetry, so it would run some tempest
>>>>> tests for Ceilometer, Aodh, Gnocchi, etc).
>>>>> We should work at making the tempest run as short as possible, and the >>>>>
Re: [openstack-dev] [tripleo] pingtest vs tempest
Maybe I'm getting a little off topic with this question, but why was Tempest removed last time? I'm not well versed in the history of this discussion, but from what I understand, Tempest in the gate has been an off-and-on thing for a while, and I've never heard the story of why it got removed.

On Thu, Apr 6, 2017 at 7:00 AM, Chris Dent wrote:
> On Thu, 6 Apr 2017, Sagi Shnaidman wrote:
>
>> It may be useful to run a "limited edition" of API tests that maximize
>> coverage and don't duplicate, for example just to check service working
>> basically, without covering all its functionality. It will take very
>> little time (i.e. 5 tests for each service) and will give a general
>> picture of deployment success. It will cover fields that are not covered
>> by pingtest as well.
>
> It sounds like using some parts of tempest is perhaps the desired
> thing here, but in case a "limited edition" test against the APIs to
> do what amounts to a smoke test is desired, it might be worthwhile
> to investigate using gabbi [1] and its command-line gabbi-run [2] tool
> for some fairly simple and readable tests that can describe a sequence
> of API interactions. There are lots of tools that can do the same
> thing, so gabbi may not be the right choice, but it's there as an
> option.
>
> The telemetry group had (and may still have) some integration tests
> that use gabbi files to integrate ceilometer, heat (starting some
> vms), aodh and gnocchi and confirm that the expected flow happened.
> Since the earlier raw scripts I think there's been some integration
> with tempest, but gabbi files are still used [3].
>
> If this might be useful and I can help out, please ask.
>
> [1] http://gabbi.readthedocs.io/
> [2] http://gabbi.readthedocs.io/en/latest/runner.html
> [3] https://github.com/openstack/ceilometer/tree/master/ceilometer/tests/integration
>
> --
> Chris Dent ¯\_(ツ)_/¯ https://anticdent.org/
> freenode: cdent tw: @anticdent
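For readers unfamiliar with gabbi, a smoke-test file of the sort Chris describes is a small YAML document run with `gabbi-run host:port < file.yaml`. This sketch is mine, not from the thread; the endpoint paths and the expected JSON field are assumptions for illustration:

```yaml
# Illustrative gabbi smoke test (not from the thread; paths and the
# version field are assumptions). Run against a host with:
#   gabbi-run controller:8774 < smoke.yaml
tests:
  - name: api root responds
    GET: /
    status: 200

  - name: versions document is json
    GET: /
    status: 200
    response_headers:
      content-type: /application/json/
```

Each test in the file runs in order against the target host, which is what makes gabbi convenient for the "5 quick checks per service" style of deployment smoke test discussed above.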