[openstack-dev] [tripleo] Pike deployment times VS Ocata

2017-12-11 Thread Justin Kilpatrick
This week on "science confirms the obvious": containerized deployments
are faster, more reliable, and scale better.

Really great work, everyone; TripleO deployment at 50 nodes is starting
to get boring!

Here's the report, comment if you find anything confusing or inaccessible.

https://docs.google.com/document/d/1Jy0q_DnFXL27Ftkr7W-5wtLFPlXHl3EU0B2Kgs5up2Q/edit#

In late January I'm going to take another crack at this using OVB and
see what breaks at a few hundred nodes.


- Justin



Re: [openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata

2017-06-09 Thread Justin Kilpatrick
On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur  wrote:
> Does this number of "300" come from your testing or from other sources?
> If the former, which driver were you using? Exactly what problems have
> you seen approaching this number?

I haven't encountered this issue personally, but from talking to Joe
Talerico and some operators at summit, around this number a single
conductor begins to fall behind when polling all of the out-of-band
interfaces for the machines it's responsible for. You start to see
what you would expect from polling running behind: incorrect power
states listed for machines and a general inability to perform machine
operations in a timely manner.

Having spent some time at the Ironic operators forum, I can say this is
pretty normal, and the correct response is simply to scale out
conductors. That's a problem for TripleO because we don't really have a
scale-out option with a single-machine design. Fortunately, just
increasing the time between interface polling acts as a pretty good
stopgap and lets Ironic catch up.
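
For anyone who wants to try that stopgap, the knob I have in mind is
the conductor's power state sync interval in ironic.conf on the
undercloud. From memory it is something like the following (the exact
option name and a sensible value are worth double checking against
your release):

    [conductor]
    # poll the out of band interfaces every 5 minutes instead of every 60s
    sync_power_state_interval = 300

followed by a restart of openstack-ironic-conductor to pick it up.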

I may get some time on a cloud of that scale in the future, at which
point I will have hard numbers to give you. One of the reasons I made
YODA was the frustrating prevalence of anecdotes instead of hard data
about one of the most important parts of the user experience: if it
doesn't deploy, people don't use it, full stop.

> Could you please elaborate? (a bug could also help). What exactly were you
> doing?

https://bugs.launchpad.net/ironic/+bug/1680725

It describes exactly what I'm experiencing. Essentially the problem is
that nodes can and do fail to PXE, then cleaning fails and you just
lose the nodes. Users have to spend time going back and babysitting
these nodes, and there are no good instructions on what to do with
failed nodes anyway. The answer is to move them to manageable and then
to available, at which point they go back into cleaning until it
finally works.
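
From memory that shuffle looks roughly like the following (state names
and exact syntax are worth double checking against your release):

    # node is stuck in "clean failed" after a failed clean
    openstack baremetal node manage <node-uuid>
    # moving it back to available kicks cleaning off again
    openstack baremetal node provide <node-uuid>

Repeat until cleaning actually succeeds, which is exactly the
babysitting nobody should have to do.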

Like introspection a year ago, this is a cavalcade of documentation
problems and software issues. Everything technically *works*, but the
documentation acts as if cleaning will succeed every time, and so does
the software, leaving users to figure out how to accommodate the
realities of the situation without so much as a warning that failures
can happen.

This comes out as more of a UX issue than a software one, but we can't
just ignore it.

- Justin



Re: [openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata

2017-06-08 Thread Justin Kilpatrick
Hi Emilien,

I'll try to get a list of the Perf team's TripleO deployment
bugs and bring them to the deployment hackfest.

I look forward to participating!

- Justin

On Thu, Jun 8, 2017 at 11:10 AM, Emilien Macchi <emil...@redhat.com> wrote:
> On Thu, Jun 8, 2017 at 2:21 PM, Justin Kilpatrick <jkilp...@redhat.com> wrote:
>> Morning everyone,
>>
>> I've been working on a performance testing tool for TripleO hardware
>> provisioning operations off and on for about a year now, and I've been
>> using it to try to collect more detailed data about how TripleO
>> performs in scale and production use cases. Perhaps more importantly,
>> YODA (Yet Openstack Deployment Tool, Another) automates the task
>> enough that days of deployment testing are a set-it-and-forget-it
>> operation.
>>
>> You can find my testing tool here [0], and the test report [1] has
>> links to raw data and visualization. Just scroll down, click the
>> captcha and click "go to kibana". I still need to port that machine
>> from my own solution over to Search Guard.
>>
>> If you have too much email to consider clicking links, I'll copy the
>> results summary here.
>>
>> TripleO inspection workflows have seen massive improvements since
>> Newton, with the failure rate for 50 nodes using the default workflow
>> falling from 100% to <15%. With patches slated for Pike, that spurious
>> failure rate reaches zero.
>>
>> Overcloud deployments show a significant improvement in deployment
>> speed in HA and stack update tests.
>>
>> Ironic deployments in the overcloud allow the use of Ironic for bare
>> metal scale-out alongside more traditional VM compute. Considering a
>> single conductor starts to struggle around 300 nodes, it will be
>> difficult to push a multi-conductor setup to its limits.
>>
>> Finally, Ironic node cleaning shows a similar failure rate to
>> inspection and will require similar attention in TripleO workflows to
>> become painless.
>>
>> [0] https://review.openstack.org/#/c/384530/
>> [1] 
>> https://docs.google.com/document/d/194ww0Pi2J-dRG3-X75mphzwUZVPC2S1Gsy1V0K0PqBo/
>>
>> Thanks for your time!
>
> Hey Justin,
>
> All of this is really cool. I was wondering if you had a list of bugs
> that you've faced or reported yourself regarding performance issues
> in TripleO.
> As you might have seen in a separate thread on openstack-dev, we're
> planning a sprint on June 21/22 to improve performance in TripleO.
> We would love your participation, or someone from your team's, and if
> you have time before then, please add the deployment-time tag to the
> Launchpad bugs that you know are related to performance.
>
> Thanks a lot,
>
>> - Justin
>>
>
>
>
> --
> Emilien Macchi
>



[openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata

2017-06-08 Thread Justin Kilpatrick
Morning everyone,

I've been working on a performance testing tool for TripleO hardware
provisioning operations off and on for about a year now, and I've been
using it to try to collect more detailed data about how TripleO
performs in scale and production use cases. Perhaps more importantly,
YODA (Yet Openstack Deployment Tool, Another) automates the task
enough that days of deployment testing are a set-it-and-forget-it
operation.

You can find my testing tool here [0], and the test report [1] has
links to raw data and visualization. Just scroll down, click the
captcha and click "go to kibana". I still need to port that machine
from my own solution over to Search Guard.

If you have too much email to consider clicking links, I'll copy the
results summary here.

TripleO inspection workflows have seen massive improvements since
Newton, with the failure rate for 50 nodes using the default workflow
falling from 100% to <15%. With patches slated for Pike, that spurious
failure rate reaches zero.
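
For anyone who wants to reproduce the inspection numbers, the default
workflow referred to here is the documented bulk introspection path,
which if memory serves boils down to something like:

    openstack overcloud node introspect --all-manageable --provide

run against all of the imported nodes at once.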

Overcloud deployments show a significant improvement in deployment
speed in HA and stack update tests.

Ironic deployments in the overcloud allow the use of Ironic for bare
metal scale-out alongside more traditional VM compute. Considering a
single conductor starts to struggle around 300 nodes, it will be
difficult to push a multi-conductor setup to its limits.

Finally, Ironic node cleaning shows a similar failure rate to
inspection and will require similar attention in TripleO workflows to
become painless.
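
If you want to see where your own environment stands after a cleaning
pass, listing nodes by provision state shows the stragglers; from
memory the filter looks something like this (exact flag and state
string are worth verifying against your ironicclient version):

    openstack baremetal node list --provision-state "clean failed"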

[0] https://review.openstack.org/#/c/384530/
[1] 
https://docs.google.com/document/d/194ww0Pi2J-dRG3-X75mphzwUZVPC2S1Gsy1V0K0PqBo/

Thanks for your time!

- Justin



Re: [openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-19 Thread Justin Kilpatrick
More nodes is always better, but I don't think we need to push the host
cloud to its absolute limits right away. I have a list of several
pain points I expect to find with just 30ish nodes that should keep us
busy for a while.

I think the optimizations are a good idea though, especially if we
want to pave the way for the next level of this sort of effort: devs
being able to ask for a "scale CI" run on Gerrit and schedule a
decent-sized job for whenever it's convenient. The closer we can get
devs to large environments on demand, the faster and easier these
issues can be solved.

But for now, baby steps.

On Wed, Apr 19, 2017 at 12:30 PM, Ben Nemec <openst...@nemebean.com> wrote:
> TLDR: We have the capacity to do this.  One scale job can be absorbed into
> our existing test infrastructure with minimal impact.
>
>
> On 04/19/2017 07:50 AM, Flavio Percoco wrote:
>>
>> On 18/04/17 14:28 -0400, Emilien Macchi wrote:
>>>
>>> On Mon, Apr 17, 2017 at 3:52 PM, Justin Kilpatrick
>>> <jkilp...@redhat.com> wrote:
>>>>
>>>> Because CI jobs tend to max out at about 5 nodes, there's a whole class
>>>> of minor bugs that make it into releases.
>>>>
>>>> What happens is that they never show up in small clouds; then, when
>>>> they do show up in larger testing clouds, the people deploying those
>>>> simply work around the issue and get on with what they were supposed
>>>> to be testing. These workarounds do get documented/BZ'd, but since
>>>> they don't block anyone and only show up in large environments they
>>>> become hard for developers to fix.
>>>>
>>>> So the issue gets stuck in limbo, with nowhere to test a patchset and
>>>> no one owning the issue.
>>>>
>>>> These issues pile up, and pretty soon there is a significant difference
>>>> between the default documented workflow and the 'scale' workflow, which
>>>> is filled with workarounds that may or may not be documented
>>>> upstream.
>>>>
>>>> I'd like to propose getting these issues more visibility by having a
>>>> periodic upstream job that uses 20-30 OVB instances to do a larger
>>>> deployment. Maybe at 3am on a Sunday, or some other time when there's
>>>> idle execution capability to exploit. The goal is to make these
>>>> sorts of issues more visible and hopefully get better at fixing them.
>>>
>>>
>>> Wait no, I know some folks at 3am on a Saturday night who use TripleO
>>> CI (ok that was a joke).
>>
>>
>> Jokes aside, it really depends on the TZ and when you schedule it. 3:00
>> UTC on a Sunday is Monday 13:00 in Sydney :) Saturdays might work better,
>> but remember that some countries work on Sundays.
>
>
> With the exception of the brief period where the ovb jobs were running at
> full capacity 24 hours a day, there has always been a lull in activity
> during early morning UTC.  Yes, there are people working during that time,
> but generally far fewer and the load on TripleO CI is at its lowest point.
> Honestly I'd be okay running this scale job every night, not just on the
> weekend.  A week of changes is a lot to sift through if a scaling issue
> creeps into one of the many, many projects that affect such things in
> TripleO.
>
> Also, I should note that we're not currently being constrained by absolute
> hardware limits in rh1.  The reason I haven't scaled our concurrent jobs
> higher is that there is already performance degradation when we have a full
> 70 jobs running at once.  This type of scale job would require a lot of
> theoretical resources, but those 30 compute nodes are mostly going to be
> sitting there idle while the controller(s) get deployed, so in reality their
> impact on the infrastructure is going to be less than if we just added more
> concurrent jobs that used 30 additional nodes.  And we do have the
> memory/cpu/disk to spare in rh1 to spin up more vms.
>
> We could also take advantage of heterogeneous OVB environments now so that
> the compute nodes are only 3 GB VMs instead of the 8 GB they are now. That
> would further reduce the impact of this sort of job.  It would require some
> tweaks to how the testenvs are created, but that shouldn't be a problem.
>
>>
>>>> To be honest I'm not sure this is the best solution, but I'm seeing
>>>> this anti-pattern across several issues, and I think we should try to
>>>> come up with a solution.
>>>>
>>>
>>> Yes this proposal is really cool. There is an alternative to run this
>>> periodic scenario outside 

[openstack-dev] [TripleO][CI] Bridging the production/CI workflow gap with large periodic CI jobs

2017-04-17 Thread Justin Kilpatrick
Because CI jobs tend to max out at about 5 nodes, there's a whole class
of minor bugs that make it into releases.

What happens is that they never show up in small clouds; then, when
they do show up in larger testing clouds, the people deploying those
simply work around the issue and get on with what they were supposed to
be testing. These workarounds do get documented/BZ'd, but since they
don't block anyone and only show up in large environments they become
hard for developers to fix.

So the issue gets stuck in limbo, with nowhere to test a patchset and
no one owning the issue.

These issues pile up, and pretty soon there is a significant difference
between the default documented workflow and the 'scale' workflow, which
is filled with workarounds that may or may not be documented
upstream.

I'd like to propose getting these issues more visibility by having a
periodic upstream job that uses 20-30 OVB instances to do a larger
deployment. Maybe at 3am on a Sunday, or some other time when there's
idle execution capability to exploit. The goal is to make these
sorts of issues more visible and hopefully get better at fixing them.
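
To make that concrete, the job I have in mind is just a plain OVB
overcloud deployment at a size CI never reaches today. As a sketch only
(the exact flags and environment files would follow whatever the
existing ovb jobs already pass), with the current scale flags it would
look something like:

    openstack overcloud deploy --templates \
      --control-scale 3 --compute-scale 27 \
      --ntp-server pool.ntp.org

which lands right at the 30-node mark where these scale-only issues
start to appear.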

To be honest I'm not sure this is the best solution, but I'm seeing
this anti-pattern across several issues, and I think we should try to
come up with a solution.



Re: [openstack-dev] [tripleo] pingtest vs tempest

2017-04-17 Thread Justin Kilpatrick
>>>> serve our needs, because we run TripleO CI on a virtualized environment
>>>> with very limited resources. Actually we are pretty close to fully
>>>> utilizing these resources when deploying OpenStack, so very little is
>>>> left over for tests.
>>>> It's not a problem to run tempest API tests because they are cheap:
>>>> they take little time and few resources, but they also give little
>>>> coverage. Scenario tests are more interesting and give us more
>>>> coverage, but they also take a lot of resources (which we sometimes
>>>> don't have).
>>>
>>>
>>> Sagi,
>>> In my original message I mentioned a "targeted" test; I should have
>>> explained that more. We could configure the specific scenario so that
>>> the load on the virt overcloud would be minimal. Justin Kilpatrick
>>> already has Browbeat integrated with TripleO Quickstart [1], so there
>>> shouldn't be much work to try this proposed solution.
>>>
>>>>
>>>> It may be useful to run a "limited edition" of API tests that maximize
>>>> coverage and don't duplicate each other, for example just to check that
>>>> each service is basically working, without covering all of its
>>>> functionality. It would take very little time (e.g. 5 tests per
>>>> service) and would give a general picture of deployment success. It
>>>> would also cover areas that are not covered by pingtest.
>>>>
>>>> I think it could be an option to develop special scenario tempest
>>>> tests for TripleO which would fit our needs.
>>>
>>>
>>> I haven't looked at Tempest in a long time, so maybe its functionality
>>> has improved. I just saw the opportunity to integrate Browbeat/Rally
>>> into CI to test the functionality of OpenStack services, while also
>>> capturing performance metrics.
>>>
>>> Joe
>>>
>>> [1]
>>> https://github.com/openstack/browbeat/tree/master/ci-scripts
>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Wed, Apr 5, 2017 at 11:49 PM, Emilien Macchi <emil...@redhat.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> Greetings dear owls,
>>>>>
>>>>> I would like to bring back an old topic: running tempest in the gate.
>>>>>
>>>>> == Context
>>>>>
>>>>> Right now, the TripleO gate is running something called pingtest to
>>>>> validate that the OpenStack cloud is working. It's a Heat stack that
>>>>> deploys a Nova server, some volumes, a Glance image, a Neutron network
>>>>> and sometimes a little bit more.
>>>>> To deploy the pingtest, you obviously need Heat deployed in your
>>>>> overcloud.
>>>>>
>>>>> == Problems:
>>>>>
>>>>> Although pingtest has been very helpful over the last few years:
>>>>> - easy to understand: it's a Heat template, like one an OpenStack user
>>>>> would write to deploy their apps.
>>>>> - fast: the stack takes a few minutes to be created and validated
>>>>>
>>>>> It has some limitations:
>>>>> - Limited to what Heat resources support (for example, some OpenStack
>>>>> resources can't be managed from Heat)
>>>>> - Impossible to run a dynamic workflow (testing a live migration, for
>>>>> example)
>>>>>
>>>>> == Solutions
>>>>>
>>>>> 1) Switch pingtest to a Tempest run of some specific tests, with
>>>>> feature parity with what we had with pingtest.
>>>>> For example, we could imagine running the scenarios that deploy a VM
>>>>> and boot from a volume. It would test the same thing as pingtest
>>>>> (details can be discussed here).
>>>>> Each scenario would run more tests depending on the services it covers
>>>>> (scenario001 is telemetry, so it would run some tempest tests for
>>>>> Ceilometer, Aodh, Gnocchi, etc).
>>>>> We should work at making the tempest run as short as possible, and the
>>>>>

Re: [openstack-dev] [tripleo] pingtest vs tempest

2017-04-06 Thread Justin Kilpatrick
Maybe I'm getting a little off topic with this question, but why was
Tempest removed last time?

I'm not well versed in the history of this discussion, but from what I
understand, Tempest in the gate has been an off-and-on-again thing for
a while, and I've never heard the story of why it got removed.

On Thu, Apr 6, 2017 at 7:00 AM, Chris Dent  wrote:
> On Thu, 6 Apr 2017, Sagi Shnaidman wrote:
>
>> It may be useful to run a "limited edition" of API tests that maximize
>> coverage and don't duplicate each other, for example just to check that
>> each service is basically working, without covering all of its
>> functionality. It would take very little time (e.g. 5 tests per
>> service) and would give a general picture of deployment success. It
>> would also cover areas that are not covered by pingtest.
>
>
> It sounds like using some parts of tempest is perhaps the desired
> thing here, but if a "limited edition" test against the APIs amounting
> to a smoke test is desired, it might be worthwhile to investigate
> using gabbi[1] and its command-line gabbi-run[2] tool for some fairly
> simple and readable tests that can describe a sequence of API
> interactions. There are lots of tools that can do the same thing, so
> gabbi may not be the right choice, but it's there as an option.
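>
> To give a flavor of it, a gabbi file is just YAML listing requests and
> the responses you expect; a minimal sketch (assuming an endpoint that
> answers GET /v3 with a 200, keystone for example) looks roughly like:
>
>     tests:
>     - name: keystone is alive
>       GET: /v3
>       status: 200
>
> and gabbi-run points it at a live host with something like
>     gabbi-run keystone-host:5000 < smoke.yaml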
>
> The telemetry group had (and may still have) some integration tests
> that use gabbi files to integrate ceilometer, heat (starting some
> VMs), aodh and gnocchi and confirm that the expected flow happened.
> Since the earlier raw scripts I think there's been some integration
> with tempest, but gabbi files are still used[3].
>
> If this might be useful and I can help out, please ask.
>
> [1] http://gabbi.readthedocs.io/
> [2] http://gabbi.readthedocs.io/en/latest/runner.html
> [3]
> https://github.com/openstack/ceilometer/tree/master/ceilometer/tests/integration
>
> --
> Chris Dent ¯\_(ツ)_/¯   https://anticdent.org/
> freenode: cdent tw: @anticdent
