Re: [openstack-dev] [Browbeat] proposing agopi as core

2018-07-17 Thread Joe Talerico
agopi**

On Tue, Jul 17, 2018 at 1:33 PM, Joe Talerico  wrote:

> Proposing
> ​agopi
>  as core for OpenStack Browbeat. He has been instrumental in taking over
> the CI components of Browbeat. His contributions and reviews reflect that!
>
> Thanks!
> Joe
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Browbeat] proposing agpoi as core

2018-07-17 Thread Joe Talerico
Proposing agpoi as core for OpenStack Browbeat. He has been instrumental in
taking over the CI components of Browbeat. His contributions and reviews
reflect that!

Thanks!
Joe
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-20 Thread Joe Talerico
On Wed, Dec 20, 2017 at 9:08 AM, Ben Nemec <openst...@nemebean.com> wrote:
>
>
> On 12/19/2017 05:34 PM, Joe Talerico wrote:
>>
>> On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins <der...@redhat.com> wrote:
>>>
>>>
>>>
>>> On 19 December 2017 at 22:23, Brian Haley <haleyb@gmail.com> wrote:
>>>>
>>>>
>>>> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>>>>>
>>>>>>
>>>>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>>>>>
>>>>>>>
>>>>>>> The reboot is done (mostly...see below).
>>>>>>>
>>>>>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Ben - Can you provide some links to the ovs port exhaustion issue
>>>>>>>> for
>>>>>>>> some background?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I don't know if we ever had a bug opened, but there's some discussion
>>>>>>> of it in
>>>>>>>
>>>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>>>> I've also copied Derek since I believe he was the one who found it
>>>>>>> originally.
>>>>>>>
>>>>>>> The gist is that after about 3 months of tripleo-ci running in this
>>>>>>> cloud we start to hit errors creating instances because of problems
>>>>>>> creating
>>>>>>> OVS ports on the compute nodes.  Sometimes we see a huge number of
>>>>>>> ports in
>>>>>>> general, other times we see a lot of ports that look like this:
>>>>>>>
>>>>>>> Port "qvod2cade14-7c"
>>>>>>>   tag: 4095
>>>>>>>   Interface "qvod2cade14-7c"
>>>>>>>
>>>>>>> Notably they all have a tag of 4095, which seems suspicious to me.  I
>>>>>>> don't know whether it's actually an issue though.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the
>>>>>> agent.
>>>>>>
>>>>>> The 'qvo' here shows it's part of the VETH pair that os-vif created
>>>>>> when
>>>>>> it plugged in the VM (the other half is 'qvb'), and they're created so
>>>>>> that
>>>>>> iptables rules can be applied by neutron.  It's part of the "old" way
>>>>>> to do
>>>>>> security groups with the OVSHybridIptablesFirewallDriver, and can
>>>>>> eventually
>>>>>> go away once the OVSFirewallDriver can be used everywhere (requires
>>>>>> newer
>>>>>> OVS and agent).
>>>>>>
>>>>>> I wonder if you can run the ovs_cleanup utility to clean some of these
>>>>>> up?
>>>>>
>>>>>
>>>>>
>>>>> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
>>>>> any ports that are still in use?  Or is there a different tool I'm not
>>>>> aware
>>>>> of that can do more targeted cleanup?
>>>>
>>>>
>>>>
>>>> Crap, I thought there was an option to just cleanup these dead devices,
>>>> I
>>>> should have read the code, it's either neutron ports (default) or all
>>>> ports.
>>>> Maybe that should be an option.
>>>
>>>
>>>
>>> iirc neutron-ovs-cleanup was being run following the reboot as part of an
>>> ExecStartPre= on one of the neutron services; this is what essentially
>>> removed the ports for us.
>>>
>>>
>>
>> There are actually unit files for cleanup (netns|ovs|lb), specifically
>> for ovs-cleanup [1].
>>
>> Maybe this can be run to mitigate the need for a reboot?
>
>
> That's what Brian suggested too, but running it with instances on the node
> will cause an outage because it cleans up everything, including in-use
> ports.  The reason a reboot wor

Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-19 Thread Joe Talerico
On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins <der...@redhat.com> wrote:
>
>
> On 19 December 2017 at 22:23, Brian Haley <haleyb@gmail.com> wrote:
>>
>> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>>>
>>>
>>>
>>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>>>
>>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>>>
>>>>> The reboot is done (mostly...see below).
>>>>>
>>>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>>>>>
>>>>>> Ben - Can you provide some links to the ovs port exhaustion issue for
>>>>>> some background?
>>>>>
>>>>>
>>>>> I don't know if we ever had a bug opened, but there's some discussion
>>>>> of it in
>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>> I've also copied Derek since I believe he was the one who found it
>>>>> originally.
>>>>>
>>>>> The gist is that after about 3 months of tripleo-ci running in this
>>>>> cloud we start to hit errors creating instances because of problems 
>>>>> creating
>>>>> OVS ports on the compute nodes.  Sometimes we see a huge number of ports 
>>>>> in
>>>>> general, other times we see a lot of ports that look like this:
>>>>>
>>>>> Port "qvod2cade14-7c"
>>>>>  tag: 4095
>>>>>  Interface "qvod2cade14-7c"
>>>>>
>>>>> Notably they all have a tag of 4095, which seems suspicious to me.  I
>>>>> don't know whether it's actually an issue though.
>>>>
>>>>
>>>> Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.
>>>>
>>>> The 'qvo' here shows it's part of the VETH pair that os-vif created when
>>>> it plugged in the VM (the other half is 'qvb'), and they're created so that
>>>> iptables rules can be applied by neutron.  It's part of the "old" way to do
>>>> security groups with the OVSHybridIptablesFirewallDriver, and can 
>>>> eventually
>>>> go away once the OVSFirewallDriver can be used everywhere (requires newer
>>>> OVS and agent).
>>>>
>>>> I wonder if you can run the ovs_cleanup utility to clean some of these
>>>> up?
>>>
>>>
>>> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
>>> any ports that are still in use?  Or is there a different tool I'm not aware
>>> of that can do more targeted cleanup?
>>
>>
>> Crap, I thought there was an option to just cleanup these dead devices, I
>> should have read the code, it's either neutron ports (default) or all ports.
>> Maybe that should be an option.
>
>
> iirc neutron-ovs-cleanup was being run following the reboot as part of an
> ExecStartPre= on one of the neutron services; this is what essentially
> removed the ports for us.
>
>

There are actually unit files for cleanup (netns|ovs|lb), specifically
for ovs-cleanup [1].

Maybe this can be run to mitigate the need for a reboot?

[1]
[Unit]
Description=OpenStack Neutron Open vSwitch Cleanup Utility
After=syslog.target network.target openvswitch.service
Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service
neutron-l3-agent.service openstack-nova-compute.service

[Service]
Type=oneshot
User=neutron
ExecStart=/usr/bin/neutron-ovs-cleanup --config-file
/usr/share/neutron/neutron-dist.conf --config-file
/etc/neutron/neutron.conf --config-file
/etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
/etc/neutron/conf.d/common --config-dir
/etc/neutron/conf.d/neutron-ovs-cleanup --log-file
/var/log/neutron/ovs-cleanup.log
ExecStop=/usr/bin/neutron-ovs-cleanup --config-file
/usr/share/neutron/neutron-dist.conf --config-file
/etc/neutron/neutron.conf --config-file
/etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
/etc/neutron/conf.d/common --config-dir
/etc/neutron/conf.d/neutron-ovs-cleanup --log-file
/var/log/neutron/ovs-cleanup.log
PrivateTmp=true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
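
A more targeted alternative to the full neutron-ovs-cleanup would be to
remove only the dead ports. A hedged sketch, not tested here: it assumes
the tag-4095 ports are exactly the ones safe to drop, so it should be
sanity-checked against in-use ports first.

# List the names of OVS ports still carrying the "dead" VLAN tag 4095,
# then delete each one; ports with real tags are left alone.
for port in $(ovs-vsctl --format=csv --no-headings --columns=name find Port tag=4095 | tr -d '"'); do
    ovs-vsctl del-port "$port"
done
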
>>
>>
>>
>> -Brian
>>
>>
>>> Oh, also worth noting that I don't think we have os-vif in this cloud
>>> because it's so old.  There's no os-vif package installed anyway.
>>>
>>>>
>>>> -Brian
>>>>
>>>>> I've had some offline discussions about getting someone on this cloud
>>>>> to debug the probl

Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-18 Thread Joe Talerico
Ben - Can you provide some links to the ovs port exhaustion issue for
some background?

Thanks,
Joe

On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec  wrote:
> Hi,
>
> It's that magical time again.  You know the one, when we reboot rh1 to avoid
> OVS port exhaustion. :-)
>
> If all goes well you won't even notice that this is happening, but there is
> the possibility that a few jobs will fail while the te-broker host is
> rebooted so I wanted to let everyone know.  If you notice anything else
> hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know.  I have
> been known to forget to restart services after the reboot.
>
> I'll send a followup when I'm done.
>
> -Ben
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] Looking for help in properly configuring a TripleO environment

2017-10-09 Thread Joe Talerico
If you ssh to your compute node can you ping 172.16.0.14?

On Mon, Oct 9, 2017 at 11:54 AM, Mark Hamzy  wrote:
> I am looking for help in properly configuring a TripleO environment on a
> machine with two network cards talking to two baremetal nodes in the
> overcloud also with two network cards.  One network will be for provisioning
> and one will be for internet connection.  I have documented my current
> configuration at:
>
> https://fedoraproject.org/wiki/User:Hamzy/TripleO_mixed_undercloud_overcloud_try8
>
>
> 2017-09-23 23:54:49Z [overcloud.ControllerAllNodesValidationDeployment.0]:
> CREATE_COMPLETE  state changed
> 2017-09-23 23:54:49Z [overcloud.ControllerAllNodesValidationDeployment]:
> CREATE_COMPLETE  Stack CREATE completed successfully
> 2017-09-23 23:54:49Z [overcloud.ControllerAllNodesValidationDeployment]:
> CREATE_COMPLETE  state changed
> 2017-09-24 00:05:06Z [overcloud.ComputeAllNodesValidationDeployment.0]:
> SIGNAL_IN_PROGRESS  Signal: deployment d54b96a6-1860-4802-ad45-db4ece0317e4 failed (1)
> 2017-09-24 00:05:06Z [overcloud.ComputeAllNodesValidationDeployment.0]:
> CREATE_FAILED  Error: resources[0]: Deployment to server failed:
> deploy_status_code : Deployment exited with non-zero status code: 1
> 2017-09-24 00:05:06Z [overcloud.ComputeAllNodesValidationDeployment]:
> CREATE_FAILED  Resource CREATE failed: Error: resources[0]: Deployment to
> server failed: deploy_status_code : Deployment exited with non-zero status
> code: 1
> 2017-09-24 00:05:07Z [overcloud.ComputeAllNodesValidationDeployment]:
> CREATE_FAILED  Error:
> resources.ComputeAllNodesValidationDeployment.resources[0]: Deployment to
> server failed: deploy_status_code: Deployment exited with non-zero status
> code: 1
> 2017-09-24 00:05:07Z [overcloud]: CREATE_FAILED  Resource CREATE failed:
> Error: resources.ComputeAllNodesValidationDeployment.resources[0]:
> Deployment to server failed: deploy_status_code: Deployment exited with
> non-zero status code: 1
>
>  Stack overcloud CREATE_FAILED
>
> overcloud.ComputeAllNodesValidationDeployment.0:
>   resource_type: OS::Heat::StructuredDeployment
>   physical_resource_id: d54b96a6-1860-4802-ad45-db4ece0317e4
>   status: CREATE_FAILED
>   status_reason: |
> Error: resources[0]: Deployment to server failed: deploy_status_code :
> Deployment exited with non-zero status code: 1
>   deploy_stdout: |
> ...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> Ping to 172.16.0.14 failed. Retrying...
> FAILURE
> (truncated, view all with --long)
>   deploy_stderr: |
> 172.16.0.14 is not pingable. Local Network: 172.16.0.0/24
>
> Buried somewhat in that URL was the following way to view my templates:
>
> cp -r /usr/share/openstack-tripleo-heat-templates templates
> (cd templates/; wget --quiet -O -
> https://hamzy.fedorapeople.org/openstack-tripleo-heat-templates.patch| patch
> -p1)
>
> The controller and compute do install and get IP address from DHCP:
>
> [hamzy@overcloud-controller-0 ~]$ ip a
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: enP1p3s0f0:  mtu 1500 qdisc mq portid
> 98be9454b240 state DOWN qlen 1000
> link/ether 98:be:94:54:b2:40 brd ff:ff:ff:ff:ff:ff
> 3: enP1p3s0f1:  mtu 1500 qdisc mq portid
> 98be9454b242 state DOWN qlen 1000
> link/ether 98:be:94:54:b2:42 brd ff:ff:ff:ff:ff:ff
> 4: enP1p5s0f0:  mtu 1500 qdisc mq portid
> 98be94546360 state DOWN qlen 1000
> link/ether 98:be:94:54:63:60 brd ff:ff:ff:ff:ff:ff
> 5: enP1p5s0f1:  mtu 1500 qdisc mq portid
> 98be94546362 state DOWN qlen 1000
> link/ether 98:be:94:54:63:62 brd ff:ff:ff:ff:ff:ff
> 6: enP3p11s0f0:  mtu 1500 qdisc mq portid
> 40f2e9316940 state DOWN qlen 1000
> link/ether 40:f2:e9:31:69:40 brd ff:ff:ff:ff:ff:ff
> 7: enP3p11s0f1:  mtu 1500 qdisc mq portid
> 40f2e9316942 state DOWN qlen 1000
> link/ether 40:f2:e9:31:69:42 brd ff:ff:ff:ff:ff:ff
> 8: enP6p1s0f0:  mtu 1500 qdisc mq portid
> 98be94541f80 state DOWN qlen 1000
> link/ether 98:be:94:54:1f:80 brd ff:ff:ff:ff:ff:ff
> 9: enP6p1s0f1:  mtu 1500 qdisc mq portid
> 98be94541f82 state DOWN qlen 1000
> link/ether 98:be:94:54:1f:82 brd 

Re: [openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata

2017-06-13 Thread Joe Talerico
On Fri, Jun 9, 2017 at 7:28 AM, Justin Kilpatrick <jkilp...@redhat.com> wrote:
> On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <dtant...@redhat.com> wrote:
>> This number of "300", does it come from your testing or from other sources?
>> If the former, which driver were you using? What exactly problems have you
>> seen approaching this number?
>
> I haven't encountered this issue personally, but talking to Joe
> Talerico and some operators at summit, around this number a single
> conductor begins to fall behind polling all of the out-of-band
> interfaces for the machines it's responsible for. You start to
> see what you would expect from polling running behind, like incorrect
> power states listed for machines and a general inability to perform
> machine operations in a timely manner.
>
> Having spent some time at the Ironic operators forum, this is pretty
> normal and the correct response is just to scale out conductors; this
> is a problem with TripleO because we don't really have a scale-out
> option with a single-machine design. Fortunately just increasing the
> time between interface polling acts as a pretty good stopgap for this
> and lets Ironic catch up.
>
> I may get some time on a cloud of that scale in the future, at which
> point I will have hard numbers to give you. One of the reasons I made
> YODA was the frustrating prevalence of anecdotes instead of hard data
> when it came to one of the most important parts of the user
> experience. If it doesn't deploy people don't use it, full stop.
>
>> Could you please elaborate? (a bug could also help). What exactly were you
>> doing?
>
> https://bugs.launchpad.net/ironic/+bug/1680725

Additionally, I would like to see more verbose output from the
cleaning : https://bugs.launchpad.net/ironic/+bug/1670893

>
> Describes exactly what I'm experiencing. Essentially the problem is
> that nodes can and do fail to pxe, then cleaning fails and you just
> lose the nodes. Users have to spend time going back and babysitting
> these nodes and there are no good instructions on what to do with failed
> nodes anyways. The answer is move them to manageable and then to
> available at which point they go back into cleaning until it finally
> works.
>
> Like introspection was a year ago this is a cavalcade of documentation
> problems and software issues. I mean really everything *works*
> technically but the documentation acts like cleaning will work all the
> time and so does the software, leaving the user to figure out how to
> accommodate the realities of the situation without so much as a
> warning that it might happen.
>
> This comes out as more of a ux issue than a software one, but we can't
> just ignore these.
>
> - Justin
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata

2017-06-13 Thread Joe Talerico
On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur  wrote:
> On 06/08/2017 02:21 PM, Justin Kilpatrick wrote:
>>
>> Morning everyone,
>>
>> I've been working on a performance testing tool for TripleO hardware
>> provisioning operations off and on for about a year now and I've been
>> using it to try and collect more detailed data about how TripleO
>> performs in scale and production use cases. Perhaps more importantly
>> YODA (Yet Openstack Deployment Tool, Another) automates the task
>> enough that days of deployment testing is a set it and forget it
>> operation.
>> You can find my testing tool here [0] and the test report [1] has
>> links to raw data and visualization. Just scroll down, click the
>> captcha and click "go to kibana". I still need to port that machine
>> from my own solution over to search guard.
>>
>> If you have too much email to consider clicking links I'll copy the
>> results summary here.
>>
>> TripleO inspection workflows have seen massive improvements from
>> Newton with a failure rate for 50 nodes with the default workflow
>> falling from 100% to <15%. Using patches slated for Pike that spurious
>> failure rate reaches zero.
>
>
> \o/
>
>>
>> Overcloud deployments show a significant improvement of deployment
>> speed in HA and stack update tests.
>>
>> Ironic deployments in the overcloud allow the use of Ironic for bare
>> metal scale out alongside more traditional VM compute. Considering a
>> single conductor starts to struggle around 300 nodes it will be
>> difficult to push a multi conductor setup to its limits.
>
>
> This number of "300", does it come from your testing or from other sources?

Dmitry - The "300" comes from my testing on different environments.

Most recently, here is what I saw at CNCF -
https://snapshot.raintank.io/dashboard/snapshot/Sp2wuk2M5adTpqfXMJenMXcSlCav2PiZ

The undercloud was "idle" during this period.

> If the former, which driver were you using?

pxe_ipmitool.

> What exactly problems have you seen approaching this number?

I would have to restart ironic-conductor before every scale-up; here is
what ironic-conductor looks like after a restart:
https://snapshot.raintank.io/dashboard/snapshot/Im3AxP6qUfMnTeB97kryUcQV6otY0bHP
Without restarting ironic, the scale-up would fail due to ironic (I do
not have the exact error we would encounter documented).
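
Justin's polling-interval stopgap from earlier in this thread would look
roughly like this in the undercloud's ironic.conf - the option is the
standard conductor sync setting, but the value here is only illustrative:

[conductor]
# Seconds between power-state sync loops; raising it gives a single
# conductor more headroom while it polls a few hundred BMCs.
sync_power_state_interval = 120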

>
>>
>> Finally Ironic node cleaning, shows a similar failure rate to
>> inspection and will require similar attention in TripleO workflows to
>> become painless.
>
>
> Could you please elaborate? (a bug could also help). What exactly were you
> doing?
>
>>
>> [0] https://review.openstack.org/#/c/384530/
>> [1]
>> https://docs.google.com/document/d/194ww0Pi2J-dRG3-X75mphzwUZVPC2S1Gsy1V0K0PqBo/
>>
>> Thanks for your time!
>
>
> Thanks for YOUR time, this work is extremely valuable!
>
>
>>
>> - Justin
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] A proposal for hackathon to reduce deploy time of TripleO

2017-05-23 Thread Joe Talerico
On Tue, May 23, 2017 at 6:47 AM, Sagi Shnaidman  wrote:
> Hi, all
>
> I'd like to propose an idea to make one or two days hackathon in TripleO
> project with main goal - to reduce deployment time of TripleO.
>
> - How could it be arranged?
>
> We can arrange a separate IRC channel and Bluejeans video conference session
> for hackathon in these days to create a "presence" feeling.
>
> - How to participate and contribute?
>
> We'll have a few responsibility fields like tripleo-quickstart, containers,
> storage, HA, baremetal, etc - the exact list should be ready before the
> hackathon so that everybody could assign to one of these "teams". It's good
> to have somebody in team to be stakeholder and responsible for organization
> and tasks.
>
> - What is the goal?
>
> The goal of this hackathon to reduce deployment time of TripleO as much as
> possible.
>
> For example part of CI team takes a task to reduce quickstart tasks time. It
> includes statistics collection, profiling and detection of places to
> optimize. After this tasks are created, patches are tested and submitted.
>
> The prizes will be presented to teams which saved most of time :)
>
> What do you think?

Sounds like a great idea! Looking forward to contributing! Let's go
ahead and add this one:
https://bugs.launchpad.net/tripleo/+bug/1671859

;-)

>
> Thanks
> --
> Best regards
> Sagi Shnaidman
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo][logging] oslo.log fluentd native logging

2017-05-15 Thread Joe Talerico
On Wed, May 10, 2017 at 5:41 PM, Dan Prince <dpri...@redhat.com> wrote:
> On Mon, 2017-04-24 at 07:47 -0400, Joe Talerico wrote:
>> Hey owls - I have been playing with oslo.log fluentd integration [1]
>> in a poc commit here [2]. Enabling the native service logging is
>> nice, and tracebacks no longer turn into multiple inserts into
>> elastic - there is a "traceback" key which would contain the
>> traceback if there was one.
>>
>> The system-level / kernel level logging is still needed with the
>> fluent client on each Overcloud node.
>>
>> I see Martin did the initial work [3] to integrate fluentd, is there
>> anyone looking at migrating the OpenStack services to using the
>> oslo.log facility?
>
> Nobody officially implementing this yet that I know of. But it does
> look promising.
>
> The idea of using oslo.logs fluentd formatter could dovetail very
> nicely into our new containers (docker) servers for Pike in that it
> would allow us to log to stdout directly within the container... but
> still support the Fluentd logging interfaces that we have today.

Right, I think we give the user the option of oslo.log fluentd for
OpenStack services. We will still need fluentd to send the other noise
-- kernel/rabbit/etc.

>
> The only downside would be that not all services in OpenStack support
> olso.log (I don't think Swift does for example). Nor do some of the
> core services we deploy like Galera and RabbitMQ. So we'd have a mixed
> bag of host and stdout logging perhaps for some things or would need to
> integrate with Fluentd differently for services without oslo.log
> support.

Yeah, this is the downside...

>
> Our current approach to containers logging in TripleO recently landed
> here and exposed the logs to a directory on the host specifically so
> that we could aim to support Fluentd integrations:
>
> https://review.openstack.org/#/c/442603/
>
> Perhaps we should revisit this in the (near) future to improve our
> containers deployments.
>
> Dan

I think the oslo.log fluentd work shouldn't be much to integrate, which
could give the container work something to play with sooner rather than
later.

Who from the ops-tools side could I work with on this -- or maybe
people don't see this as a high enough priority?

Joe

>
>>
>> Joe
>>
>> [1] https://github.com/openstack/oslo.log/blob/master/oslo_log/formatters.py#L167
>> [2] https://review.openstack.org/#/c/456760/
>> [3] https://specs.openstack.org/openstack/tripleo-specs/specs/newton/tripleo-opstools-centralized-logging.html
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [tripleo][logging] oslo.log fluentd native logging

2017-04-24 Thread Joe Talerico
Hey owls - I have been playing with oslo.log fluentd integration[1] in
a poc commit here [2]. Enabling the native service logging is nice, and
tracebacks no longer turn into multiple inserts into elastic - there is
a "traceback" key which would contain the traceback if there was one.

The system-level / kernel level logging is still needed with the
fluent client on each Overcloud node.

I see Martin did the initial work [3] to integrate fluentd, is there
anyone looking at migrating the OpenStack services to using the
oslo.log facility?

Joe

[1] 
https://github.com/openstack/oslo.log/blob/master/oslo_log/formatters.py#L167
[2] https://review.openstack.org/#/c/456760/
[3] 
https://specs.openstack.org/openstack/tripleo-specs/specs/newton/tripleo-opstools-centralized-logging.html
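
For anyone who wants to try this outside of [2], a minimal sketch of the
wiring involved, assuming the fluent-logger handler is installed; the
tag, host, port and file path below are illustrative, not taken from the
poc review. The file would be pointed at from the service's config via
log_config_append.

# /etc/nova/fluent-logging.conf (illustrative path)
[loggers]
keys = root

[handlers]
keys = fluent

[formatters]
keys = fluent

[logger_root]
level = INFO
handlers = fluent

[handler_fluent]
class = fluent.handler.FluentHandler
# tag, then the local fluentd forwarder's host and port (assumptions)
args = ('openstack.nova', 'localhost', 24224)
formatter = fluent

[formatter_fluent]
class = oslo_log.formatters.FluentFormatter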

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] pingtest vs tempest

2017-04-05 Thread Joe Talerico
On Wed, Apr 5, 2017 at 4:49 PM, Emilien Macchi  wrote:
> Greetings dear owls,
>
> I would like to bring back an old topic: running tempest in the gate.
>
> == Context
>
> Right now, TripleO gate is running something called pingtest to
> validate that the OpenStack cloud is working. It's an Heat stack, that
> deploys a Nova server, some volumes, a glance image, a neutron network
> and sometimes a little bit more.
> To deploy the pingtest, you obviously need Heat deployed in your overcloud.
>
> == Problems:
>
> Although pingtest has been very helpful over the last years:
> - easy to understand, it's an Heat template, like an OpenStack user
> would do to deploy their apps.
> - fast: the stack takes a few minutes to be created and validated
>
> It has some limitations:
> - Limitation to what Heat resources support (example: some OpenStack
> resources can't be managed from Heat)
> - Impossible to run a dynamic workflow (test a live migration for example)
>
> == Solutions
>
> 1) Switch pingtest to Tempest run on some specific tests, with feature
> parity of what we had with pingtest.
> For example, we could imagine to run the scenarios that deploys VM and
> boot from volume. It would test the same thing as pingtest (details
> can be discussed here).
> Each scenario would run more tests depending on the service that they
> run (scenario001 is telemetry, so it would run some tempest tests for
> Ceilometer, Aodh, Gnocchi, etc).
> We should work at making the tempest run as short as possible, and the
> close as possible from what we have with a pingtest.
>
> 2) Run custom scripts in TripleO CI tooling, called from the pingtest
> (heat template), that would run some validations commands (API calls,
> etc).
> It has been investigated in the past but never implemented AFIK.
>
> 3) ?

Browbeat isn't a "validation" tool like Tempest; however, we have
Browbeat integrated in some CI systems already. We could create a very
targeted test to sniff out any issues with the cloud.

>
> I tried to make this text short and go straight to the point, please
> bring feedback now. I hope we can make progress on $topic during Pike,
> so we can increase our testing coverage and detect deployment issues
> sooner.
>
> Thanks,
> --
> Emilien Macchi
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo][heat] Heat memory usage in the TripleO gate: Ocata edition

2017-03-16 Thread Joe Talerico
On Wed, Mar 15, 2017 at 5:53 PM, Zane Bitter <zbit...@redhat.com> wrote:
> On 15/03/17 15:52, Joe Talerico wrote:
>>
>> Can we start looking at CPU usage as well? Not sure if your data has
>> this as well...
>
>
> Usage by Heat specifically? Or just in general?

heat-engine specifically.

>
> We're limited by what is logged in the gate, so CPU usage by Heat is
> definitely a non-starter. Picking a random gate run, the Heat memory use
> comes from this file:
>
> http://logs.openstack.org/27/445627/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/9232979/logs/ps.txt.gz
>
> which is generated by running `ps` at the end of the test.
>
> We also have this file (including historical data) from dstat:
>
> http://logs.openstack.org/27/445627/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/9232979/logs/dstat.txt.gz
>
> so there is _some_ data there, it's mostly a question of how to process it
> down to something we can plot against time. My first guess would be to do a
> box-and-whisker-style plot showing the distribution of the 1m load average
> during the test. (CPU usage itself is generally a pretty bad measure of...
> CPU usage.) What problems are you hoping to catch?

Just curiosity.

We have a set of tools which capture per-process utilization for
cpu/mem/disk/etc. I wonder if we could integrate this into your work?

Joe

>
> cheers,
> Zane.
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo][heat] Heat memory usage in the TripleO gate: Ocata edition

2017-03-15 Thread Joe Talerico
On Tue, Mar 14, 2017 at 4:06 PM, Zane Bitter  wrote:
> Following up on the previous thread:
>
> http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html
>
> Here is the latest data, which includes the Ocata release:
>
> https://fedorapeople.org/~zaneb/tripleo-memory/20170314/heat_memused.png
>
> As you can see, there has been one jump in memory usage. This was due to the
> TripleO patch https://review.openstack.org/#/c/425717/
>
> Unlike previous increases in memory usage, I was able to warn of this one in
> the advance, and it was deemed an acceptable trade-off. The reasons for the
> increase are unknown - the addition of more stuff to the endpoint map seemed
> like a good bet, but one attempt to mitigate that[1] had no effect and I'm
> increasingly unconvinced that this could account for the magnitude of the
> increase.
>
> In any event, memory usage remains around the 1GiB level, none of the other
> complexity increases during Ocata have had any discernible effect, and Heat
> has had no memory usage regressions.
>
> Stay tuned for the next exciting edition, in which I try to figure out how
> to do more than 3 colors on the plot.

Nice work Zane! Thanks for this!

Can we start looking at CPU usage as well? Not sure if your data has
this as well...

Thanks!
Joe

>
> cheers,
> Zane.
>
>
> [1] https://review.openstack.org/#/c/427836/
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Performance] PTG?

2017-01-07 Thread Joe Talerico
Hey Andrey - Is there a shared etherpad for the Rally/Performance days?

Thanks,
Joe

On Wed, Jan 4, 2017 at 11:01 AM, Andrey Kurilin <akuri...@mirantis.com> wrote:
> Hi, Joe!
>
> It is not a mistake. After a talk with Dina B., we decided to extend Rally
> session for the wider
> audience and I requested "Rally & Performance team" session.
>
> On Wed, Jan 4, 2017 at 5:29 PM, Joe Talerico <jtale...@redhat.com> wrote:
>>
>> When I signed up to attend the PTG, Performance was not listed as an
>> option; however, on the website it clearly shows Performance is
>> Monday-Tuesday.
>>
>> Is this just a mistake in the event website?
>>
>> Joe
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>
>
> --
> Best regards,
> Andrey Kurilin.
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Performance] PTG?

2017-01-04 Thread Joe Talerico
When I signed up to attend the PTG, Performance was not listed as an
option; however, on the website it clearly shows Performance is
Monday-Tuesday.

Is this just a mistake in the event website?

Joe

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Browbeat] Nominate Justin Kilpatrick as core.

2016-11-15 Thread Joe Talerico
Alright, I have one +1... Anyone else (I think Alex +1'd on IRC)?

Since I nominated him obviously I +1!

Joe

On Tue, Nov 8, 2016 at 9:24 AM, Sindhur Malleni <small...@redhat.com> wrote:
> +1. I think Justin's done some awesome work and would be a great addition to
> the core team. Better get reviewing some patches Justin! ;)
>
> On Fri, Oct 28, 2016 at 10:28 AM, Joe Talerico <jtale...@redhat.com> wrote:
>>
>> Justin has been doing a great deal of work on the Browbeat-CI and
>> stabilizing our code.
>>
>> I would like to nominate Justin as our first core who didn't begin as
>> a core to the project!
>>
>> Joe
>
>
>
>
> --
> Sai Sindhur Malleni
> Software Engineer
> Red Hat Inc.
> 100 East Davie Street
> Raleigh, NC, USA
> Work: (919) 754-4557 | Cell: (919) 985-1055

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo][ironic] introspection and CI

2016-10-18 Thread Joe Talerico
With large sets of nodes to introspect we typically avoid using the
bulk introspection. I have written a quick script that introspects a
couple nodes at a time:
https://gist.github.com/jtaleric/fcca3811cd4d8f37336f9532e5b9c9ff

Maybe we can add this sort of logic to bulk introspection, with some retries?
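
The shape of it, as a hedged sketch rather than the gist itself (the
commands are the standard ironic/inspector CLI ones; the batch size and
poll interval are arbitrary):

#!/bin/bash
# Introspect manageable nodes a couple at a time instead of in one bulk call.
BATCH=2
mapfile -t NODES < <(openstack baremetal node list --provision-state manageable -f value -c UUID)

for ((i = 0; i < ${#NODES[@]}; i += BATCH)); do
    batch=("${NODES[@]:i:BATCH}")
    for uuid in "${batch[@]}"; do
        openstack baremetal introspection start "$uuid"
    done
    # Wait for the whole batch to finish before kicking off the next one.
    for uuid in "${batch[@]}"; do
        until [ "$(openstack baremetal introspection status "$uuid" -f value -c finished)" = "True" ]; do
            sleep 30
        done
    done
done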

On Tue, Oct 18, 2016 at 8:29 AM, John Trowbridge  wrote:
>
>
> On 10/18/2016 07:20 AM, Wesley Hayutin wrote:
>> See my response inline.
>>
>> On Tue, Oct 18, 2016 at 6:07 AM, Dmitry Tantsur  wrote:
>>
>>> On 10/17/2016 11:10 PM, Wesley Hayutin wrote:
>>>
 Greetings,

 The RDO CI team is considering adding retries to our calls to
 introspection again [1].
 This is very handy for bare metal environments where retries may be
 needed due to random chaos in the environment itself.

 We're trying to balance two things here..
 1. reduce the number of false negatives in CI
 2. try not to overstep what CI should vs. what the product should do.

 We would like to hear your comments if you think this is acceptable
 for CI or if this may be overstepping.

 Thank you


 [1] http://paste.openstack.org/show/586035/

>>>
>>> Hi!
>>>
>>> I probably lack some context of what exactly problems you face. I don't
>>> have any disagreement with retrying it, just want to make sure we're not
>>> missing actual bugs.
>>>
>>
>> I agree, we have to be careful not to paper over bugs while we try to
>> overcome typical environmental delays that come w/ booting, rebooting $x
>> number of random hardware nodes.
>> To make this a little more crystal clear, I'm trying to determine where
>> progressive delays and retries should be injected into the workflow of
>> deploying an overcloud.
>> Should we add options in the product itself that allow for $x number of
>> retries w/ a configurable set of delays for introspection? [2]  Is the
>> expectation this works the first time, every time?
>> Are we overstepping what CI should do by implementing [1].
>
> IMO, yes, we are overstepping what CI should be doing with [1]. Mostly
> because we are providing a better UX in CI than an actual user will get.
>>
>> Additionally would it be appropriate to implement [1], while [2] is
>> developed for the next release and is it OK to use [1] with older releases?
>>
>
> However, I think it is ok to implement [1] in CI, if the following are true:
>
> 1) There is an in progress bug to make this UX better for non-CI user.
> 2) For older releases if said bug is deemed inappropriate for backport.
>
>> Thanks for your time and responses.
>>
>>
>> [1] http://paste.openstack.org/show/586035/
>> [2]
>> https://github.com/openstack/tripleo-common/blob/master/workbooks/baremetal.yaml#L169
>>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] Suggestions for OOO

2016-10-11 Thread Joe Talerico
On Tue, Oct 11, 2016 at 6:38 AM, Dmitry Tantsur <dtant...@redhat.com> wrote:
> On 10/11/2016 02:39 AM, Joe Talerico wrote:
>>
>> Hey all,
>> The past couple of days I have been making comments on IRC to discuss some
>> of the issues I have bumped into when scaling Newton to > 30 compute
>> nodes.
>>
>> - `bulk import`, the operation to go from enroll -> manage can take
>> 20-30 minutes to complete. Can we have this be a non-blocking
>> operation with a message to the user that they cannot continue until
>> the nodes they want to deploy on go from enroll->manage?
>
>
> The only thing that enroll->manage does is to check the power credentials.
> It should never take more than 30-60 seconds (and even this is too much, and
> might be a sign of problems with the environment). I suspect that the
> workflow processes nodes sequentially, though, hence these 30-60 seconds
> multiply by the number of nodes. If so, the workflow definitely needs
> fixing.

Yeah, it seems to be sequential, and I did have 2 nodes that failed
to go from enroll->manage, which could slow things down even more.
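
Driving it in parallel from the CLI would be something like this (a
hedged sketch; each node's power-credential check then runs concurrently
instead of one after another):

# Move every enrolled node to manageable in parallel.
for uuid in $(openstack baremetal node list --provision-state enroll -f value -c UUID); do
    openstack baremetal node manage "$uuid" &
done
wait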

>
>> - overcloud deploy - when pxe completes I have seen a hand-full of
>> nodes not reboot, or just get jammed up in the pxe screen. When this
>> occurs I run:
>> $ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F=
>> '{print $2}'`; do if [[ $(ping -c 1 $i | grep "100%") ]]; then ironic
>> node-set-power-state $(ironic node-list | grep $(nova list | grep $i |
>> awk '{print $2}') | awk '{print $2}') off ; fi; done
>> # (192 is the first octet)
>> - Then -
>> $ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F=
>> '{print $2}'`; do if [[ $(ping -c 1 $i | grep "100%") ]]; then ironic
>> node-set-power-state $(ironic node-list | grep $(nova list | grep $i |
>> awk '{print $2}') | awk '{print $2}') on ; fi; done
>>
>> This typically fixes the deployment so things can continue; however, it
>> would be great to have this type of logic added to OOO: if a node goes
>> from BUILD->ACTIVE but isn't reachable in 120 seconds, ironic simply
>> reboots the host.
>
>
> Unfortunately, it's hard to define "reachable". Also 120 seconds is way too
> little for some servers, it can well take them 5 minutes to boot.

Sure, 120 was just a shot in the dark to start the conversation, we
need to establish some sort of timeout.

>
> I would rather figure out why PXE gets stuck on your environment. Maybe you
> need a firmware update.

The issue is that things are inconsistent and across multiple
platforms. I have seen this on Dell, HP and Supermicro -- and while
one deployment fails, if I re-try the deployment it works.

>
>>
>> Also, I suggest if the second attempt fails, reschedule the host --
>> sometimes I have seen where a raid controller or something goes bad
>> out of our control.
>
>
> We do have reschedule in place, but I suspect the current Ironic timeout (1
> hour?) is too large for Nova.

Possibly? I didn't think it would reschedule if the node goes from
build->active. For example: the user is going through an install, the
PXE went through, the node reboots, and now the raid battery is dead or
something with the raid controller went fubar; the user is afk, doesn't
see this, and the deployment fails.

I am not suggesting we handle every corner case, but the situations
above happen too often to ignore.


>
>>
>> Thanks for listening!
>> rook
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [tripleo] Suggestions for OOO

2016-10-10 Thread Joe Talerico
Hey all,
The past couple of days I have been making comments on IRC to discuss some
of the issues I have bumped into when scaling Newton to > 30 compute
nodes.

- `bulk import`, the operation to go from enroll -> manage can take
20-30 minutes to complete. Can we have this be a non-blocking
operation with a message to the user that they cannot continue until
the nodes they want to deploy on go from enroll->manage?
- overcloud deploy - when pxe completes I have seen a hand-full of
nodes not reboot, or just get jammed up in the pxe screen. When this
occurs I run:
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F=
'{print $2}'`; do if [[ $(ping -c 1 $i | grep "100%") ]]; then ironic
node-set-power-state $(ironic node-list | grep $(nova list | grep $i |
awk '{print $2}') | awk '{print $2}') off ; fi; done
# (192 is the first octet)
- Then -
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F=
'{print $2}'`; do if [[ $(ping -c 1 $i | grep "100%") ]]; then ironic
node-set-power-state $(ironic node-list | grep $(nova list | grep $i |
awk '{print $2}') | awk '{print $2}') on ; fi; done
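
Written out readably, the same idea as a per-node loop (a hedged sketch,
not a drop-in replacement for the two passes above):

#!/bin/bash
# For each overcloud server on the 192.x ctlplane network, power-cycle the
# backing Ironic node if a single ping to its address fails.
nova list | awk -F'ctlplane=' '/192\./ {split($2, a, " "); print a[1]}' | while read -r ip; do
    if ! ping -c 1 -W 2 "$ip" > /dev/null; then
        server_id=$(nova list | grep "$ip" | awk '{print $2}')
        node_id=$(ironic node-list | grep "$server_id" | awk '{print $2}')
        ironic node-set-power-state "$node_id" off
        sleep 10
        ironic node-set-power-state "$node_id" on
    fi
done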

This typically fixes the deployment so things can continue; however, it
would be great to have this type of logic added to OOO: if a node goes
from BUILD->ACTIVE but isn't reachable in 120 seconds, ironic simply
reboots the host.

Also, I suggest if the second attempt fails, reschedule the host --
sometimes I have seen where a raid controller or something goes bad
out of our control.

Thanks for listening!
rook

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [E] [TripleO] scripts to do post deployment analysis of an overcloud

2016-08-03 Thread Joe Talerico
On Wed, Jul 27, 2016 at 2:04 AM, Hugh Brock  wrote:
> On Jul 26, 2016 8:08 PM, "Gordon, Kent" 
> wrote:
>>
>>
>>
>>
>>
>>
>> > -Original Message-
>> > From: Gonéri Le Bouder [mailto:gon...@lebouder.net]
>> > Sent: Tuesday, July 26, 2016 12:24 PM
>> > To: openstack-dev@lists.openstack.org
>> > Subject: [E] [openstack-dev] [TripleO] scripts to do post deployment
>> > analysis
>> > of an overcloud
>> >
>> > Hi all,
>> >
>> > For the Distributed-CI[0] project, we did two scripts[1] that we use to
>> > extract
>>
>> Links not included in message
>>
>> > information from an overcloud.
>> > We use this information to improve the readability of the deployment
>> > logs.
>> > I attached an example to show how we use the extracted stack
>> > information.
>> >
>> > Now my question, do you know some other tools that we can use to do this
>> > kind of anaylsis?
>> > --
>> > Gonéri Le Bouder
>>
>> Kent S. Gordon
>
> Joe, any overlap with Browbeat here?
>
> -Hugh

Hey Hugh- Not from what I can tell... Any reason this tool couldn't be
built into the openstack-api?

This seems to be looking at the heat information? Browbeat will create
an ansible inventory and log in to an Overcloud deployment to check
settings on the cluster, rather than looking at heat.

Joe

>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Consolidating TripleO validations with Browbeat validations

2016-06-20 Thread Joe Talerico
On Mon, Jun 20, 2016 at 12:41 PM, Ihar Hrachyshka <ihrac...@redhat.com> wrote:
>
>> On 20 Jun 2016, at 18:37, Joe Talerico <jtale...@redhat.com> wrote:
>>
>> Hello - It would seem there is a little bit of overlap with TripleO
>> validations ( clapper validations ) and Browbeat *Checks*. I would
>> like to see these two come together, and I wanted to get some feedback
>> on this.
>>
>> For reference here are the Browbeat checks :
>> https://github.com/openstack/browbeat/tree/master/ansible/check
>>
>> We check for common deployment mistakes, possible deployment
>> performance issues and some bugs that could impact the scale and
>> performance of your cloud... At the end we build a report of found
>> issues with the cloud, like :
>> https://github.com/openstack/browbeat/blob/master/ansible/check/browbeat-example-bug_report.log
>>
>> We eventually wanted to take these findings and push them to
>> ElasticSearch as metadata for our result data (just so we would be
>> aware of any BZs or possibly missed tuning).
>>
>> Anyhoo, I just would like to get feedback on consolidating these
>> checks into TripleO Validations if that makes sense. If this does make
>> sense, who could I work with to see that this happens?
>
> Sorry for hijacking the thread somewhat, but it seems that 
> neutron-sanity-check would cover for some common deployment issues, if 
> utilized by projects like browbeat. Has anyone considered the tool?
>
> http://docs.openstack.org/cli-reference/neutron-sanity-check.html
>
> If there are projects that are interested in integrating checks that are 
> implemented by neutron community, we would be glad to give some guidance.
>
> Ihar

Hey Ihar - the TripleO validations are using this:
https://github.com/rthallisey/clapper/blob/0881300a815f8b801a38d117b8d01b42a00c7f7b/ansible-tests/validations/neutron-sanity-check.yaml
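
For anyone who wants to run it by hand, the usual invocation looks like
this (the config paths are the stock ones; adjust for the deployment):

# Runs neutron's built-in environment/driver checks against the local config.
neutron-sanity-check --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/plugins/ml2/ml2_conf.ini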

> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] Consolidating TripleO validations with Browbeat validations

2016-06-20 Thread Joe Talerico
Hello - It would seem there is a little bit of overlap with TripleO
validations ( clapper validations ) and Browbeat *Checks*. I would
like to see these two come together, and I wanted to get some feedback
on this.

For reference here are the Browbeat checks :
https://github.com/openstack/browbeat/tree/master/ansible/check

We check for common deployment mistakes, possible deployment
performance issues and some bugs that could impact the scale and
performance of your cloud... At the end we build a report of found
issues with the cloud, like :
https://github.com/openstack/browbeat/blob/master/ansible/check/browbeat-example-bug_report.log
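
Running the checks is just an Ansible playbook run from a Browbeat
checkout, roughly like this - a hedged sketch, since the inventory and
playbook paths are from memory and should be checked against the repo
linked above:

# Inventory of overcloud hosts, then the check playbook.
ansible-playbook -i ansible/hosts ansible/check/site.yml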

We eventually wanted to take these findings and push them to
ElasticSearch as metadata for our result data (just so we would be
aware of any BZs or possibly missed tuning).

Anyhoo, I just would like to get feedback on consolidating these
checks into TripleO Validations if that makes sense. If this does make
sense, who could I work with to see that this happens?

Thanks
Joe

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev