Re: [openstack-dev] [nova] Rocky RC time regression analysis

2018-10-09 Thread Matt Riedemann

On 10/5/2018 6:59 PM, melanie witt wrote:
5) when live migration fails due to an internal error, rollback is not 
handled correctly https://bugs.launchpad.net/nova/+bug/1788014


- Bug was reported on 2018-08-20
- The change that caused the regression landed on 2018-07-26, FF day 
https://review.openstack.org/434870

- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because sean-k-mooney was doing live migrations and found 
that when a LM failed because of a QEMU internal error, the VM remained 
ACTIVE but the VM no longer had network connectivity.

- Question: why wasn't this caught earlier?
- Answer: We would need a live migration job scenario that intentionally 
initiates and fails a live migration, then verifies network connectivity 
after the rollback occurs.

- Question: can we add something like that?


Not in Tempest, no, but we could run something in the 
nova-live-migration job since that executes via its own script. We could 
hack something in like what we have proposed for testing evacuate:


https://review.openstack.org/#/c/602174/

The trick is figuring out how to introduce a fault in the destination 
host without taking down the service, because if the compute service is 
down we won't schedule to it.
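
Something like this is the kind of hook I'm imagining -- very much a 
sketch, and the subnode name, port range and function names are just 
illustrative assumptions, not anything the job has today:

# hypothetical additions to the nova-live-migration job script
DEST_HOST=${DEST_HOST:-compute-subnode}

function inject_migration_fault {
    # Drop QEMU's native migration traffic (libvirt's default port
    # range) on the destination: the transfer fails mid-flight, but
    # nova-compute stays up so the scheduler still picks the host.
    ssh "$DEST_HOST" sudo iptables -I INPUT -p tcp --dport 49152:49215 -j DROP
}

function remove_migration_fault {
    ssh "$DEST_HOST" sudo iptables -D INPUT -p tcp --dport 49152:49215 -j DROP
}

function verify_rollback_connectivity {
    local vm_ip=$1
    # After nova rolls the migration back, the guest should still be
    # ACTIVE on the source *and* reachable -- the check bug 1788014
    # showed we were missing.
    ping -c 3 "$vm_ip"
}

The scenario would inject the fault, start a live migration, wait for 
the rollback, ping the guest, and then clean the rule up again.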




6) nova-manage db online_data_migrations hangs on instances with no host 
set https://bugs.launchpad.net/nova/+bug/1788115


- Bug was reported on 2018-08-21
- The patch that introduced the bug landed on 2018-05-30 
https://review.openstack.org/567878

- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: To hit the bug, you had to have had instances with no host set 
(that failed to schedule) in your database during an upgrade. This does 
not happen during the grenade job
- Question: could we add anything to the grenade job that would leave 
some instances with no host set to cover cases like this?


Probably - I'd think creating a server on the old side with some 
parameters that we know won't schedule would do it, maybe requesting an 
AZ that doesn't exist, or some other kind of scheduler hint that we know 
won't work so we get a NoValidHost. However, online_data_migrations in 
grenade probably don't run on the cell0 database, so I'm not sure we 
would have caught that case.
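
As a rough sketch of the server-create part (the resource name, image 
and flavor are just illustrative assumptions, and where exactly this 
would hook into grenade's create/verify phases is hand-wavy):

# hypothetical grenade resource on the old side
function create_unschedulable_server {
    # A nonexistent AZ guarantees NoValidHost, so the instance lands in
    # ERROR with no host set and gets mapped to cell0.
    openstack server create \
        --flavor m1.tiny \
        --image cirros \
        --availability-zone no-such-az \
        --wait no-host-server || true

    # Sanity check: the server should be ERROR, never ACTIVE.
    [[ $(openstack server show no-host-server -f value -c status) == ERROR ]]
}

# after the upgrade, run the migrations that used to hang on exactly
# this kind of instance (ideally against cell0 too):
#   nova-manage db online_data_migrations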


--

Thanks,

Matt



Re: [openstack-dev] [nova] Rocky RC time regression analysis

2018-10-08 Thread Eric Fried
Mel-

I don't have much of anything useful to add here, but wanted to say
thanks for this thorough analysis. It must have taken a lot of time and
work.

Musings inline.

On 10/05/2018 06:59 PM, melanie witt wrote:
> Hey everyone,
> 
> During our Rocky retrospective discussion at the PTG [1], we talked
> about the spec freeze deadline (milestone 2, historically it had been
> milestone 1) and whether or not it was related to the hectic RC time,
> full of late-breaking regressions, that we had last cycle. I had an action item
> to go through the list of RC time bugs [2] and dig into each one,
> examining: when the patch that introduced the bug landed vs when the bug
> was reported, why it wasn't caught sooner, and report back so we can
> take a look together and determine whether they were related to the spec
> freeze deadline.
> 
> I used this etherpad to make notes [3], which I will [mostly] copy-paste
> here. These are all after RC1 and I'll paste them in chronological order
> of when the bug was reported.
> 
> Milestone 1 (r-1) was 2018-04-19.
> Spec freeze was at milestone 2 (r-2), which was 2018-06-07.
> Feature freeze (FF) was on 2018-07-26.
> RC1 was on 2018-08-09.
> 
> 1) Broken live migration bandwidth minimum => maximum based on neutron
> event https://bugs.launchpad.net/nova/+bug/1786346
> 
> - Bug was reported on 2018-08-09, the day of RC1
> - The patch that caused the regression landed on 2018-03-30
> https://review.openstack.org/497457
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because prometheanfire was doing live migrations and noticed
> they seemed to be stuck at 1MiB/s for linuxbridge VMs
> - The bug was due to a race, so the gate didn't hit it
> - Comment on the regression bug from dansmith: "The few hacked up gate
> jobs we used to test this feature at merge time likely didn't notice the
> race because the migrations finished before the potential timeout and/or
> are on systems so loaded that the neutron event came late enough for us
> to win the race repeatedly."
> 
> 2) Docs for the zvm driver missing
> 
> - All zvm driver code changes were merged by 2018-07-17, but the
> documentation was overlooked and only noticed near RC time
> - Blueprint was approved on 2018-02-12
> 
> 3) Volume status remains "detaching" after a failure to detach a volume
> due to DeviceDetachFailed https://bugs.launchpad.net/nova/+bug/1786318
> 
> - Bug was reported on 2018-08-09, the day of RC1
> - The change that introduced the regression landed on 2018-02-21
> https://review.openstack.org/546423
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: Unit tests were not asserting the call to the roll_detaching
> volume API. Coverage has since been added along with the bug fix
> https://review.openstack.org/590439
> 
> 4) OVB overcloud deploy fails on nova placement errors
> https://bugs.launchpad.net/nova/+bug/1787910
> 
> - Bug was reported on 2018-08-20
> - Change that caused the regression landed on 2018-07-26, FF day
> https://review.openstack.org/517921
> - Blueprint was approved on 2018-05-16
> - Was found because of a failure in the
> legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master
> CI job. The ironic-inspector CI upstream also failed because of this, as
> noted by dtantsur.
> - Question: why did it take nearly a month for the failure to be
> noticed? Is there any way we can cover this in our
> ironic-tempest-dsvm-ipa-wholedisk-bios-agent_ipmitool-tinyipa job?
> 
> 5) when live migration fails due to an internal error, rollback is not
> handled correctly https://bugs.launchpad.net/nova/+bug/1788014
> 
> - Bug was reported on 2018-08-20
> - The change that caused the regression landed on 2018-07-26, FF day
> https://review.openstack.org/434870
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because sean-k-mooney was doing live migrations and found
> that when a LM failed because of a QEMU internal error, the VM remained
> ACTIVE but the VM no longer had network connectivity.
> - Question: why wasn't this caught earlier?
> - Answer: We would need a live migration job scenario that intentionally
> initiates and fails a live migration, then verifies network connectivity
> after the rollback occurs.
> - Question: can we add something like that?
> 
> 6) nova-manage db online_data_migrations hangs on instances with no host
> set https://bugs.launchpad.net/nova/+bug/1788115
> 
> - Bug was reported on 2018-08-21
> - The patch that introduced the bug landed on 2018-05-30
> https://review.openstack.org/567878
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: To hit the bug, you had to have had instances with no host set
> (that failed to schedule) in your database during an upgrade. This does
> not happen during the grenade job
> - Question: could we add anything to the grenade job that would leave
> some instances with no host set to cover cases like this?

[openstack-dev] [nova] Rocky RC time regression analysis

2018-10-05 Thread melanie witt

Hey everyone,

During our Rocky retrospective discussion at the PTG [1], we talked 
about the spec freeze deadline (milestone 2, historically it had been 
milestone 1) and whether or not it was related to the hectic RC time, 
full of late-breaking regressions, that we had last cycle. I had an action item 
to go through the list of RC time bugs [2] and dig into each one, 
examining: when the patch that introduced the bug landed vs when the bug 
was reported, why it wasn't caught sooner, and report back so we can 
take a look together and determine whether they were related to the spec 
freeze deadline.


I used this etherpad to make notes [3], which I will [mostly] copy-paste 
here. These are all after RC1 and I'll paste them in chronological order 
of when the bug was reported.


Milestone 1 (r-1) was 2018-04-19.
Spec freeze was at milestone 2 (r-2), which was 2018-06-07.
Feature freeze (FF) was on 2018-07-26.
RC1 was on 2018-08-09.

1) Broken live migration bandwidth minimum => maximum based on neutron 
event https://bugs.launchpad.net/nova/+bug/1786346


- Bug was reported on 2018-08-09, the day of RC1
- The patch that caused the regression landed on 2018-03-30 
https://review.openstack.org/497457

- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because prometheanfire was doing live migrations and noticed 
they seemed to be stuck at 1MiB/s for linuxbridge VMs

- The bug was due to a race, so the gate didn't hit it
- Comment on the regression bug from dansmith: "The few hacked up gate 
jobs we used to test this feature at merge time likely didn't notice the 
race because the migrations finished before the potential timeout and/or 
are on systems so loaded that the neutron event came late enough for us 
to win the race repeatedly."


2) Docs for the zvm driver missing

- All zvm driver code changes were merged by 2018-07-17, but the 
documentation was overlooked and only noticed near RC time

- Blueprint was approved on 2018-02-12

3) Volume status remains "detaching" after a failure to detach a volume 
due to DeviceDetachFailed https://bugs.launchpad.net/nova/+bug/1786318


- Bug was reported on 2018-08-09, the day of RC1
- The change that introduced the regression landed on 2018-02-21 
https://review.openstack.org/546423

- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: Unit tests were not asserting the call to the roll_detaching 
volume API. Coverage has since been added along with the bug fix 
https://review.openstack.org/590439


4) OVB overcloud deploy fails on nova placement errors 
https://bugs.launchpad.net/nova/+bug/1787910


- Bug was reported on 2018-08-20
- Change that caused the regression landed on 2018-07-26, FF day 
https://review.openstack.org/517921

- Blueprint was approved on 2018-05-16
- Was found because of a failure in the 
legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master 
CI job. The ironic-inspector CI upstream also failed because of this, as 
noted by dtantsur.
- Question: why did it take nearly a month for the failure to be 
noticed? Is there any way we can cover this in our 
ironic-tempest-dsvm-ipa-wholedisk-bios-agent_ipmitool-tinyipa job?


5) when live migration fails due to an internal error, rollback is not 
handled correctly https://bugs.launchpad.net/nova/+bug/1788014


- Bug was reported on 2018-08-20
- The change that caused the regression landed on 2018-07-26, FF day 
https://review.openstack.org/434870

- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because sean-k-mooney was doing live migrations and found 
that when a LM failed because of a QEMU internal error, the VM remained 
ACTIVE but the VM no longer had network connectivity.

- Question: why wasn't this caught earlier?
- Answer: We would need a live migration job scenario that intentionally 
initiates and fails a live migration, then verifies network connectivity 
after the rollback occurs.

- Question: can we add something like that?

6) nova-manage db online_data_migrations hangs on instances with no host 
set https://bugs.launchpad.net/nova/+bug/1788115


- Bug was reported on 2018-08-21
- The patch that introduced the bug landed on 2018-05-30 
https://review.openstack.org/567878

- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: To hit the bug, you had to have had instances with no host set 
(that failed to schedule) in your database during an upgrade. This does 
not happen during the grenade job
- Question: could we add anything to the grenade job that would leave 
some instances with no host set to cover cases like this?


7) release notes erroneously say that nova-consoleauth doesn't have to 
run in Rocky https://bugs.launchpad.net/nova/+bug/1788470


- Bug was reported on 2018-08-22
- The patches that conveyed the wrong information for the docs landed on 
2018-05-07 https://review.openstack.org/565367

- Blueprint was