Re: [openstack-dev] [nova] Rocky RC time regression analysis
On 10/5/2018 6:59 PM, melanie witt wrote:
> 5) When live migration fails due to an internal error, rollback is not
> handled correctly https://bugs.launchpad.net/nova/+bug/1788014
>
> - Bug was reported on 2018-08-20
> - The change that caused the regression landed on 2018-07-26, FF day
> https://review.openstack.org/434870
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because sean-k-mooney was doing live migrations and found
> that when a LM failed because of a QEMU internal error, the VM remained
> ACTIVE but no longer had network connectivity.
> - Question: why wasn't this caught earlier?
> - Answer: We would need a live migration job scenario that intentionally
> initiates and fails a live migration, then verifies network connectivity
> after the rollback occurs.
> - Question: can we add something like that?

Not in Tempest, no, but we could run something in the nova-live-migration job since that executes via its own script. We could hack something in like what we have proposed for testing evacuate:

https://review.openstack.org/#/c/602174/

The trick is figuring out how to introduce a fault in the destination host without taking down the service, because if the compute service is down we won't schedule to it.

> 6) nova-manage db online_data_migrations hangs on instances with no host
> set https://bugs.launchpad.net/nova/+bug/1788115
>
> - Bug was reported on 2018-08-21
> - The patch that introduced the bug landed on 2018-05-30
> https://review.openstack.org/567878
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: To hit the bug, you had to have had instances with no host set
> (that failed to schedule) in your database during an upgrade. This does
> not happen during the grenade job.
> - Question: could we add anything to the grenade job that would leave
> some instances with no host set to cover cases like this?

Probably - I'd think creating a server on the old side with some parameters that we know won't schedule would do it, maybe requesting an AZ that doesn't exist, or some other kind of scheduler hint that we know won't work so we get a NoValidHost. However, online_data_migrations in grenade probably don't run on the cell0 database, so I'm not sure we would have caught that case.

--

Thanks,

Matt

__________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
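For the item 5 idea, one rough shape such a hack in the nova-live-migration job script could take is sketched below. This is purely illustrative and not an existing gate script: it assumes a devstack-style two-node setup, a booted guest named "test-server", a known guest IP in $TEST_SERVER_IP, and that libvirt is using its default QEMU migration port range; all of those are assumptions, not facts about the job.

```shell
#!/bin/bash
# Hypothetical fault-injection sketch for the nova-live-migration job.
# "test-server" and $TEST_SERVER_IP are invented names for illustration.

# On the destination host, reject libvirt's default QEMU migration port
# range so the migration itself fails while nova-compute stays up and
# the host remains schedulable.
sudo iptables -I INPUT -p tcp --dport 49152:49215 -j REJECT

# Trigger a live migration, letting the scheduler pick the (broken) dest.
nova live-migration test-server

# Give the rollback time to complete, then check the server state.
sleep 60
status=$(openstack server show test-server -f value -c status)
if [ "$status" != "ACTIVE" ]; then
    echo "rollback left server in status $status" >&2
    exit 1
fi

# Per bug 1788014, the regression is lost connectivity after rollback,
# so the real assertion is that the guest still answers pings.
ping -c 4 "$TEST_SERVER_IP"
```

The point of the iptables approach is exactly the trick described above: the fault lives below nova, so the compute service never goes down and the scheduler still picks the host.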
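To make the item 6 failure mode concrete, here is a minimal, simplified model (this is not nova's actual code; the function names are invented) of the batch-loop pattern behind the hang: each online data migration reports how many rows it found and how many it actually migrated, and a runner that only exits when nothing is found will spin forever on rows it can never process - such as instances with no host set - unless it also watches for lack of progress.

```python
# Simplified model (not nova code) of an online-data-migration runner.
# Each migration callable returns (found, done): rows matched vs. rows
# actually migrated in this batch.

def run_migrations(migrations, batch_size=50):
    """Run migrations in batches until nothing is found or nothing progresses."""
    while True:
        total_found = total_done = 0
        for migrate in migrations:
            found, done = migrate(batch_size)
            total_found += found
            total_done += done
        if total_found == 0:
            return "complete"
        if total_done == 0:
            # Rows remain but none can be migrated (e.g. instances with
            # no host set): stop instead of looping forever.
            return "stalled"

def stuck_migration(batch):
    # Models the bug trigger: rows are always found, never migrated.
    return (3, 0)

print(run_migrations([stuck_migration]))  # -> stalled
```

Without the total_done check, the stuck_migration case is an infinite loop, which matches the "hangs" symptom in the bug report - and it only reproduces if the database already contains such rows, which is why grenade never hit it.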
Re: [openstack-dev] [nova] Rocky RC time regression analysis
Mel-

I don't have much of anything useful to add here, but wanted to say thanks for this thorough analysis. It must have taken a lot of time and work. Musings inline.

On 10/05/2018 06:59 PM, melanie witt wrote:
> Hey everyone,
>
> During our Rocky retrospective discussion at the PTG [1], we talked
> about the spec freeze deadline (milestone 2, historically it had been
> milestone 1) and whether or not it was related to the hectic
> late-breaking regression RC time we had last cycle. I had an action item
> to go through the list of RC time bugs [2] and dig into each one,
> examining: when the patch that introduced the bug landed vs when the bug
> was reported, why it wasn't caught sooner, and report back so we can
> take a look together and determine whether they were related to the spec
> freeze deadline.
>
> I used this etherpad to make notes [3], which I will [mostly] copy-paste
> here. These are all after RC1 and I'll paste them in chronological order
> of when the bug was reported.
>
> Milestone 1 (r-1) was 2018-04-19.
> Spec freeze (milestone 2, r-2) was 2018-06-07.
> Feature freeze (FF) was on 2018-07-26.
> RC1 was on 2018-08-09.
>
> 1) Broken live migration bandwidth minimum => maximum based on neutron
> event https://bugs.launchpad.net/nova/+bug/1786346
>
> - Bug was reported on 2018-08-09, the day of RC1
> - The patch that caused the regression landed on 2018-03-30
> https://review.openstack.org/497457
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because prometheanfire was doing live migrations and noticed
> they seemed to be stuck at 1MiB/s for linuxbridge VMs
> - The bug was due to a race, so the gate didn't hit it
> - Comment on the regression bug from dansmith: "The few hacked up gate
> jobs we used to test this feature at merge time likely didn't notice the
> race because the migrations finished before the potential timeout and/or
> are on systems so loaded that the neutron event came late enough for us
> to win the race repeatedly."
>
> 2) Docs for the zvm driver missing
>
> - All zvm driver code changes were merged by 2018-07-17, but the
> documentation was overlooked; this was noticed near RC time
> - Blueprint was approved on 2018-02-12
>
> 3) Volume status remains "detaching" after a failure to detach a volume
> due to DeviceDetachFailed https://bugs.launchpad.net/nova/+bug/1786318
>
> - Bug was reported on 2018-08-09, the day of RC1
> - The change that introduced the regression landed on 2018-02-21
> https://review.openstack.org/546423
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: Unit tests were not asserting the call to the roll_detaching
> volume API. Coverage has since been added along with the bug fix
> https://review.openstack.org/590439
>
> 4) OVB overcloud deploy fails on nova placement errors
> https://bugs.launchpad.net/nova/+bug/1787910
>
> - Bug was reported on 2018-08-20
> - Change that caused the regression landed on 2018-07-26, FF day
> https://review.openstack.org/517921
> - Blueprint was approved on 2018-05-16
> - Was found because of a failure in the
> legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master
> CI job. The ironic-inspector CI upstream also failed because of this, as
> noted by dtantsur.
> - Question: why did it take nearly a month for the failure to be
> noticed? Is there any way we can cover this in our
> ironic-tempest-dsvm-ipa-wholedisk-bios-agent_ipmitool-tinyipa job?
>
> 5) When live migration fails due to an internal error, rollback is not
> handled correctly https://bugs.launchpad.net/nova/+bug/1788014
>
> - Bug was reported on 2018-08-20
> - The change that caused the regression landed on 2018-07-26, FF day
> https://review.openstack.org/434870
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because sean-k-mooney was doing live migrations and found
> that when a LM failed because of a QEMU internal error, the VM remained
> ACTIVE but no longer had network connectivity.
> - Question: why wasn't this caught earlier?
> - Answer: We would need a live migration job scenario that intentionally
> initiates and fails a live migration, then verifies network connectivity
> after the rollback occurs.
> - Question: can we add something like that?
>
> 6) nova-manage db online_data_migrations hangs on instances with no host
> set https://bugs.launchpad.net/nova/+bug/1788115
>
> - Bug was reported on 2018-08-21
> - The patch that introduced the bug landed on 2018-05-30
> https://review.openstack.org/567878
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: To hit the bug, you had to have had instances with no host set
> (that failed to schedule) in your database during an upgrade. This does
> not happen during the grenade job.
> - Question: could we add anything to the grenade job that would leave
> some
[openstack-dev] [nova] Rocky RC time regression analysis
Hey everyone,

During our Rocky retrospective discussion at the PTG [1], we talked about the spec freeze deadline (milestone 2, historically it had been milestone 1) and whether or not it was related to the hectic late-breaking regression RC time we had last cycle. I had an action item to go through the list of RC time bugs [2] and dig into each one, examining: when the patch that introduced the bug landed vs when the bug was reported, why it wasn't caught sooner, and report back so we can take a look together and determine whether they were related to the spec freeze deadline.

I used this etherpad to make notes [3], which I will [mostly] copy-paste here. These are all after RC1 and I'll paste them in chronological order of when the bug was reported.

Milestone 1 (r-1) was 2018-04-19.
Spec freeze (milestone 2, r-2) was 2018-06-07.
Feature freeze (FF) was on 2018-07-26.
RC1 was on 2018-08-09.

1) Broken live migration bandwidth minimum => maximum based on neutron
event https://bugs.launchpad.net/nova/+bug/1786346

- Bug was reported on 2018-08-09, the day of RC1
- The patch that caused the regression landed on 2018-03-30
https://review.openstack.org/497457
- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because prometheanfire was doing live migrations and noticed
they seemed to be stuck at 1MiB/s for linuxbridge VMs
- The bug was due to a race, so the gate didn't hit it
- Comment on the regression bug from dansmith: "The few hacked up gate
jobs we used to test this feature at merge time likely didn't notice the
race because the migrations finished before the potential timeout and/or
are on systems so loaded that the neutron event came late enough for us
to win the race repeatedly."

2) Docs for the zvm driver missing

- All zvm driver code changes were merged by 2018-07-17, but the
documentation was overlooked; this was noticed near RC time
- Blueprint was approved on 2018-02-12

3) Volume status remains "detaching" after a failure to detach a volume
due to DeviceDetachFailed https://bugs.launchpad.net/nova/+bug/1786318

- Bug was reported on 2018-08-09, the day of RC1
- The change that introduced the regression landed on 2018-02-21
https://review.openstack.org/546423
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: Unit tests were not asserting the call to the roll_detaching
volume API. Coverage has since been added along with the bug fix
https://review.openstack.org/590439

4) OVB overcloud deploy fails on nova placement errors
https://bugs.launchpad.net/nova/+bug/1787910

- Bug was reported on 2018-08-20
- Change that caused the regression landed on 2018-07-26, FF day
https://review.openstack.org/517921
- Blueprint was approved on 2018-05-16
- Was found because of a failure in the
legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master
CI job. The ironic-inspector CI upstream also failed because of this, as
noted by dtantsur.
- Question: why did it take nearly a month for the failure to be
noticed? Is there any way we can cover this in our
ironic-tempest-dsvm-ipa-wholedisk-bios-agent_ipmitool-tinyipa job?

5) When live migration fails due to an internal error, rollback is not
handled correctly https://bugs.launchpad.net/nova/+bug/1788014

- Bug was reported on 2018-08-20
- The change that caused the regression landed on 2018-07-26, FF day
https://review.openstack.org/434870
- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because sean-k-mooney was doing live migrations and found
that when a LM failed because of a QEMU internal error, the VM remained
ACTIVE but no longer had network connectivity.
- Question: why wasn't this caught earlier?
- Answer: We would need a live migration job scenario that intentionally
initiates and fails a live migration, then verifies network connectivity
after the rollback occurs.
- Question: can we add something like that?

6) nova-manage db online_data_migrations hangs on instances with no host
set https://bugs.launchpad.net/nova/+bug/1788115

- Bug was reported on 2018-08-21
- The patch that introduced the bug landed on 2018-05-30
https://review.openstack.org/567878
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: To hit the bug, you had to have had instances with no host set
(that failed to schedule) in your database during an upgrade. This does
not happen during the grenade job.
- Question: could we add anything to the grenade job that would leave
some instances with no host set to cover cases like this?

7) Release notes erroneously say that nova-consoleauth doesn't have to
run in Rocky https://bugs.launchpad.net/nova/+bug/1788470

- Bug was reported on 2018-08-22
- The patches that conveyed the wrong information for the docs landed on
2018-05-07 https://review.openstack.org/565367
- Blueprint was