[Yahoo-eng-team] [Bug 1896463] Re: evacuation failed: Port update failed : Unable to correlate PCI slot

2021-04-13 Thread Elod Illes
** Changed in: nova/ussuri
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896463

Title:
  evacuation failed: Port update failed : Unable to correlate PCI slot

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Fix Released
Status in OpenStack Compute (nova) victoria series:
  Fix Released

Bug description:
  Description
  ===
  If _update_available_resource() of the resource tracker is called between
  _do_rebuild_instance_with_claim() and instance.save() while a VM instance is
  being evacuated to the destination host,

  nova/compute/manager.py

  2931 def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
  2932 +-- 84 lines: injected_files, new_pass, orig_sys_metadata,---
  3016 claim_ctxt = rebuild_claim(
  3017     context, instance, scheduled_node,
  3018     limits=limits, image_meta=image_meta,
  3019     migration=migration)
  3020 self._do_rebuild_instance_with_claim(
  3021 +-- 47 lines: claim_ctxt, context, instance, orig_image_ref,-
  3068 instance.apply_migration_context()
  3069 # NOTE (ndipanov): This save will now update the host and node
  3070 # attributes making sure that next RT pass is consistent since
  3071 # it will be based on the instance and not the migration DB
  3072 # entry.
  3073 instance.host = self.host
  3074 instance.node = scheduled_node
  3075 instance.save()
  3076 instance.drop_migration_context()

  the instance is not yet treated as an instance managed by the destination
  host, because its host has not been updated in the DB:

  2020-09-19 07:27:36.321 8 WARNING nova.compute.resource_tracker [req-b35d5b9a-0786-4809-bd81-ad306cdda8d5 - - - - -] Instance 22f6ca0e-f964-4467-83a3-f2bf12bb05ae is not being actively managed by this compute host but has allocations referencing this compute host: {u'resources': {u'MEMORY_MB': 12288, u'VCPU': 2, u'DISK_GB': 10}}. Skipping heal of allocation because we do not know what to do.
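
  For illustration, here is a minimal self-contained sketch of why the
  resource tracker misses the instance. This is hypothetical, simplified code;
  the names below are invented and do not match nova's real classes:

    # Hypothetical, simplified model of the race; names are invented for
    # illustration only.

    class FakeInstance:
        def __init__(self, uuid, host):
            self.uuid = uuid
            self.host = host  # still points at the *source* host mid-evacuation

    def instances_managed_by(all_instances, this_host):
        # The resource tracker only counts instances whose DB record already
        # names this host; mid-evacuation the record still names the source.
        return [i for i in all_instances if i.host == this_host]

    # The rebuild claim on com2 has succeeded, but instance.save() (which
    # flips host from com1 to com2) has not run yet.
    vm = FakeInstance("22f6ca0e-f964-4467-83a3-f2bf12bb05ae", host="com1")

    managed = instances_managed_by([vm], this_host="com2")
    assert vm not in managed  # com2 does not "see" the instance, hence the
                              # "not being actively managed" warning above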

  As a result, the SR-IOV ports' PCI devices (VFs) were freed by clean_usage()
  even though the VM was still using them.

   743 def _update_available_resource(self, context, resources):
   744 +-- 45 lines: # initialize the compute node object, creating it--
   789 self.pci_tracker.clean_usage(instances, migrations, orphans)
   790 dev_pools_obj = self.pci_tracker.stats.to_device_pools_obj()
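
  To see why that call frees the VF, here is a hypothetical continuation of
  the sketch above (again with invented, simplified names, not nova's real
  PCI tracker): a clean_usage-style pass releases any claimed device whose
  owner is not in the managed set, and mid-evacuation the owner is invisible
  to this host's tracker.

    # Hypothetical, simplified sketch; not nova's real PCI tracker.

    class FakePciTracker:
        def __init__(self):
            # PCI address -> owning instance uuid (None means free)
            self.claims = {"0000:05:12.2": "22f6ca0e-f964-4467-83a3-f2bf12bb05ae"}

        def clean_usage(self, managed_instance_uuids):
            for addr, owner in self.claims.items():
                if owner is not None and owner not in managed_instance_uuids:
                    # The owner is mid-evacuation, not leaked, but the
                    # tracker cannot tell the difference and frees the VF.
                    self.claims[addr] = None

    tracker = FakePciTracker()
    # The managed set does not contain the instance yet (previous sketch).
    tracker.clean_usage(managed_instance_uuids=set())
    print(tracker.claims)  # {'0000:05:12.2': None} -- freed while still in use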

  After that, when this VM was evacuated to another compute host again, we got
  the error below.


  Steps to reproduce
  ==
  1. create a VM on com1 with SR-IOV VF ports.
  2. stop and disable nova-compute service on com1
  3. wait 60 sec (nova-compute reporting interval)
  4. evacuate the VM to com2
  5. wait until the VM is active on com2
  6. enable and start nova-compute on com1
  7. wait 60 sec (nova-compute reporting interval)
  8. stop and disable nova-compute service on com2
  9. wait 60 sec (nova-compute reporting interval)
  10. evacuate the VM to com1
  11. wait until the VM is active on com1
  12. enable and start nova-compute on com2
  13. wait 60 sec (nova-compute reporting interval)
  14. go to step 2 (a scripted version of this loop is sketched below).
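
  The following is only a hypothetical sketch of that loop using
  python-novaclient. It assumes the pre-2.53 services API (services addressed
  by host and binary; newer microversions use service UUIDs), placeholder
  credentials and host names, and that actually stopping/starting the
  nova-compute process is done out of band (e.g. via systemctl), since
  disabling a service only removes it from scheduling.

    # Hedged reproduction-loop sketch; credentials, URLs, and the server
    # UUID are placeholders. Exact client signatures vary by version.
    import time
    from novaclient import client

    nova = client.Client("2", "user", "password", "project", "http://keystone:5000/v3")
    SERVER = "22f6ca0e-f964-4467-83a3-f2bf12bb05ae"  # the VM with SR-IOV VF ports
    REPORT_INTERVAL = 60  # nova-compute reporting interval, in seconds

    def wait_active(server_id):
        # Poll until the evacuation rebuild finishes and the VM is ACTIVE.
        while nova.servers.get(server_id).status != "ACTIVE":
            time.sleep(5)

    src, dst = "com1", "com2"
    while True:
        # (stop the nova-compute process on src out of band here)
        nova.services.disable(src, "nova-compute")   # step 2/8
        time.sleep(REPORT_INTERVAL)                  # step 3/9
        nova.servers.evacuate(SERVER, host=dst)      # step 4/10
        wait_active(SERVER)                          # step 5/11
        # (start the nova-compute process on src out of band here)
        nova.services.enable(src, "nova-compute")    # step 6/12
        time.sleep(REPORT_INTERVAL)                  # step 7/13
        src, dst = dst, src                          # step 14: repeat the other way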

  Expected result
  ===
  Evacuation should complete without errors.

  Actual result
  =
  Evacuation failed with "Port update failed"

  Environment
  ===
  openstack-nova-compute-18.0.1-1 with SR-IOV ports is used; libvirt is used.

  Logs & Configs
  ==
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [req-38dd0be2-7223-4a59-8073-dd1b072125c5 c424fbb3d41f444bb7a025266fda36da 6255a6910b9b4d3ba34a93624fe7fb22 - default default] [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] Setting instance vm_state to ERROR: PortUpdateFailed: Port update failed for port 76dc33dc-5b3b-4c45-b2cb-fd59025a4dbd: Unable to correlate PCI slot :05:12.2
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] Traceback (most recent call last):
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7993, in _error_out_instance_on_exception
  2020-09-19 

[Yahoo-eng-team] [Bug 1896463] Re: evacuation failed: Port update failed : Unable to correlate PCI slot

2021-02-05 Thread Elod Illes
** Changed in: nova/victoria
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896463

Title:
  evacuation failed: Port update failed : Unable to correlate PCI slot

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  In Progress
Status in OpenStack Compute (nova) victoria series:
  Fix Released


[Yahoo-eng-team] [Bug 1896463] Re: evacuation failed: Port update failed : Unable to correlate PCI slot

2020-09-24 Thread sean mooney
just adding the previously filed downstream Red Hat bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1852110

For context, this can happen as far back as queens, so when we root-cause the
issue and fix it, the fix should likely be backported to queens. There are
other, older bugs from newton that look similar and are related to unshelve,
so it is possible that the same issue affects multiple move operations.

** Bug watch added: Red Hat Bugzilla #1852110
   https://bugzilla.redhat.com/show_bug.cgi?id=1852110

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/stein
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova/queens
   Importance: Undecided
   Status: New

** Also affects: nova/victoria
   Importance: Low
 Assignee: Balazs Gibizer (balazs-gibizer)
   Status: Confirmed

** Also affects: nova/rocky
   Importance: Undecided
   Status: New

** Changed in: nova/ussuri
   Importance: Undecided => Low

** Changed in: nova/ussuri
   Status: New => Triaged

** Changed in: nova/train
   Importance: Undecided => Low

** Changed in: nova/train
   Status: New => Triaged

** Changed in: nova/stein
   Importance: Undecided => Low

** Changed in: nova/stein
   Status: New => Triaged

** Changed in: nova/rocky
   Importance: Undecided => Low

** Changed in: nova/rocky
   Status: New => Triaged

** Changed in: nova/queens
   Importance: Undecided => Low

** Changed in: nova/queens
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896463

Title:
  evacuation failed: Port update failed : Unable to correlate PCI slot

Status in OpenStack Compute (nova):
  Confirmed
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged
Status in OpenStack Compute (nova) victoria series:
  Confirmed
