[Yahoo-eng-team] [Bug 1879878] Re: VM become Error after confirming resize with Error info CPUUnpinningInvalid on source node

2022-09-01 Thread kevinzhao
** Changed in: nova/train
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1879878

Title:
  VM become Error after confirming resize with Error info
  CPUUnpinningInvalid on source node

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  Fix Released

Bug description:
  Description
  ===

  In my environment, cleaning up the VM on the source node takes some time
  while a resize is being confirmed. During that window, the periodic task
  update_available_resource may update resource usage at the same time.
  The confirm-resize operation can then fail with an error like:
  CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []
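
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not nova's code: ToyNumaCell, periodic_update, confirm_resize and simulate are hypothetical names, and the sketch assumes that the periodic task rebuilds pinning usage from scratch and no longer counts the resized instance on the source host. The subset check mirrors the error message quoted in this report.

```python
class CPUUnpinningInvalid(Exception):
    """Toy stand-in for the exception named in the report."""


class ToyNumaCell:
    """Hypothetical, simplified model of a host NUMA cell's pinning state."""

    def __init__(self, pinned_cpus=()):
        self.pinned_cpus = set(pinned_cpus)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        # Unpinning is only valid for CPUs currently recorded as pinned.
        if not cpus <= self.pinned_cpus:
            raise CPUUnpinningInvalid(
                "CPU set to unpin %s must be a subset of pinned CPU set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus


def periodic_update(cell, tracked_instance_cpus):
    """Rebuild usage from scratch from the instances still tracked here."""
    cell.pinned_cpus = set()
    for cpus in tracked_instance_cpus:
        cell.pinned_cpus |= set(cpus)


def confirm_resize(cell, old_cpus):
    """Release the pinning held for the old flavor on the source node."""
    cell.unpin_cpus(old_cpus)


def simulate(periodic_runs_first):
    cell = ToyNumaCell(pinned_cpus=[1, 2, 17, 18])
    if periodic_runs_first:
        # Assumption: the periodic task no longer sees the resized instance
        # on the source host, so the rebuilt pinned set is empty.
        periodic_update(cell, tracked_instance_cpus=[])
    try:
        confirm_resize(cell, old_cpus=[1, 2, 18, 17])
    except CPUUnpinningInvalid as exc:
        return "ERROR: %s" % exc
    return "ok"


print(simulate(periodic_runs_first=False))
print(simulate(periodic_runs_first=True))
```

In this model the confirm step succeeds when it runs alone and fails only when the periodic update empties the pinned set first, which matches the empty "pinned CPU set []" in the reported error.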

  
   

  Steps to reproduce
  ==
  * Set "update_resources_interval" in /etc/nova/nova.conf to a small value,
  say 30 seconds, on the compute nodes. This increases the probability of
  hitting the error.

  * create a "dedicated" VM, the flavor can be
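  +----------------------------+--------------------------------------+
  | Property                   | Value                                |
  +----------------------------+--------------------------------------+
  | OS-FLV-DISABLED:disabled   | False                                |
  | OS-FLV-EXT-DATA:ephemeral  | 0                                    |
  | disk                       | 80                                   |
  | extra_specs                | {"hw:cpu_policy": "dedicated"}       |
  | id                         | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
  | name                       | 4vcpu.4mem.80ssd.0eph.numa           |
  | os-flavor-access:is_public | True                                 |
  | ram                        | 4096                                 |
  | rxtx_factor                | 1.0                                  |
  | swap                       |                                      |
  | vcpus                      | 4                                    |
  +----------------------------+--------------------------------------+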

  * Resize the VM with a new flavor to another node.

  * Confirm the resize.
  Make sure that undefining the VM on the source node takes some time; a delay
  of 30 seconds makes the failure inevitable.

  * The dashboard then shows the ERROR notice, and the VM goes into the
  ERROR state.

  
  Expected result
  ===
  The VM is resized successfully and its vm state is active.

  
  Actual result
  =

  * The VM goes into the ERROR state.

  * The dashboard shows this notice:
  Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []].


  Environment
  ===
  1. Exact version of OpenStack you are running.

Newton with the patch https://review.opendev.org/#/c/641806/21 applied.
I am sure it will also happen on newer versions that include
https://review.opendev.org/#/c/641806/21, such as Train and Ussuri.

  2. Which hypervisor did you use?
 Libvirt + KVM

  3. Which storage type did you use?
 local disk

  4. Which networking type did you use?
 Neutron with OpenVSwitch

  Logs & Configs
  ==

  ERROR log on source node
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 95262f1f45f14170b91cd8054bb36512 - - -] [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Setting instance vm_state to ERROR
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Traceback (most recent call last):
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6661, in _error_out_instance_on_exception
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     yield
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3444, in _confirm_resize
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     prefix='old_')
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     return f(*args, **kwargs)
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py"

[Yahoo-eng-team] [Bug 1879878] [NEW] VM become Error after confirming resize with Error info CPUUnpinningInvalid on source node

2020-05-20 Thread kevinzhao
:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1542, in get_host_numa_usage_from_instance
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     host_numa_topology, instance_numa_topology, free=free))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1409, in numa_usage_from_instances
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     newcell.unpin_cpus(pinned_cpus)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 95, in unpin_cpus
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     pinned=list(self.pinned_cpus))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]

** Affects: nova
     Importance: Undecided
       Assignee: kevinzhao (kego)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => kevinzhao (kego)

https://bugs.launchpad.net/bugs/1879878

Title:
  VM become Error after confirming resize with Error info
  CPUUnpinningInvalid on source node

Status in OpenStack Compute (nova):
  New


[Yahoo-eng-team] [Bug 1816543] Re: nova service-delete report ComputeHostNotFound when delete compute service after I delete other nova service on the same compute node

2020-05-19 Thread kevinzhao
This bug has been fixed; the fix is tracked in
https://bugs.launchpad.net/nova/+bug/1852993

** Changed in: nova
   Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/1816543

Title:
  nova service-delete report  ComputeHostNotFound when delete compute
  service after I delete  other nova service on the same compute node

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===
  nova service-delete reports ComputeHostNotFound when deleting the
  nova-compute service after another nova service (nova-consoleauth) on the
  same compute node has been deleted.
  The compute_node record should be removed according to the binary of the
  service being deleted: deleting the compute_node is appropriate only when
  that binary is nova-compute.

  Steps to reproduce
  ==
  1) nail1 is an all-in-one environment; the nova-compute and
  nova-consoleauth services both run on host nail1
  2) remove all instances on hypervisor nail1
  [root@nail1 ~]# nova service-list
  +--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
  | Id                                   | Binary           | Host  | Zone     | Status  | State | Updated_at             | Disabled Reason | Forced down |
  +--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
  | b4ca49a8-c3a9-4fc8-b9a8-f2d662e26060 | nova-conductor   | nail1 | internal | enabled | up    | 2019-02-19T06:39:49.00 | -               | False       |
  | e6ae7de7-d8dc-4364-84ed-1845fe967cb6 | nova-scheduler   | nail1 | internal | enabled | up    | 2019-02-19T06:39:43.00 | -               | False       |
  | ea3689d5-ace1-4561-acab-369b4e067053 | nova-compute     | nail1 | nova     | enabled | down  | 2019-02-19T06:35:41.00 | -               | False       |
  | 25da267f-9b7c-4cef-8044-9b26fc2aa18a | nova-compute     | nail2 | nova     | enabled | up    | 2019-02-19T06:39:50.00 | -               | False       |
  | 90686f1f-6a16-4c97-af9d-bdedb9ebec7d | nova-consoleauth | nail1 | internal | enabled | down  | 2019-02-19T06:37:48.00 | -               | False       |
  +--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+

  3) delete nova-consoleauth service on nail1
  [root@nail1 ~]# nova service-delete 90686f1f-6a16-4c97-af9d-bdedb9ebec7d

  
  4) delete nova-compute service on hypervisor nail1

  Actual result
  =
  [root@nail1 ~]# nova service-delete ea3689d5-ace1-4561-acab-369b4e067053
  ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
   (HTTP 500) (Request-ID: req-f283de97-7f00-4eae-af77-9155a7b9395d)

  
  Environment
  ===
  [root@nail1 ~]# rpm -qa|grep openstack-nova-compute
  openstack-nova-compute-18.0.2-1.el7.noarch

  hypervisor: Libvirt + KVM


  The relevant code is as follows:
  nova/db/sqlalchemy/api.py

  @pick_context_manager_writer
  def service_destroy(context, service_id):
      service = service_get(context, service_id)

      model_query(context, models.Service).\
          filter_by(id=service_id).\
          soft_delete(synchronize_session=False)

      # TODO(sbauza): Remove the service_id filter in a later release
      # once we are sure that all compute nodes report the host field
      model_query(context, models.ComputeNode).\
          filter(or_(models.ComputeNode.service_id == service_id,
                     models.ComputeNode.host == service['host'])).\
          soft_delete(synchronize_session=False)
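
The behaviour requested in the description (remove the compute_node only when the deleted service's binary is nova-compute) can be sketched with plain Python data. This is a toy model and not the actual nova change: the lists of dicts stand in for the database tables, and the "deleted" flag, the service ids and this simplified service_destroy signature are assumptions made for illustration.

```python
def service_destroy(services, compute_nodes, service_id):
    """Soft-delete a service; remove its compute node only for nova-compute.

    `services` and `compute_nodes` are plain lists of dicts standing in for
    the database tables (hypothetical, for illustration only).
    """
    service = next(s for s in services
                   if s["id"] == service_id and not s["deleted"])
    service["deleted"] = True

    # Assumed behaviour: only the nova-compute binary owns the compute node.
    if service["binary"] != "nova-compute":
        return
    for node in compute_nodes:
        if node["service_id"] == service_id or node["host"] == service["host"]:
            node["deleted"] = True


services = [
    {"id": "svc-compute", "binary": "nova-compute",
     "host": "nail1", "deleted": False},
    {"id": "svc-consoleauth", "binary": "nova-consoleauth",
     "host": "nail1", "deleted": False},
]
compute_nodes = [
    {"service_id": "svc-compute", "host": "nail1", "deleted": False},
]

# Deleting nova-consoleauth leaves the compute node in place ...
service_destroy(services, compute_nodes, "svc-consoleauth")
print(compute_nodes[0]["deleted"])

# ... so the later nova-compute deletion still finds and removes it.
service_destroy(services, compute_nodes, "svc-compute")
print(compute_nodes[0]["deleted"])
```

Without the binary check, the host-based match would already remove the compute node when nova-consoleauth is deleted, which is consistent with the ComputeHostNotFound error seen on the second deletion.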

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1816543/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1816543] [NEW] nova service-delete report ComputeHostNotFound when delete compute service after I delete other nova service on the same compute node

2019-02-19 Thread kevinzhao
Public bug reported:

Description
===
nova service-delete reports ComputeHostNotFound when deleting the nova-compute
service after another nova service (nova-consoleauth) on the same compute
node has been deleted.
The compute_node record should be removed according to the binary of the
service being deleted: deleting the compute_node is appropriate only when
that binary is nova-compute.

Steps to reproduce
==
1) nail1 is an all-in-one environment; the nova-compute and
nova-consoleauth services both run on host nail1
2) remove all instances on hypervisor nail1
[root@nail1 ~]# nova service-list
+--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
| Id                                   | Binary           | Host  | Zone     | Status  | State | Updated_at             | Disabled Reason | Forced down |
+--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
| b4ca49a8-c3a9-4fc8-b9a8-f2d662e26060 | nova-conductor   | nail1 | internal | enabled | up    | 2019-02-19T06:39:49.00 | -               | False       |
| e6ae7de7-d8dc-4364-84ed-1845fe967cb6 | nova-scheduler   | nail1 | internal | enabled | up    | 2019-02-19T06:39:43.00 | -               | False       |
| ea3689d5-ace1-4561-acab-369b4e067053 | nova-compute     | nail1 | nova     | enabled | down  | 2019-02-19T06:35:41.00 | -               | False       |
| 25da267f-9b7c-4cef-8044-9b26fc2aa18a | nova-compute     | nail2 | nova     | enabled | up    | 2019-02-19T06:39:50.00 | -               | False       |
| 90686f1f-6a16-4c97-af9d-bdedb9ebec7d | nova-consoleauth | nail1 | internal | enabled | down  | 2019-02-19T06:37:48.00 | -               | False       |
+--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+

3) delete nova-consoleauth service on nail1
[root@nail1 ~]# nova service-delete 90686f1f-6a16-4c97-af9d-bdedb9ebec7d


4) delete nova-compute service on hypervisor nail1

Actual result
=
[root@nail1 ~]# nova service-delete ea3689d5-ace1-4561-acab-369b4e067053
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
 (HTTP 500) (Request-ID: req-f283de97-7f00-4eae-af77-9155a7b9395d)


Environment
===
[root@nail1 ~]# rpm -qa|grep openstack-nova-compute
openstack-nova-compute-18.0.2-1.el7.noarch

hypervisor: Libvirt + KVM


The relevant code is as follows:
nova/db/sqlalchemy/api.py

@pick_context_manager_writer
def service_destroy(context, service_id):
    service = service_get(context, service_id)

    model_query(context, models.Service).\
        filter_by(id=service_id).\
        soft_delete(synchronize_session=False)

    # TODO(sbauza): Remove the service_id filter in a later release
    # once we are sure that all compute nodes report the host field
    model_query(context, models.ComputeNode).\
        filter(or_(models.ComputeNode.service_id == service_id,
                   models.ComputeNode.host == service['host'])).\
        soft_delete(synchronize_session=False)

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1816543

Title:
  nova service-delete report  ComputeHostNotFound when delete compute
  service after I delete  other nova service on the same compute node

Status in OpenStack Compute (nova):
  New
