[Yahoo-eng-team] [Bug 1879878] Re: VM become Error after confirming resize with Error info CPUUnpinningInvalid on source node
** Changed in: nova/train
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1879878

Title:
  VM become Error after confirming resize with Error info
  CPUUnpinningInvalid on source node

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  Fix Released

Bug description:
  Description
  ===========

  In my environment, cleaning up the VM on the source node takes some
  time while a resize is being confirmed. During the confirm-resize
  process, the periodic task update_available_resource may update
  resource usage at the same time. This can cause an error like:

    CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []

  Steps to reproduce
  ==================

  * Set "update_resources_interval" in /etc/nova/nova.conf to a small
    value, say 30 seconds, on the compute nodes. This increases the
    probability of the error.
  * Create a "dedicated" VM. The flavor can be:

    +----------------------------+--------------------------------------+
    | Property                   | Value                                |
    +----------------------------+--------------------------------------+
    | OS-FLV-DISABLED:disabled   | False                                |
    | OS-FLV-EXT-DATA:ephemeral  | 0                                    |
    | disk                       | 80                                   |
    | extra_specs                | {"hw:cpu_policy": "dedicated"}       |
    | id                         | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
    | name                       | 4vcpu.4mem.80ssd.0eph.numa           |
    | os-flavor-access:is_public | True                                 |
    | ram                        | 4096                                 |
    | rxtx_factor                | 1.0                                  |
    | swap                       |                                      |
    | vcpus                      | 4                                    |
    +----------------------------+--------------------------------------+

  * Resize the VM with a new flavor to another node.
  * Confirm the resize. Make sure that undefining the VM on the source
    node takes some time; 30 seconds will inevitably trigger the error.
  * The ERROR notice then appears on the dashboard, and the VM goes to
    ERROR.

  Expected result
  ===============

  The VM is resized successfully and its state is active.

  Actual result
  =============

  * The VM goes to ERROR.
  * The dashboard shows this notice:

    Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []].

  Environment
  ===========

  1. Exact version of OpenStack you are running.

     Newton with the patch https://review.opendev.org/#/c/641806/21
     I am sure it will also happen on newer versions that include
     https://review.opendev.org/#/c/641806/21 such as Train and Ussuri.

  2. Which hypervisor did you use?

     Libvirt + KVM

  3. Which storage type did you use?

     local disk

  4. Which networking type did you use?

     Neutron with OpenVSwitch

  Logs & Configs
  ==============

  ERROR log on source node:

  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 95262f1f45f14170b91cd8054bb36512 - - -] [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Setting instance vm_state to ERROR
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Traceback (most recent call last):
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6661, in _error_out_instance_on_exception
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     yield
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3444, in _confirm_resize
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     prefix='old_')
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     return f(*args, **kwargs)
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py"
[Yahoo-eng-team] [Bug 1879878] [NEW] VM become Error after confirming resize with Error info CPUUnpinningInvalid on source node
:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1542, in get_host_numa_usage_from_instance
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     host_numa_topology, instance_numa_topology, free=free))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1409, in numa_usage_from_instances
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     newcell.unpin_cpus(pinned_cpus)
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 95, in unpin_cpus
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     pinned=list(self.pinned_cpus))
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []
2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]

** Affects: nova
     Importance: Undecided
     Assignee: kevinzhao (kego)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => kevinzhao (kego)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1879878

Title:
  VM become Error after confirming resize with Error info
  CPUUnpinningInvalid on source node

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========

  In my environment, cleaning up the VM on the source node takes some
  time while a resize is being confirmed. During the confirm-resize
  process, the periodic task update_available_resource may update
  resource usage at the same time. This can cause an error like:

    CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []

  Steps to reproduce
  ==================

  * Set "update_resources_interval" in /etc/nova/nova.conf to a small
    value, say 30 seconds, on the compute nodes. This increases the
    probability of the error.
  * Create a "dedicated" VM. The flavor can be:

    +----------------------------+--------------------------------------+
    | Property                   | Value                                |
    +----------------------------+--------------------------------------+
    | OS-FLV-DISABLED:disabled   | False                                |
    | OS-FLV-EXT-DATA:ephemeral  | 0                                    |
    | disk                       | 80                                   |
    | extra_specs                | {"hw:cpu_policy": "dedicated"}       |
    | id                         | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
    | name                       | 4vcpu.4mem.80ssd.0eph.numa           |
    | os-flavor-access:is_public | True                                 |
    | ram                        | 4096                                 |
    | rxtx_factor                | 1.0                                  |
    | swap                       |                                      |
    | vcpus                      | 4                                    |
    +----------------------------+--------------------------------------+

  * Resize the VM with a new flavor to another node.
  * Confirm the resize. Make sure that undefining the VM on the source
    node takes some time; 30 seconds will inevitably trigger the error.
  * The ERROR notice then appears on the dashboard, and the VM goes to
    ERROR.

  Expected result
  ===============

  The VM is resized successfully and its state is active.

  Actual result
  =============

  * The VM goes to ERROR.
  * The dashboard shows this notice:

    Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []].

  Environment
  ===========

  1. Exact version of OpenStack you are running.

     Newton with the patch https://review.opendev.org/#/c/641806/21
     I am sure it will also happen on newer versions that include
     https://review.opendev.org/#/c/641806/21 such as Train and Ussuri.

  2. Which hypervisor did you use?

     Libvirt + KVM

  3. Which storage type did you use?

     local disk

  4. Which networking type did you use?

     Neutron with OpenVSwitch

  Logs & Configs
  ==============

  ERROR log on source node:

  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 95262f1f45f14170b91cd8054bb36512 - - -] [instance: 993138d
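The exception in the traceback above comes from a subset check: the CPUs to unpin must all be in the cell's pinned set. The following is a minimal sketch of that check and of the suspected ordering problem, not nova's code. The `Cell` class and `simulate` function are hypothetical names, and the idea that the periodic task has already rebuilt the host usage without the resized instance is an assumption used to explain the empty pinned set.

```python
class CPUUnpinningInvalid(Exception):
    """Stand-in for the nova exception of the same name."""

    def __init__(self, requested, pinned):
        super().__init__(
            "CPU set to unpin %s must be a subset of pinned CPU set %s"
            % (sorted(requested), sorted(pinned)))


class Cell:
    """Hypothetical, simplified model of one host NUMA cell's pinning state."""

    def __init__(self):
        self.pinned_cpus = set()

    def pin_cpus(self, cpus):
        self.pinned_cpus |= set(cpus)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        # Refuse to unpin anything that is not currently pinned.
        if not cpus <= self.pinned_cpus:
            raise CPUUnpinningInvalid(cpus, self.pinned_cpus)
        self.pinned_cpus -= cpus


def simulate(periodic_runs_first):
    """Return "ok" or the error text for one confirm-resize attempt."""
    cell = Cell()
    cell.pin_cpus([1, 2, 17, 18])  # old-flavor usage on the source host
    if periodic_runs_first:
        # Assumption: the periodic task recomputed usage from scratch and
        # no longer counts the resized instance on this host.
        cell = Cell()
    try:
        cell.unpin_cpus([1, 2, 17, 18])  # confirm resize drops old usage
        return "ok"
    except CPUUnpinningInvalid as exc:
        return str(exc)


if __name__ == "__main__":
    print(simulate(periodic_runs_first=False))
    print(simulate(periodic_runs_first=True))
```

In this model the second call fails because the usage is released twice, once by the recomputation and once by the confirm step, which matches the empty pinned set reported in the log.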
[Yahoo-eng-team] [Bug 1816543] Re: nova service-delete report ComputeHostNotFound when delete compute service after I delete other nova service on the same compute node
This bug has been fixed; the fix is tracked in
https://bugs.launchpad.net/nova/+bug/1852993

** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1816543

Title:
  nova service-delete report ComputeHostNotFound when delete compute
  service after I delete other nova service on the same compute node

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========

  nova service-delete reports ComputeHostNotFound when deleting the
  nova-compute service after another nova service (nova-consoleauth) on
  the same compute node has been deleted.

  The compute_node should be removed according to the binary of the
  service being deleted. When the binary of the service being deleted
  is nova-compute, it is appropriate to delete the compute_node.

  Steps to reproduce
  ==================

  1) nail1 is an all-in-one environment; the nova-compute and
     nova-consoleauth services both run on host nail1.
  2) Remove all instances on hypervisor nail1.

  [root@nail1 ~]# nova service-list
  +--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
  | Id                                   | Binary           | Host  | Zone     | Status  | State | Updated_at             | Disabled Reason | Forced down |
  +--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
  | b4ca49a8-c3a9-4fc8-b9a8-f2d662e26060 | nova-conductor   | nail1 | internal | enabled | up    | 2019-02-19T06:39:49.00 | -               | False       |
  | e6ae7de7-d8dc-4364-84ed-1845fe967cb6 | nova-scheduler   | nail1 | internal | enabled | up    | 2019-02-19T06:39:43.00 | -               | False       |
  | ea3689d5-ace1-4561-acab-369b4e067053 | nova-compute     | nail1 | nova     | enabled | down  | 2019-02-19T06:35:41.00 | -               | False       |
  | 25da267f-9b7c-4cef-8044-9b26fc2aa18a | nova-compute     | nail2 | nova     | enabled | up    | 2019-02-19T06:39:50.00 | -               | False       |
  | 90686f1f-6a16-4c97-af9d-bdedb9ebec7d | nova-consoleauth | nail1 | internal | enabled | down  | 2019-02-19T06:37:48.00 | -               | False       |
  +--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+

  3) Delete the nova-consoleauth service on nail1.

  [root@nail1 ~]# nova service-delete 90686f1f-6a16-4c97-af9d-bdedb9ebec7d

  4) Delete the nova-compute service on hypervisor nail1.

  Actual result
  =============

  [root@nail1 ~]# nova service-delete ea3689d5-ace1-4561-acab-369b4e067053
  ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
   (HTTP 500) (Request-ID: req-f283de97-7f00-4eae-af77-9155a7b9395d)

  Environment
  ===========

  [root@nail1 ~]# rpm -qa|grep openstack-nova-compute
  openstack-nova-compute-18.0.2-1.el7.noarch

  hypervisor: Libvirt + KVM

  The relevant code is as follows:

  nova/db/sqlalchemy/api.py

  @pick_context_manager_writer
  def service_destroy(context, service_id):
      service = service_get(context, service_id)

      model_query(context, models.Service).\
                  filter_by(id=service_id).\
                  soft_delete(synchronize_session=False)

      # TODO(sbauza): Remove the service_id filter in a later release
      # once we are sure that all compute nodes report the host field
      model_query(context, models.ComputeNode).\
                  filter(or_(models.ComputeNode.service_id == service_id,
                             models.ComputeNode.host == service['host'])).\
                  soft_delete(synchronize_session=False)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1816543/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
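The behaviour described in this report can be illustrated with a small in-memory model. This is only a sketch: it does not use nova's DB API, the data layout is made up for illustration, and the `check_binary` flag is a hypothetical way of expressing the reporter's suggestion (only remove the compute node when the deleted service's binary is nova-compute). It is not claimed to be the fix that was merged upstream.

```python
def make_state():
    """Build a toy copy of the services and compute nodes on host nail1."""
    services = {
        "compute-id": {"host": "nail1", "binary": "nova-compute",
                       "deleted": False},
        "consoleauth-id": {"host": "nail1", "binary": "nova-consoleauth",
                           "deleted": False},
    }
    compute_nodes = [
        {"host": "nail1", "service_id": "compute-id", "deleted": False},
    ]
    return services, compute_nodes


def service_destroy(services, compute_nodes, service_id, check_binary=False):
    """Soft-delete a service and, depending on policy, its compute nodes."""
    service = services[service_id]
    service["deleted"] = True

    # Hypothetical guard: skip compute node cleanup for other binaries.
    if check_binary and service["binary"] != "nova-compute":
        return

    # Mirrors the quoted filter: match on service_id OR on the host name.
    for node in compute_nodes:
        if (node["service_id"] == service_id
                or node["host"] == service["host"]):
            node["deleted"] = True


if __name__ == "__main__":
    for guard in (False, True):
        services, nodes = make_state()
        service_destroy(services, nodes, "consoleauth-id", check_binary=guard)
        print("check_binary=%s -> compute node deleted: %s"
              % (guard, nodes[0]["deleted"]))
```

Without the guard, deleting nova-consoleauth also soft-deletes the compute node for nail1 because the host name matches, so the later nova-compute deletion has no compute node left to find.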
[Yahoo-eng-team] [Bug 1816543] [NEW] nova service-delete report ComputeHostNotFound when delete compute service after I delete other nova service on the same compute node
Public bug reported:

Description
===========

nova service-delete reports ComputeHostNotFound when deleting the
nova-compute service after another nova service (nova-consoleauth) on
the same compute node has been deleted.

The compute_node should be removed according to the binary of the
service being deleted. When the binary of the service being deleted is
nova-compute, it is appropriate to delete the compute_node.

Steps to reproduce
==================

1) nail1 is an all-in-one environment; the nova-compute and
   nova-consoleauth services both run on host nail1.
2) Remove all instances on hypervisor nail1.

[root@nail1 ~]# nova service-list
+--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
| Id                                   | Binary           | Host  | Zone     | Status  | State | Updated_at             | Disabled Reason | Forced down |
+--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
| b4ca49a8-c3a9-4fc8-b9a8-f2d662e26060 | nova-conductor   | nail1 | internal | enabled | up    | 2019-02-19T06:39:49.00 | -               | False       |
| e6ae7de7-d8dc-4364-84ed-1845fe967cb6 | nova-scheduler   | nail1 | internal | enabled | up    | 2019-02-19T06:39:43.00 | -               | False       |
| ea3689d5-ace1-4561-acab-369b4e067053 | nova-compute     | nail1 | nova     | enabled | down  | 2019-02-19T06:35:41.00 | -               | False       |
| 25da267f-9b7c-4cef-8044-9b26fc2aa18a | nova-compute     | nail2 | nova     | enabled | up    | 2019-02-19T06:39:50.00 | -               | False       |
| 90686f1f-6a16-4c97-af9d-bdedb9ebec7d | nova-consoleauth | nail1 | internal | enabled | down  | 2019-02-19T06:37:48.00 | -               | False       |
+--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+

3) Delete the nova-consoleauth service on nail1.

[root@nail1 ~]# nova service-delete 90686f1f-6a16-4c97-af9d-bdedb9ebec7d

4) Delete the nova-compute service on hypervisor nail1.

Actual result
=============

[root@nail1 ~]# nova service-delete ea3689d5-ace1-4561-acab-369b4e067053
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
 (HTTP 500) (Request-ID: req-f283de97-7f00-4eae-af77-9155a7b9395d)

Environment
===========

[root@nail1 ~]# rpm -qa|grep openstack-nova-compute
openstack-nova-compute-18.0.2-1.el7.noarch

hypervisor: Libvirt + KVM

The relevant code is as follows:

nova/db/sqlalchemy/api.py

@pick_context_manager_writer
def service_destroy(context, service_id):
    service = service_get(context, service_id)

    model_query(context, models.Service).\
                filter_by(id=service_id).\
                soft_delete(synchronize_session=False)

    # TODO(sbauza): Remove the service_id filter in a later release
    # once we are sure that all compute nodes report the host field
    model_query(context, models.ComputeNode).\
                filter(or_(models.ComputeNode.service_id == service_id,
                           models.ComputeNode.host == service['host'])).\
                soft_delete(synchronize_session=False)

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1816543

Title:
  nova service-delete report ComputeHostNotFound when delete compute
  service after I delete other nova service on the same compute node

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========

  nova service-delete reports ComputeHostNotFound when deleting the
  nova-compute service after another nova service (nova-consoleauth) on
  the same compute node has been deleted.

  The compute_node should be removed according to the binary of the
  service being deleted. When the binary of the service being deleted
  is nova-compute, it is appropriate to delete the compute_node.

  Steps to reproduce
  ==================

  1) nail1 is an all-in-one environment; the nova-compute and
     nova-consoleauth services both run on host nail1.
  2) Remove all instances on hypervisor nail1.

  [root@nail1 ~]# nova service-list
  +--------------------------------------+------------------+-------+----------+---------+-------+------------------------+-----------------+-------------+
  | Id                                   | Binary           | Host  | Zone     | Status  | State | Updated_at             | Disabled Reason | Forced down |
  +--+--+-