Public bug reported:

Description
===========
When a live migration is aborted, the NUMA topology computed for the
destination host is applied to the instance on the source host. The
instance keeps running on the source, but a later hard reboot
re-calculates the domain XML from the updated NUMA topology, which
points at a NUMA cell with no free resources, and the VM fails to
recover.
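The rollback this report expects on abort can be sketched as follows. This is only an illustrative sketch with toy dictionaries, not nova's actual objects or code paths: on abort, the old (source) NUMA topology recorded in the migration context should be restored and the context dropped.

```python
# Illustrative sketch (NOT nova's actual code): on live-migration abort,
# old_numa_topology from the migration context should be restored on the
# source and the context deleted, instead of leaving new_numa_topology
# applied as this bug shows.

def rollback_on_abort(instance):
    """Restore the pre-migration NUMA topology after an aborted live migration."""
    ctx = instance.get("migration_context")
    if ctx is None:
        return instance  # nothing to roll back
    # Keep the topology the instance had on the source host...
    instance["numa_topology"] = ctx["old_numa_topology"]
    # ...and drop the now-stale migration context.
    instance["migration_context"] = None
    return instance

# Toy data mirroring the bug: old cell id 1 (source), new cell id 0 (destination).
vm = {
    "numa_topology": {"cells": [{"id": 0}]},  # wrongly left at the destination cell
    "migration_context": {
        "old_numa_topology": {"cells": [{"id": 1}]},
        "new_numa_topology": {"cells": [{"id": 0}]},
    },
}
vm = rollback_on_abort(vm)
# vm["numa_topology"] now points at cell 1 again and the context is gone.
```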
Steps to reproduce [100%]
=========================
Each compute node must have two NUMA cells (sockets).

The VM flavor has the following extra specs:
  hw:mem_page_size='1048576'
  hw:numa_nodes='1'
This flavor needs 100 huge pages.

Before running the test, confirm the source and destination have the
huge-page resources shown below. The VM will be moved from NUMA node 1
on the source to NUMA node 0 on compute-2.

Source:
[compute1] ~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 50
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50    <- source: the test VM runs on NUMA node 1
Node 1 HugePages_Surp: 0

Destination:
[compute-2] ~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 130   <- destination node has 130 free huge pages
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50
Node 1 HugePages_Surp: 0

Before the live migration, the instance's NUMA topology in the DB:

MariaDB [nova]> select numa_topology from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
| numa_topology |
| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 1, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["pagesize", "id"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]} |

MariaDB [nova]> select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
<empty>
-----END of DB-----

# Trigger the live migration
# Apply stress inside the VM so the migration runs longer
# The migration context is created for this VM:

MariaDB [nova]> select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
| migration_context |
| {"nova_object.name": "MigrationContext", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"instance_uuid": "4b115eb3-59f7-4e27-b877-2e326ef017b3", "migration_id": 283, "new_numa_topology": {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["cpuset_reserved", "id", "pcpuset", "pagesize", "cpu_pinning_raw", "cpu_policy", "cpu_thread_policy", "memory", "cpuset"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}, "old_numa_topology": {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 1, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}

The old NUMA cell is 1; the new NUMA cell is 0.

# Trigger the abort

Feb 13 20:59:00 cdc-appblx095-36 nova-compute[638201]: 2024-02-13 20:59:00.991 638201 ERROR nova.virt.libvirt.driver [None req-05850c05-ba5b-40ae-a37c-5ccdde8ded47 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 4b115eb3-59f7-4e27-b877-2e326ef017b3] Migration operation has aborted

After the abort, the instance's numa_topology was updated to NUMA cell
0, which belongs to the destination:

| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["cpu_thread_policy", "cpuset_reserved", "cpu_pinning_raw", "cpuset", "cpu_policy", "memory", "pagesize", "pcpuset", "id"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]} |

The migration context is not deleted.

Expected result
===============
On abort, the VM's NUMA topology should be rolled back to its original
(source) state, so that a later hard reboot succeeds.

Actual result
=============
After the abort the VM keeps the NUMA topology calculated for the
destination, and the migration context is not deleted. A subsequent
hard reboot fails because the recorded NUMA cell has no available
resources on the source host.

Environment
===========
OpenStack Antelope; kernel: Ubuntu 6.5.0-15-generic
#15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

** Affects: nova
   Importance: Undecided
   Assignee: keerthivasan (keerthivassan86)
   Status: New

** Changed in: nova
   Assignee: (unassigned) => keerthivasan (keerthivassan86)

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2053163

Title:
  VM hard reboot fails on Live Migration Abort with node having Two
  numa sockets

Status in OpenStack Compute (nova):
  New
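The cell-id flip visible in the migration_context dump above (source cell 1 to destination cell 0) can be checked programmatically. A minimal sketch against a trimmed-down copy of that JSON; only the fields needed for the comparison are kept:

```python
import json

# Trimmed copy of the migration_context shown in the report; the full
# dump carries many more fields, omitted here for brevity.
migration_context = json.loads("""
{"nova_object.name": "MigrationContext",
 "nova_object.data": {
   "migration_id": 283,
   "new_numa_topology": {"nova_object.data": {"cells": [{"nova_object.data": {"id": 0}}]}},
   "old_numa_topology": {"nova_object.data": {"cells": [{"nova_object.data": {"id": 1}}]}}}}
""")

def cell_ids(topology):
    """Return the NUMA cell ids recorded in an InstanceNUMATopology dump."""
    return [c["nova_object.data"]["id"]
            for c in topology["nova_object.data"]["cells"]]

data = migration_context["nova_object.data"]
old_ids = cell_ids(data["old_numa_topology"])  # source cell(s)
new_ids = cell_ids(data["new_numa_topology"])  # destination cell(s)
print(old_ids, new_ids)  # prints: [1] [0]
```

After the abort, the instance's numa_topology should match old_ids; the bug is that it matches new_ids instead.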