[Yahoo-eng-team] [Bug 1981814] [NEW] swap_volume: may cause I/O errors or lose user data if the task fails
Public bug reported:

Description
===
swap_volume is a general and important operation for instances with in-use volumes. The whole process consists of 3 steps in nova:

* first: connect the new volume to the libvirt guest (the instance is still using the old volume);
* second: copy or rebase the old volume's data onto the new volume (the instance is still using the old volume);
* third: update the volume states in cinder and the block_device_mapping in nova (the instance is now using the new volume).

But the exception handling is too simple: a rollback is executed if an exception occurs in any step, and the volume the guest is actually using is ignored. The rollback disconnects the new volume and deletes the new attachment. Clearly, when an exception is raised in the third step we cannot roll back, and should instead continue to complete the task if the exception is not fatal; otherwise an Input/Output error occurs when the user reads or writes the disk, and user data may be lost if it was written to the new volume that the rollback then discards.

Steps to reproduce
===
1. Create an instance and attach an available volume to it:
   $ openstack server create my-vm --flavor m1.medium --image --network
   $ openstack volume create my-vol --type --size 100
   $ openstack server add volume my-vm my-vol
2. In my-vm, make a file system on /dev/vdc, mount it, then read and write it:
   $ mkfs.ext4 /dev/vdc
   $ mount /dev/vdc /mnt
   $ touch /mnt/test
   $ fio -rw=randrw -ioengine=libaio -bs=4K -size=20G -filename=/mnt/test ...
3. Retype the volume:
   $ openstack volume set my-vol --type --retype-policy on-demand
4. Some accident causes nova to fail to disconnect the old volume in the third step, after the second step has finished successfully, so the task finally fails.
5. fio can no longer read or write the file /mnt/test.

Expected result
===
After the exception in step 4, the disk should still read and write normally.

Actual result
===
As in step 5, the user cannot read or write the disk.

Environment
===
1. nova version: 22.0.1
2. hypervisor: Libvirt + Qemu
3. storage: Ceph, FC-SAN, LVM
4. network: Neutron + OVS

Logs & Configs
===

** Affects: nova
   Importance: Undecided
   Status: Confirmed

** Changed in: nova
   Status: New => Confirmed

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1981814
Title: swap_volume: may cause I/O errors or lose user data if the task fails
Status in OpenStack Compute (nova): Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1981814/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
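The three-step flow and the rollback hazard described in this report can be sketched as follows. This is a minimal illustration of the error handling the reporter argues for, not nova's actual code; all names and callables are hypothetical stand-ins for the three nova steps plus the rollback:

```python
# Sketch (not nova code): roll back the new attachment only while the
# guest is still on the old volume; once the data copy has completed and
# the guest has switched to the new volume, press on instead of rolling
# back, because detaching the in-use volume causes I/O errors / data loss.

class SwapVolumeTask:
    def __init__(self, connect, copy_data, update_records,
                 rollback_new_attachment):
        self._connect = connect                  # step 1 (hypothetical)
        self._copy_data = copy_data              # step 2 (hypothetical)
        self._update_records = update_records    # step 3 (hypothetical)
        self._rollback = rollback_new_attachment # disconnect + delete attachment
        self.guest_on_new_volume = False

    def run(self):
        try:
            self._connect()     # guest still uses the old volume
            self._copy_data()   # guest still uses the old volume
            self.guest_on_new_volume = True
        except Exception:
            # Safe to roll back: the guest never switched volumes.
            self._rollback()
            raise
        try:
            self._update_records()  # guest now uses the new volume
        except Exception:
            # Do NOT roll back here: that would detach the volume the
            # guest is actively using. Surface the error instead.
            raise
```

With this shape, a cinder/nova record-update failure in step 3 leaves the guest on the new volume and its data intact, instead of triggering the destructive rollback the report describes.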
[Yahoo-eng-team] [Bug 1949051] [NEW] nova-compute service running the IronicDriver may leak memory
Public bug reported:

Description
===
We run the nova-compute service with the IronicDriver in a k8s cluster as a StatefulSet pod, with a 1 GiB memory limit and only this service in the pod. There are about 40 nodes in our test environment. Most of them have instances and are in the active provision state. Some nodes fail to connect to IPMI, so their power status cannot be obtained. In about 12 hours the memory limit is exceeded and the pod is restarted.

Steps to reproduce
===
Nothing needs to be done. Note the following:
1. The more nodes there are, the faster the memory grows and the sooner the limit is exceeded.
2. Even with only one node, the memory limit will eventually be exceeded, though it takes much longer.
3. In our environment memory grows roughly every 10 minutes, so we suspect it is caused by a periodic task, maybe the `_sync_power_states` task.
4. I am not sure whether the IPMI connection failures have any impact.

Expected result
===
The pod's memory usage should be stable when we are not performing operations on nodes/instances.

Actual result
===
Memory keeps increasing until the limit is exceeded and the pod is restarted.

Environment
===
openstack versions
- nova: 22.0.1
- ironic: 16.0.1

Logs & Configs
===

** Affects: nova
   Importance: Undecided
   Status: New

--
https://bugs.launchpad.net/bugs/1949051
Title: nova-compute service running the IronicDriver may leak memory
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1949051/+subscriptions
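One way to confirm the suspicion that a periodic task (such as `_sync_power_states`) is the source of the growth is to snapshot allocations around repeated runs of the task with the standard-library `tracemalloc` module. A generic sketch, not nova code; `leaky_task` below is a deliberately leaky stand-in for the suspect task:

```python
# Run a suspect task repeatedly and report the allocation sites that grew.
import tracemalloc

def find_leak(task, iterations=5, top=3):
    """Return the `top` allocation-site diffs after `iterations` runs of task."""
    tracemalloc.start()
    task()  # warm-up, so one-time caches don't show up as leaks
    before = tracemalloc.take_snapshot()
    for _ in range(iterations):
        task()
    after = tracemalloc.take_snapshot()
    stats = after.compare_to(before, "lineno")  # sorted by biggest growth
    tracemalloc.stop()
    return stats[:top]

_hoard = []

def leaky_task():
    # Simulated leak: each run retains ~100 KiB, the way a cache that is
    # never pruned (or a client object created per call) would.
    _hoard.append(bytearray(100 * 1024))

# for stat in find_leak(leaky_task):
#     print(stat)
```

A steadily positive `size_diff` at the same source line across runs points at the leaking allocation site; a stable diff would instead suggest the growth comes from outside the task.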
[Yahoo-eng-team] [Bug 1938400] [NEW] compute service for ironic instances is not highly available
Public bug reported:

Description
===
An ironic instance cannot be managed by nova if its own compute service is down, even when the environment has multiple compute hosts with services in the "up" state.

Steps to reproduce
===
1. Deploy multiple compute services with the IronicDriver; call them svc1, svc2, svc3.
2. Enroll a baremetal node in ironic; the node is registered as a hypervisor in nova, and assume the hypervisor's host is svc1.
3. Create a running baremetal instance on the baremetal node using nova-compute.
4. At this point we can manage this ironic instance with nova, e.g. power it on/off.
5. Bring the compute service of svc1 down.
6. Now we cannot show the hypervisor info of this node, and cannot power the instance on/off.

Expected result
===
We have 3 compute services; while svc1 is down, the others should be able to manage this instance.

Actual result
===
We cannot do anything with this instance.

Environment
===
All versions of nova.

Other
===
This is because the IronicDriver follows the libvirt logic, but an ironic compute service only has management duties; it does not need to own instance creation the way libvirt does for virtual machines.

** Affects: nova
   Importance: Undecided
   Status: New

--
https://bugs.launchpad.net/bugs/1938400
Title: compute service for ironic instances is not highly available
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1938400/+subscriptions
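The failover behaviour this report asks for can be sketched as follows: instead of pinning an ironic node to one compute host, map it onto whichever services are currently up, so a single service going down does not orphan the node. This loosely mimics a consistent-hash-ring placement; it is an illustration of the idea, not nova's actual internals, and all names here are hypothetical:

```python
# Sketch: deterministically map an ironic node onto one of the currently
# "up" compute services; when that service goes down, the node remaps to
# another live service instead of becoming unmanageable.
import hashlib

def pick_service(node_uuid, services):
    """Return an up service for the node, or None if all are down.

    `services` maps service name -> True (up) / False (down).
    """
    alive = sorted(name for name, up in services.items() if up)
    if not alive:
        return None
    digest = hashlib.md5(node_uuid.encode()).hexdigest()
    return alive[int(digest, 16) % len(alive)]
```

In the reproduction steps above, bringing svc1 down would then simply shift the node's management to svc2 or svc3 rather than leaving the instance unmanageable.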
[Yahoo-eng-team] [Bug 1923281] [NEW] failed to attach a volume with multiattach to an ironic instance
Public bug reported:

Description
===
Attaching a volume that has `multiattach` set to an ironic instance is not supported. The attach succeeds if the volume's `multiattach` is `false`. Additionally, the storage back end for the volume does support the `multiattach` property.

Steps to reproduce
===
* Attach a multiattach volume to an ironic instance:
  `openstack --os-compute-api-version 2.60 server add volume `

Expected result
===
The volume is attached to the server on /dev/*

Actual result
===
Volume has 'multiattach' set, which is not supported for this instance. (HTTP 409)

Environment
===
* nova: 18.2.4
* cinder: 13.0.8
* ironic: 11.1.3
* storage type: G2 Series block storage of Inspur Inc.
* the volume is in the available state.
* the instance is active and powered on.
* the baremetal volume connector has been created and is confirmed available.

More
===
* Log from nova-compute for ironic:

ERROR oslo_messaging.rpc.server [req-17d65e1f-0db3-442c-87eb-57d94ffa6940 421c1e16837b4189b9a6ae04ba4af86b 6e3d2c325bc94a5e8dbb41a7a73ae593 - default default] Exception during message handling: nova.exception.MultiattachNotSupportedByVirtDriver: Volume 3e9b3371-d56d-4b16-a180-00b835993662 has 'multiattach' set, which is not supported for this instance.

** Affects: nova
   Importance: Undecided
   Status: New

--
https://bugs.launchpad.net/bugs/1923281
Title: failed to attach a volume with multiattach to an ironic instance
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1923281/+subscriptions
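The HTTP 409 in this report comes from nova rejecting the attach because the virt driver does not advertise multiattach support. A minimal sketch of that kind of capability gate follows; the exception message mirrors the log above, but the capability table and function names are illustrative assumptions, not nova's actual internals:

```python
# Sketch: per-driver capability flag checked before attaching a
# multiattach volume. Names are hypothetical, not nova's real code.

class MultiattachNotSupportedByVirtDriver(Exception):
    def __init__(self, volume_id):
        super().__init__(
            f"Volume {volume_id} has 'multiattach' set, "
            "which is not supported for this instance.")

DRIVER_CAPABILITIES = {
    # Assumed flags: the libvirt driver advertises multiattach support,
    # the ironic driver does not, matching the behaviour in this report.
    "libvirt": {"supports_multiattach": True},
    "ironic": {"supports_multiattach": False},
}

def check_attach(driver, volume):
    """Raise if `volume` (a dict with 'id'/'multiattach') can't be attached."""
    caps = DRIVER_CAPABILITIES.get(driver, {})
    if volume.get("multiattach") and not caps.get("supports_multiattach"):
        raise MultiattachNotSupportedByVirtDriver(volume["id"])
    return True
```

Under this model the attach only succeeds for an ironic instance when the volume's `multiattach` flag is false, exactly as the reporter observes.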