[Yahoo-eng-team] [Bug 1981814] [NEW] swap_volume: possible I/O error or user data loss if the task fails

2022-07-15 Thread Simon Li
Public bug reported:

Description
===
swap_volume is a general and important operation for instances with in-use 
volumes. The whole process consists of 3 steps in nova:
* first: connect the new volume to the libvirt guest (the instance is still 
using the old volume);
* second: copy or rebase the old volume's data onto the new volume (the 
instance is still using the old volume);
* third: update the volume states in cinder and the block_device_mapping in 
nova (the instance is now using the new volume).
But the exception handling is too simple: a rollback is executed if an 
exception happens in any step, ignoring which volume the guest is actually using.
The rollback disconnects the new volume and deletes the new attachment.

Clearly, if an exception is raised in the third step, we can't roll back and 
should instead continue to complete the task if the exception is not fatal. 
Otherwise an Input/Output error occurs when the user reads or writes the disk, 
and user data may be lost if it was written to the new volume and then rolled 
back, as sketched below.
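
To make the ordering problem concrete, here is a rough Python sketch of the 
three-step flow and of the proposed handling; every helper name here is a 
hypothetical stub, not an actual nova internal:

  # Sketch of the swap_volume flow; all helpers are illustrative stubs.
  def connect_new_volume(guest, vol): print('attach', vol, 'to', guest)
  def disconnect_volume(guest, vol): print('detach', vol, 'from', guest)
  def delete_attachment(vol): print('delete attachment of', vol)
  def copy_or_rebase(guest, old, new): print('rebase', old, '->', new)
  def finalize(old, new): print('update cinder states and nova BDM')

  def swap_volume(guest, old_vol, new_vol):
      connect_new_volume(guest, new_vol)  # step 1: guest still on old_vol
      try:
          copy_or_rebase(guest, old_vol, new_vol)  # step 2: guest moves to new_vol
      except Exception:
          # Safe to roll back: the guest never switched to new_vol.
          disconnect_volume(guest, new_vol)
          delete_attachment(new_vol)
          raise
      try:
          finalize(old_vol, new_vol)  # step 3: cinder states + block_device_mapping
      except Exception:
          # NOT safe to roll back: the guest is already writing to new_vol.
          # Disconnecting it now causes I/O errors and loses data written to
          # new_vol. If the error is not fatal, retry/complete instead.
          pass  # e.g. re-queue finalize(); never disconnect new_vol here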


Steps to reproduce
==
1. create an instance and attach an available volume to it:
  $ openstack server create my-vm --flavor m1.medium --image <image> --network <network>
  $ openstack volume create my-vol --type <src-type> --size 100
  $ openstack server add volume my-vm my-vol
2. inside my-vm, make a file system on /dev/vdc, mount it, then read and write it:
  $ mkfs.ext4 /dev/vdc
  $ mount /dev/vdc /mnt
  $ touch /mnt/test
  $ fio -rw=randrw -ioengine=libaio -bs=4K -size=20G -filename=/mnt/test ...
3. retype the volume:
  $ openstack volume set my-vol --type <new-type> --retype-policy on-demand
4. some accident causes nova to fail to disconnect the old volume in the third 
  step, after the second step has finished successfully, so the task finally fails.
5. fio can no longer read or write the file /mnt/test.

Expected result
===
After the exception happens in step 4, the disk should still read and write normally.

Actual result
=
As in step 5, the user can't read or write the disk.

Environment
===
1. nova version: 22.0.1

2. hypervisor: Libvirt+Qemu

3. storage: Ceph, FC SAN, LVM

4. network: Neutron + OVS

Logs & Configs
==

** Affects: nova
 Importance: Undecided
 Status: Confirmed

** Changed in: nova
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1981814

Title:
  swap_volume: possible I/O error or user data loss if the task fails

Status in OpenStack Compute (nova):
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1981814/+subscriptions



[Yahoo-eng-team] [Bug 1949051] [NEW] nova-compute service running IronicDriver may leak memory

2021-10-28 Thread Simon Li
Public bug reported:

Description
===
We run the nova-compute service with IronicDriver in a k8s cluster as a 
StatefulSet pod, with a 1 GiB memory limit and only this service in the pod.
There are about 40 nodes in our test environment; most of them have instances 
and are in the active provision state.
Some nodes fail to connect to IPMI, so their power status cannot be obtained.
Within about 12 hours, the memory limit is exceeded and the pod is restarted.

Steps to reproduce
==
Nothing needs to be done. Note the following:
1. The more nodes there are, the faster the memory grows and the sooner the 
memory limit is exceeded.
2. Even with only one node the memory limit is eventually exceeded, just over 
a longer period.
3. In our environment, memory grows roughly every 10 minutes, so we suspect a 
periodic task is the cause, maybe the `_sync_power_states` task (see the 
diagnostic sketch below).
4. I am not sure whether the IPMI connection failures have any impact.
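
To check the periodic-task suspicion, a heap-growth comparison with Python's 
standard tracemalloc module may help; this is a generic diagnostic sketch (the 
10-minute interval is just our observed value), not something nova provides:

  # Diagnostic sketch: snapshot the heap, wait a couple of suspected
  # periodic-task cycles, then print the allocation sites that grew most.
  import time
  import tracemalloc

  tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
  baseline = tracemalloc.take_snapshot()

  time.sleep(2 * 600)  # wait ~2 of the observed 10-minute growth cycles

  current = tracemalloc.take_snapshot()
  for stat in current.compare_to(baseline, 'lineno')[:10]:
      print(stat)  # a leaking periodic task should dominate this top-10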


Expected result
===
The pod's memory usage should be stable when we are not performing operations 
on nodes/instances.

Actual result
=
Memory keeps increasing until the limit is exceeded and the pod is restarted.

Environment
===
openstack version
   - nova: 22.0.1
   - ironic: 16.0.1

Logs & Configs
==

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1949051

Title:
  nova-compute service running IronicDriver may leak memory

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1949051/+subscriptions




[Yahoo-eng-team] [Bug 1938400] [NEW] compute service for ironic instances is not highly available

2021-07-29 Thread Simon Li
Public bug reported:

Description
===
An ironic instance can no longer be managed by nova if its own compute service 
is down, even when the environment has multiple compute hosts with up services.

Steps to reproduce
===
1. deploy multiple compute services with IronicDriver; call them svc1, svc2, svc3;
2. enroll a baremetal node in ironic; the node registers a hypervisor in nova, 
and assume that hypervisor's host is svc1;
3. create a running baremetal instance on the baremetal node using nova compute;
4. at this point, we can manage this ironic instance with nova, e.g. power it 
on/off;
5. bring svc1's compute service down;
6. now we can't show the hypervisor info of this node and can't power this 
instance on/off.

Expected result
===
We have 3 compute services; while svc1 is down, the others should be able to 
manage this instance.

Actual result
===
We can't do anything with this instance.

Environment
===
all versions of nova.

Other
===
This is because the IronicDriver follows the libvirt logic, but an ironic 
compute service only has a management duty; it does not need to create the 
instance locally the way libvirt does for a virtual machine.
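
As a simplified illustration of the resulting routing problem (a toy model, 
not actual nova code):

  # Toy model: compute actions are routed to the single service recorded in
  # instance.host, with no failover to other up ironic compute services.
  from dataclasses import dataclass

  @dataclass
  class Instance:
      uuid: str
      host: str  # owning compute service, fixed at create time, e.g. 'svc1'

  def route_power_action(instance: Instance, up_services: set) -> str:
      if instance.host not in up_services:
          # nova does not retarget svc2/svc3, even though any ironic compute
          # service could manage the node through the ironic API.
          raise RuntimeError('compute service %s is down' % instance.host)
      return instance.host  # the RPC target is always the owning host

  # route_power_action(Instance('uuid-1', 'svc1'), {'svc2', 'svc3'})
  # -> RuntimeError, matching steps 5-6 above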

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1938400

Title:
  compute service for ironic instances is not highly available

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1938400/+subscriptions




[Yahoo-eng-team] [Bug 1923281] [NEW] failed to attach a volume with multiattach to an ironic instance

2021-04-09 Thread Simon Li
Public bug reported:

Description
It is not supported to attach a volume that has `multiattach` set to an ironic 
instance.
It works if the volume has `multiattach` set to `false`.
Additionally, the storage back end for the volume does have the `multiattach` 
property.

Steps to reproduce:
* attach a multiattach volume to an ironic instance:
`openstack --os-compute-api-version 2.60 server add volume <server> <volume>`

Expected result
The volume is attached to the server at /dev/*.

Actual result
Volume 3e9b3371-d56d-4b16-a180-00b835993662 has 'multiattach' set, which is 
not supported for this instance. (HTTP 409)

Environment
* nova: 18.2.4
* cinder: 13.0.8
* ironic: 11.1.3
* storage type: G2 Series block storage of Inspur Inc.
* the volume is in the available state.
* the instance is active and powered on.
* the baremetal volume connector has been created and is confirmed to be available.

More
* Logs of nova-compute for ironic:
ERROR oslo_messaging.rpc.server [req-17d65e1f-0db3-442c-87eb-57d94ffa6940 
421c1e16837b4189b9a6ae04ba4af86b 6e3d2c325bc94a5e8dbb41a7a73ae593 - default 
default] Exception during message handling: 
nova.exception.MultiattachNotSupportedByVirtDriver: Volume 
3e9b3371-d56d-4b16-a180-00b835993662 has 'multiattach' set, which is not 
supported for this instance.
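
The exception suggests the attach is rejected by a per-virt-driver capability 
check before the request ever reaches ironic. A simplified sketch of that kind 
of gate (names inferred from the exception message, not verified nova internals):

  # Hypothetical capability gate: reject multiattach volumes when the virt
  # driver does not advertise support, regardless of back-end capability.
  class MultiattachNotSupportedByVirtDriver(Exception):
      pass

  def check_multiattach(volume: dict, driver_capabilities: dict):
      if volume.get('multiattach') and not driver_capabilities.get(
              'supports_multiattach', False):
          raise MultiattachNotSupportedByVirtDriver(
              "Volume %s has 'multiattach' set, which is not supported "
              "for this instance." % volume['id'])

  # check_multiattach({'id': '3e9b3371-...', 'multiattach': True}, {})
  # -> raises, which the API surfaces as HTTP 409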

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1923281

Title:
  failed to attach a volume with multiattach to an ironic instance

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1923281/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp