[Yahoo-eng-team] [Bug 1813223] [NEW] "Requested operation is not valid: domain is not running" (REBUILD)

2019-01-24 Thread Wallace Cardoso
Public bug reported:

I'm not sure whether these steps make sense in practice, but I was able
to perform them on the system, and I got an error.

DevStack branch=stable/rocky.

1.  ./unstack.sh && ./clean.sh && ./stack.sh
2.  source openrc admin admin
3.  openstack flavor create --ram 21 --disk 0 --vcpus 1 custom
4.  openstack server create --flavor custom --image cirros-0.3.5-x86_64-disk test
5.  openstack server show test

+-------------------------------------+-----------------------------------------------------------------+
| Field                               | Value                                                           |
+-------------------------------------+-----------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                          |
| OS-EXT-AZ:availability_zone         | nova                                                            |
| OS-EXT-SRV-ATTR:host                | wallacec-ubuntu                                                 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | wallacec-ubuntu                                                 |
| OS-EXT-SRV-ATTR:instance_name       | instance-0004                                                   |
| OS-EXT-STS:power_state              | Paused                                                          |
| OS-EXT-STS:task_state               | None                                                            |
| OS-EXT-STS:vm_state                 | active                                                          |
| OS-SRV-USG:launched_at              | 2019-01-24T23:13:07.00                                          |
| OS-SRV-USG:terminated_at            | None                                                            |
| accessIPv4                          |                                                                 |
| accessIPv6                          |                                                                 |
| addresses                           | public=2001:db8::d, 192.168.1.228                               |
| config_drive                        |                                                                 |
| created                             | 2019-01-24T23:13:01Z                                            |
| flavor                              | custom (ac9f385c-efaa-4b93-acec-8184beb53ca3)                   |
| hostId                              | d99cb6d42c024008ba7f954f95a59d73313aebf95098e30ccb7f10f0        |
| id                                  | e7825018-5fd7-4377-a6c1-cf36c269d849                            |
| image                               | cirros-0.3.5-x86_64-disk (3739ba2a-34ab-4bcd-8fd3-70a186131e54) |
| key_name                            | None                                                            |
| name                                | test                                                            |
| progress                            | 0                                                               |
| project_id                          | 6a0880f1c0b946acb71d61af9a92900b                                |
| properties                          |                                                                 |
| security_groups                     | name='default'                                                  |
| status                              | ACTIVE                                                          |
| updated                             | 2019-01-24T23:13:08Z                                            |
| user_id                             | 7c9be80e945f4333ad34d11f64643f51                                |
| volumes_attached                    |                                                                 |
+-------------------------------------+-----------------------------------------------------------------+


6.  openstack server rebuild test
7.  openstack server list

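+--------------------------------------+------+--------+-----------------------------------+--------------------------+--------+
| ID                                   | Name | Status | Networks                          | Image                    | Flavor |
+--------------------------------------+------+--------+-----------------------------------+--------------------------+--------+
| e7825018-5fd7-4377-a6c1-cf36c269d849 | test | ERROR  | public=2001:db8::d, 192.168.1.228 | cirros-0.3.5-x86_64-disk | custom |
+--------------------------------------+------+--------+-----------------------------------+--------------------------+--------+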
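
Note that the `server show` output reports `status` ACTIVE while `OS-EXT-STS:power_state` is Paused. A quick consistency check of those two fields can flag such an instance before a rebuild is attempted. This is a minimal sketch, assuming the output of `openstack server show -f json` has been loaded into a dict; the helper name `state_mismatch` is hypothetical and not part of any OpenStack client.

```python
def state_mismatch(server):
    """Return True when the API status and the hypervisor power state disagree.

    `server` is a dict keyed like the `openstack server show` fields.
    """
    status = server.get("status", "").upper()
    power_state = server.get("OS-EXT-STS:power_state", "")
    # An ACTIVE server is expected to report a Running power state.
    return status == "ACTIVE" and power_state.lower() != "running"


# Values taken from the report: status ACTIVE, power state Paused.
server = {"name": "test", "status": "ACTIVE", "OS-EXT-STS:power_state": "Paused"}
print(state_mismatch(server))
```

Whether this mismatch is what triggers the rebuild error is not confirmed by the report; the check only makes the inconsistency visible.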

Logs:

Jan 24 21:15:03 wallacec-ubuntu nova-compute[24652]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
Jan 24 21:15:03 wallacec-ubuntu nova-compute[24652]: ERROR oslo_messaging.rpc.server

[Yahoo-eng-team] [Bug 1804205] [NEW] libvirtError: operation failed: domain 'instance-00000003' already exists with uuid e017717e-9647-4740-8efa-4ec2aa25f35c

2018-11-20 Thread Wallace Cardoso
Public bug reported:

Description
===========
While testing the Nova component, I found that instances (servers) often
failed to be created because of the following error:

libvirtError: operation failed: domain 'instance-00000003' already
exists with uuid e017717e-9647-4740-8efa-4ec2aa25f35c

I don't understand how that is possible. Is there a way to work around
it? Can OpenStack resolve the conflict before creating an instance?

Note that I ran the unstack script before I started creating new
instances, so it is a leftover instance from an earlier run of the
stack script.
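
One possible explanation for the name clash, stated as an assumption rather than a confirmed diagnosis: Nova derives the libvirt domain name from the instance's database id through the `instance_name_template` option, whose default is `instance-%08x`. When the database is recreated by a fresh stack run the ids start over, so a new instance can receive the same name as a domain left behind by the previous run. The sketch below shows the name derivation and pulls the domain name and UUID out of the error text; `domain_name` and `parse_domain_conflict` are hypothetical helpers, not Nova functions.

```python
import re

# Assumed default of Nova's `instance_name_template` option.
INSTANCE_NAME_TEMPLATE = "instance-%08x"

CONFLICT_RE = re.compile(
    r"domain '(?P<name>[^']+)' already exists with uuid (?P<uuid>[0-9a-f-]{36})"
)


def domain_name(instance_id):
    """Build the libvirt domain name for a database instance id."""
    return INSTANCE_NAME_TEMPLATE % instance_id


def parse_domain_conflict(message):
    """Return (name, uuid) from a libvirt 'already exists' error, or None."""
    match = CONFLICT_RE.search(message)
    return (match.group("name"), match.group("uuid")) if match else None


error = ("libvirtError: operation failed: domain 'instance-00000003' "
         "already exists with uuid e017717e-9647-4740-8efa-4ec2aa25f35c")
print(domain_name(3))
print(parse_domain_conflict(error))
```

Leftover domains, if any, should be visible with `virsh list --all` on the compute host.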

Steps to Reproduce
==================
I don't know.

Expected Result
===============
Not sure, but I think the system needs to be prepared for this
situation. The worst case is a scenario where no instances can be created
because a whole range of names is already in use.

Actual Result
=============
Many instances are set to the error state because a domain with the same
name already exists.

Environment
===========
Devstack/Queens/Stable.
Ubuntu 16.04.

Logs & Configs
==============
Attached.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1804205

Title:
  libvirtError: operation failed: domain 'instance-00000003' already
  exists with uuid e017717e-9647-4740-8efa-4ec2aa25f35c

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1804205/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1801733] [NEW] nova-compute consuming 100% of cpu after rebuilding with invalid data parameters

2018-11-05 Thread Wallace Cardoso
Public bug reported:

Description
===========
The conductor API call 'rebuild_instance' is vulnerable through the parameter
'rebuild_instance/args/instance/nova_object.data/flavor/nova_object.data/vcpus'.
When that parameter is set to an invalid number of vcpus in the instance's
flavor, nova-compute consumes 100% of the CPU indefinitely and the instance
never moves from rebuild to active (or error). In addition, new requests to
the compute service are not processed, so the node is out of service until
the service is restarted. This bug could possibly be exploited for a
denial-of-service attack.

Steps to reproduce
==================
1) create an instance with the flavor (VCPUS: 1, MEM: 64MB, STORAGE: 0GB) and
the cirros image 0.3.4;
2) rebuild the instance with an alternative cirros image 0.4.0;
2.1) intercept the message to the 'conductor' API (ComputeTaskAPI) for the
method 'rebuild_instance', and change the parameter
'rebuild_instance/args/instance/nova_object.data/flavor/nova_object.data/vcpus'
to 101;
3) rebuild the instance again with its original image (cirros 0.3.4);
4) shelve the instance;
5) delete the instance.
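
Step 2.1 can be sketched as a small helper that rewrites one field of the intercepted message. The sketch assumes the message has already been decoded into nested dicts and that the `/`-separated parameter path maps directly onto dict keys. The helper `inject_fault` and the sample message are illustrative only; how the RPC message is intercepted is not shown.

```python
import copy


def inject_fault(message, path, value):
    """Return a copy of `message` with the field at `path` set to `value`."""
    mutated = copy.deepcopy(message)
    *parents, leaf = path.split("/")
    node = mutated
    for key in parents:
        node = node[key]  # raises KeyError if the path does not exist
    node[leaf] = value
    return mutated


# Hypothetical, heavily trimmed message holding only the fields on the path.
message = {
    "rebuild_instance": {
        "args": {
            "instance": {
                "nova_object.data": {
                    "flavor": {"nova_object.data": {"vcpus": 1}}
                }
            }
        }
    }
}
PATH = ("rebuild_instance/args/instance/nova_object.data/"
        "flavor/nova_object.data/vcpus")

faulty = inject_fault(message, PATH, 101)
flavor = faulty["rebuild_instance"]["args"]["instance"]["nova_object.data"]["flavor"]
print(flavor["nova_object.data"]["vcpus"])
```

Working on a deep copy keeps the original message available for comparison with the mutated one.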

Expected result
===============
Even though rebuild does not take the flavor into account, something should
validate the other parameters. The compute node should not stop working
because of an invalid parameter.
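
The kind of check asked for here might look like the sketch below. It is not Nova code: `validate_vcpus` and the upper bound `MAX_VCPUS` are hypothetical, and a real limit would depend on the hypervisor and the host.

```python
MAX_VCPUS = 64  # hypothetical upper bound, chosen only for illustration


def validate_vcpus(value, maximum=MAX_VCPUS):
    """Return `value` if it is a sane vCPU count, otherwise raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("vcpus must be an integer, got %r" % (value,))
    if not 1 <= value <= maximum:
        raise ValueError("vcpus must be between 1 and %d, got %d" % (maximum, value))
    return value


for candidate in (1, 101, 0):
    try:
        print(candidate, "accepted:", validate_vcpus(candidate))
    except ValueError as exc:
        print(candidate, "rejected:", exc)
```

Rejecting the value early would let the request fail with an error instead of reaching the compute node.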

Actual result
=============
The instance does not change from rebuild to active and stays in the
rebuilding state forever, and the compute node remains inoperative until the
services are restarted. 'nova-compute' consumes 100% of the CPU.

Environment
===========
I used devstack/stable/queens on a fresh Ubuntu environment.

Logs & Configs
==============
Logs attached.
The fault is injected after 11:24:16.
If you search for '101', you will see the line below:
Nov  5 11:24:21 localhost nova-compute[14517]: DEBUG nova.virt.hardware [None req-f97def42-9630-4165-81e5-abc0cab5c02f admin admin] Build topologies for 101 vcpu(s) 65536:65536:65536 {{(pid=14517) _get_possible_cpu_topologies /opt/stack/queens/dest/nova/nova/virt/hardware.py:418}}
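
For orientation, the log line refers to enumerating CPU topologies. The sketch below is a simplified illustration, not Nova's implementation: it assumes the three `65536` values are the maximum sockets, cores, and threads, and it lists every (sockets, cores, threads) triple whose product equals the vCPU count. The function name `possible_topologies` is made up for this example.

```python
from itertools import product


def possible_topologies(vcpus, max_sockets=65536, max_cores=65536, max_threads=65536):
    """List (sockets, cores, threads) triples whose product equals `vcpus`."""
    # No dimension can exceed the vCPU count, so cap each range by it.
    limits = [min(vcpus, m) for m in (max_sockets, max_cores, max_threads)]
    return [
        (s, c, t)
        for s, c, t in product(*(range(1, n + 1) for n in limits))
        if s * c * t == vcpus
    ]


print(possible_topologies(4))
```

In this sketch the search visits at most vcpus**3 combinations, so it terminates for 101 vCPUs; it only illustrates what the logged step computes and does not by itself explain the CPU spin described in the report.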

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: fault-injection

** Attachment added: "Syslog"
   
https://bugs.launchpad.net/bugs/1801733/+attachment/5209338/+files/newbug.sys.logs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1801733

Title:
  nova-compute consuming 100% of cpu after rebuilding with invalid data
  parameters

Status in OpenStack Compute (nova):
  New


[Yahoo-eng-team] [Bug 1800508] [NEW] Missing exception handling mechanism in 'schedule_and_build_instances' for DBError at line 1180

2018-10-29 Thread Wallace Cardoso
Public bug reported:

Description
===========
If an error occurs during instance creation, the user cannot tell what
happened to the VM, which stays in the building state forever. Normally, when
the VM creation workflow is interrupted by an exception in the method
schedule_and_build_instances, the VM should end up in 'error' state.

Steps to reproduce
==================
1) Create a VM;
2) Inject an out-of-range value in
"schedule_and_build_instances.args.build_requests->'nova_object.data'.instance.'nova_object.data'.instance_type_id";
this is enough to cause a DBError. For example, the value 1E+22 can be used.
3) An exception is thrown, but there seems to be no appropriate handling
when this DBError happens.
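
The handling that the report finds missing could follow the pattern sketched below. Everything here is a stand-in: `DBError` mimics a database-layer exception, `build_instance` is a hypothetical function that fails on an out-of-range `instance_type_id`, the 32-bit range is an assumption, and the instance is a plain dict. The point is only that the failure path moves the VM to 'error' instead of leaving it in the building state.

```python
class DBError(Exception):
    """Stand-in for a database-layer exception."""


INT32_MAX = 2**31 - 1  # assumed column range, for illustration only


def build_instance(instance):
    """Pretend to persist the instance; fail on an out-of-range type id."""
    if not 0 <= instance["instance_type_id"] <= INT32_MAX:
        raise DBError("instance_type_id out of range")
    instance["vm_state"] = "active"


def schedule_and_build(instance):
    """Build the instance and record an error state if the database call fails."""
    instance["vm_state"] = "building"
    try:
        build_instance(instance)
    except DBError as exc:
        # Surface the failure instead of leaving the VM in 'building' forever.
        instance["vm_state"] = "error"
        instance["fault"] = str(exc)
    return instance


print(schedule_and_build({"instance_type_id": 1e22}))
```

Storing the exception text alongside the state gives the user a hint about the cause without searching the logs.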

Expected result
===============
The VM is put in 'error' state.

Actual result
=============
The VM stays in 'build' state indefinitely, and the user will never know
(without searching the logs) what happened to the VM.

Environment
===========
Devstack/Stable/Queens.

Logs & Configs
==============
Logs attached.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1800508

Title:
  Missing exception handling mechanism in 'schedule_and_build_instances'
  for DBError at line 1180

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1800508/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1800204] [NEW] n-cpu.service consuming 100% of CPU indeterminately

2018-10-26 Thread Wallace Cardoso
Public bug reported:

Description
===========
I used fault injection to assess the robustness of nova-conductor, and by
injecting a specific sequence of faults I observed a failure that can threaten
the robustness of the system. Applying these faults at the nova-conductor
interface prevents nova-compute from provisioning new instances.

Steps to reproduce
==================
I reproduced this bug in 10 out of 10 attempts. I used devstack/queens.

The workload consists of the following steps:
1) First, create a VM with the following flavor: 64MB RAM, 1 VCPU, 0 DISK, and
the reference image (for instance, 'cirros.0.3.4'); all other settings can be
the defaults of the admin account;
2) Rebuild with an alternative image: for instance, 'cirros 0.4.0';
3) Rebuild with the reference image again;
4) Shelve the instance;
5) Delete the instance.

Below, I describe the faultload. Each time a fault is injected, the
workload is executed from the beginning. The steps are:
1) Intercept the first RPC message (i.e. AMQP) that calls
'schedule_and_build_instances';
2) Inject the 'fault' in
'schedule_and_build_instances.args.build_requests->'nova_object.data'.instance.'nova_object.data'.flavor.'nova_object.data'.vcpus'

The pseudo-algorithm:
1. execute workload
2. for each fault in ['2', '-101', '100']
2.1.   execute workload in parallel with faultload(fault)
3. see the CPU activity for the process n-cpu.service of devstack
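
The pseudo-algorithm can be sketched in Python as follows, assuming `execute_workload` and `faultload` are callables supplied by the test harness (both are placeholders here) and that "in parallel" means running them in separate threads. Checking the CPU activity of n-cpu.service is left as a manual step.

```python
import threading

FAULTS = ["2", "-101", "100"]


def run_campaign(execute_workload, faultload, faults=FAULTS):
    """Run one clean workload, then one workload per fault with the fault injected."""
    execute_workload()  # step 1: baseline run without faults
    for fault in faults:  # step 2
        injector = threading.Thread(target=faultload, args=(fault,))
        worker = threading.Thread(target=execute_workload)
        injector.start()  # step 2.1: arm the injector, then start the workload
        worker.start()
        injector.join()
        worker.join()


# Placeholder callables that only record what was called.
calls = []
run_campaign(lambda: calls.append("workload"),
             lambda fault: calls.append("fault " + fault))
print(calls)
```

Joining both threads before the next iteration keeps the faults from overlapping, so each workload run sees exactly one injected value.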

Expected result
===============
nova-compute handles the faults without affecting future requests.

Actual result
=============
nova-compute consumes 100% of the CPU and new instances are set to 'error'
state without any clue about the cause, so it is not possible to create new
instances without restarting n-cpu.service.

Environment
===========
Devstack/Queens in Single Machine with defaults.

Logs & Configs
==============
Logs attached.

** Affects: nova
 Importance: Undecided
 Status: New

** Attachment added: "Logs from before to after applying the tests"
   
https://bugs.launchpad.net/bugs/1800204/+attachment/5205956/+files/sys-100p-now.logs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1800204

Title:
  n-cpu.service consuming 100% of CPU indeterminately

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1800204/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp