[Yahoo-eng-team] [Bug 1813223] [NEW] "Requested operation is not valid: domain is not running" (REBUILD)
Public bug reported:

I'm not sure whether these steps are valid in practice, but I was able to perform them on the system, and I got an error.

DevStack branch=stable/rocky.

1. ./unstack.sh && ./clean.sh && ./stack.sh
2. source openrc admin admin
3. openstack flavor create --ram 21 --disk 0 --vcpus 1 custom
4. openstack server create --flavor custom --image cirros-0.3.5-x86_64-disk test
5. openstack server show test

+-------------------------------------+-----------------------------------------------------------------+
| Field                               | Value                                                           |
+-------------------------------------+-----------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                          |
| OS-EXT-AZ:availability_zone         | nova                                                            |
| OS-EXT-SRV-ATTR:host                | wallacec-ubuntu                                                 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | wallacec-ubuntu                                                 |
| OS-EXT-SRV-ATTR:instance_name       | instance-0004                                                   |
| OS-EXT-STS:power_state              | Paused                                                          |
| OS-EXT-STS:task_state               | None                                                            |
| OS-EXT-STS:vm_state                 | active                                                          |
| OS-SRV-USG:launched_at              | 2019-01-24T23:13:07.00                                          |
| OS-SRV-USG:terminated_at            | None                                                            |
| accessIPv4                          |                                                                 |
| accessIPv6                          |                                                                 |
| addresses                           | public=2001:db8::d, 192.168.1.228                               |
| config_drive                        |                                                                 |
| created                             | 2019-01-24T23:13:01Z                                            |
| flavor                              | custom (ac9f385c-efaa-4b93-acec-8184beb53ca3)                   |
| hostId                              | d99cb6d42c024008ba7f954f95a59d73313aebf95098e30ccb7f10f0        |
| id                                  | e7825018-5fd7-4377-a6c1-cf36c269d849                            |
| image                               | cirros-0.3.5-x86_64-disk (3739ba2a-34ab-4bcd-8fd3-70a186131e54) |
| key_name                            | None                                                            |
| name                                | test                                                            |
| progress                            | 0                                                               |
| project_id                          | 6a0880f1c0b946acb71d61af9a92900b                                |
| properties                          |                                                                 |
| security_groups                     | name='default'                                                  |
| status                              | ACTIVE                                                          |
| updated                             | 2019-01-24T23:13:08Z                                            |
| user_id                             | 7c9be80e945f4333ad34d11f64643f51                                |
| volumes_attached                    |                                                                 |
+-------------------------------------+-----------------------------------------------------------------+

6. openstack server rebuild test
7. openstack server list

+--------------------------------------+------+--------+-----------------------------------+--------------------------+--------+
| ID                                   | Name | Status | Networks                          | Image                    | Flavor |
+--------------------------------------+------+--------+-----------------------------------+--------------------------+--------+
| e7825018-5fd7-4377-a6c1-cf36c269d849 | test | ERROR  | public=2001:db8::d, 192.168.1.228 | cirros-0.3.5-x86_64-disk | custom |
+--------------------------------------+------+--------+-----------------------------------+--------------------------+--------+

Logs:

Jan 24 21:15:03 wallacec-ubuntu nova-compute[24652]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
Jan 24 21:15:03 wallacec-ubuntu nova-compute[24652]: ERROR oslo_messaging.rpc.server
[Yahoo-eng-team] [Bug 1804205] [NEW] libvirtError: operation failed: domain 'instance-00000003' already exists with uuid e017717e-9647-4740-8efa-4ec2aa25f35c
Public bug reported:

Description
===========
While testing the Nova component, I found that instances (servers) frequently failed to be created because of the following error:

libvirtError: operation failed: domain 'instance-00000003' already exists with uuid e017717e-9647-4740-8efa-4ec2aa25f35c

I don't understand how that is possible. Is there a way of circumventing it? Can OpenStack resolve it before creating an instance?

It is worth noting that I ran the unstack script before I started creating new instances, so this is a residual instance from an earlier run of the stack script.

Steps to Reproduce
==================
I don't know.

Expected Result
===============
Not sure, but I think the system needs to be prepared for this situation. The worst case is a scenario in which no instances can be created because a whole range of names is already in use.

Actual Result
=============
Many instances are set to the error state because a domain with the same name already exists.

Environment
===========
Devstack/Queens/Stable. Ubuntu 16.04.

Logs & Configs
==============
Attached.

** Affects: nova
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1804205

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1804205/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
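One way to check for the leftover-domain situation described in this report is to compare the domain names libvirt still knows about with the instance names the fresh deployment is aware of. The sketch below is only an illustration, not something Nova does itself: the function names and the sample listing are hypothetical, and the listing merely imitates the layout of `virsh list --all` output.

```python
def parse_virsh_list(output):
    """Return domain names from text laid out like `virsh list --all` output."""
    names = []
    for line in output.splitlines():
        parts = line.split()
        # Skip the header row, the dashed separator, and blank lines.
        if len(parts) < 3 or parts[0] == "Id":
            continue
        names.append(parts[1])
    return names


def find_stale_domains(virsh_output, known_instance_names):
    """Domains defined in libvirt that the current deployment does not know."""
    known = set(known_instance_names)
    return [n for n in parse_virsh_list(virsh_output) if n not in known]


# Hypothetical listing: one leftover domain and one domain still in use.
SAMPLE = """\
 Id    Name                           State
----------------------------------------------------
 -     instance-00000003              shut off
 2     instance-00000007              running
"""

stale = find_stale_domains(SAMPLE, ["instance-00000007"])
print(stale)
```

Domains reported as stale could then be inspected by hand and, if they really are leftovers, removed with `virsh undefine <name>` before new instances are created.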
[Yahoo-eng-team] [Bug 1801733] [NEW] nova-compute consuming 100% of cpu after rebuilding with invalid data parameters
Public bug reported:

Description
===========
The 'conductor-api' method 'rebuild_instance' has a vulnerability in the parameter 'rebuild_instance/args/instance/nova_object.data/flavor/nova_object.data/vcpus'. When this parameter is set to an invalid number of vcpus in the instance's flavor, the compute component consumes 100% of CPU indefinitely, and the instance never changes state from rebuild to active (or error). In addition, new requests to the compute component are not processed; that is, the node is out of service until it is restarted. This bug might be usable for a denial-of-service attack.

Steps to reproduce
==================
1) Create an instance with the flavor (VCPUS: 1, MEM: 64MB, STORAGE: 0GB) and the cirros image 0.3.4;
2) Rebuild the instance with an alternative cirros image, 0.4.0;
2.1) Intercept the message to the 'conductor' API (ComputeTaskAPI) for the method 'rebuild_instance', and change the parameter 'rebuild_instance/args/instance/nova_object.data/flavor/nova_object.data/vcpus' to 101;
3) Rebuild the instance again with its original image (cirros 0.3.4);
4) Shelve the instance;
5) Delete the instance.

Expected result
===============
Even though rebuild is not an action that takes the flavor into account, something should ensure the correctness of the other parameters. The compute node should not stop working because of an invalid parameter.

Actual result
=============
The instance does not change from rebuild to active and stays in the rebuilding state forever, and the compute node is inoperable until the services are restarted. 'nova-compute' consumes 100% of CPU.

Environment
===========
I used devstack/stable/queens on a fresh Ubuntu environment.

Logs & Configs
==============
Logs attached. The fault is injected after 11:24:16. If you search for '101', you will see the line below:

Nov 5 11:24:21 localhost nova-compute[14517]: DEBUG nova.virt.hardware [None req-f97def42-9630-4165-81e5-abc0cab5c02f admin admin] Build topologies for 101 vcpu(s) 65536:65536:65536 {{(pid=14517) _get_possible_cpu_topologies /opt/stack/queens/dest/nova/nova/virt/hardware.py:418}}

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: fault-injection

** Attachment added: "Syslog"
   https://bugs.launchpad.net/bugs/1801733/+attachment/5209338/+files/newbug.sys.logs

--
https://bugs.launchpad.net/bugs/1801733
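The kind of check this report asks for could look roughly like the sketch below: walk the serialized message down to the flavor's vcpus field and reject values outside an accepted range before any work starts. The key path follows the one quoted in the report (without the leading method name); the function names, the stand-in message, and the bound of 8 vcpus are assumptions made for illustration, not Nova code or a Nova limit.

```python
# Key path quoted in the report, relative to the message body.
VCPUS_PATH = "args/instance/nova_object.data/flavor/nova_object.data/vcpus"


def get_by_path(payload, path):
    """Follow a '/'-separated key path through nested dicts."""
    node = payload
    for key in path.split("/"):
        node = node[key]
    return node


def validate_vcpus(message, max_vcpus):
    """Raise ValueError unless the flavor's vcpus is an int in [1, max_vcpus]."""
    vcpus = get_by_path(message, VCPUS_PATH)
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(vcpus, bool) or not isinstance(vcpus, int):
        raise ValueError("vcpus must be an integer, got %r" % (vcpus,))
    if not 1 <= vcpus <= max_vcpus:
        raise ValueError("vcpus=%d outside 1..%d" % (vcpus, max_vcpus))
    return vcpus


def make_message(vcpus):
    """Build a minimal, hypothetical stand-in for a 'rebuild_instance' message."""
    return {"method": "rebuild_instance",
            "args": {"instance": {"nova_object.data": {
                "flavor": {"nova_object.data": {"vcpus": vcpus}}}}}}


print(validate_vcpus(make_message(1), max_vcpus=8))
try:
    validate_vcpus(make_message(101), max_vcpus=8)  # value injected in the report
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting the message up front is one design option; where such a check would belong in the real call chain is not something this sketch settles.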
[Yahoo-eng-team] [Bug 1800508] [NEW] Missing exception handling mechanism in 'schedule_and_build_instances' for DBError at line 1180
Public bug reported:

Description
===========
If an error occurs during instance creation, the user cannot tell what happened to the VM, which stays in the 'build' state forever. Normally, when the workflow for creating a VM is interrupted by an exception in the method schedule_and_build_instances, the VM should end up in the 'error' state.

Steps to reproduce
==================
1) Create a VM;
2) Inject an out-of-range value into "schedule_and_build_instances.args.build_requests->'nova_object.data'.instance.'nova_object.data'.instance_type_id"; this is enough to cause a DBError. For instance, the value 1E+22 can be used.
3) An exception is thrown, but there seems to be no appropriate handling when this DBError occurs.

Expected result
===============
The VM is put in the 'error' state.

Actual result
=============
The VM stays in the 'build' state indefinitely, and the user will never know (without searching the logs) what happened to the VM.

Environment
===========
Devstack/Stable/Queens.

Logs & Configs
==============
Logs attached.

** Affects: nova
     Importance: Undecided
         Status: New

--
https://bugs.launchpad.net/bugs/1800508

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1800508/+subscriptions
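The expected behaviour in this report can be sketched as a wrapper that catches the database failure and marks the instance as errored instead of leaving it in the 'build' state. Everything in the sketch is a stand-in: `DBError`, `FakeInstance`, `save_to_db`, and `build_with_error_state` are hypothetical names, and the 32-bit range check is an assumption used only to provoke the failure. It is not the Nova conductor's code or its actual fix.

```python
class DBError(Exception):
    """Stand-in for a database-layer failure."""


class FakeInstance:
    """Minimal instance record with only the fields the sketch needs."""

    def __init__(self, instance_type_id):
        self.instance_type_id = instance_type_id
        self.vm_state = "building"
        self.fault = None


def save_to_db(instance):
    """Pretend persistence step; assumes a signed 32-bit integer column."""
    value = instance.instance_type_id
    if not isinstance(value, int) or not -2**31 <= value < 2**31:
        raise DBError("instance_type_id out of range: %r" % (value,))


def build_with_error_state(instance):
    """Run the build step; on DBError, record the fault and set 'error'."""
    try:
        save_to_db(instance)
    except DBError as exc:
        instance.vm_state = "error"
        instance.fault = str(exc)
        return False
    instance.vm_state = "active"
    return True


ok = FakeInstance(1)
bad = FakeInstance(1e22)  # the out-of-range value used in the report
build_with_error_state(ok)
build_with_error_state(bad)
print(ok.vm_state, bad.vm_state)
```

Recording the fault text on the instance is what would let a user see the cause without searching the logs.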
[Yahoo-eng-team] [Bug 1800204] [NEW] n-cpu.service consuming 100% of CPU indeterminately
Public bug reported:

Description
===========
I used fault injection to assess the robustness of nova-conductor, and by injecting a specific sequence of faults I observed a failure that threatens the robustness of the system. Applying these faults at the nova-conductor interface prevents nova-compute from provisioning new instances.

Steps to reproduce
==================
I reproduced this bug in 10 out of 10 attempts (100%). I used devstack/queens. The workload consists of the following steps:

1) First, create a VM with the following flavor: 64MB RAM, 1 VCPU, 0 DISK; and the reference image, for instance 'cirros.0.3.4'; all other settings can be the defaults of the admin account;
2) Rebuild with an alternative image, for instance 'cirros 0.4.0';
3) Rebuild with the reference image again;
4) Shelve the instance;
5) Delete the instance.

Below, I describe the faultload. Each time a fault is injected, the workload is executed from the beginning. The steps are:

1) Intercept the first RPC message (i.e. AMQP) that calls 'schedule_and_build_instances';
2) Inject the fault into "schedule_and_build_instances.args.build_requests->'nova_object.data'.instance.'nova_object.data'.flavor.'nova_object.data'.vcpus".

The pseudo-algorithm:

1. execute workload
2. for each fault in ['2', '-101', '100']
   2.1. execute workload in parallel with faultload(fault)
3. observe the CPU activity of the n-cpu.service process of devstack

Expected result
===============
nova-compute handles the faults without affecting future requests.

Actual result
=============
nova-compute consumes 100% of CPU and new instances are set to the 'error' state without any clue about the cause, so it is not possible to create new instances without restarting n-cpu.service.

Environment
===========
Devstack/Queens on a single machine with defaults.

Logs & Configs
==============
Logs attached.

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "Logs from before to after applying the tests"
   https://bugs.launchpad.net/bugs/1800204/+attachment/5205956/+files/sys-100p-now.logs

--
https://bugs.launchpad.net/bugs/1800204

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1800204/+subscriptions
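The pseudo-algorithm in this report can be sketched as a small harness. The fault values and the injection path come from the report; `set_by_path`, `run_campaign`, the toy workload, and the choice of the first element of build_requests are assumptions made for illustration. Intercepting real AMQP messages and watching the CPU usage of n-cpu.service (step 3) are outside the scope of the sketch.

```python
import copy
import threading

FAULTS = ["2", "-101", "100"]  # fault values listed in the report
# Assumption: build_requests is a list and its first element is targeted.
VCPUS_PATH = ["args", "build_requests", 0, "nova_object.data", "instance",
              "nova_object.data", "flavor", "nova_object.data", "vcpus"]


def get_by_path(payload, path):
    """Follow a list of keys/indexes through nested dicts and lists."""
    node = payload
    for key in path:
        node = node[key]
    return node


def set_by_path(payload, path, value):
    """Return a deep copy of payload with the element at path replaced."""
    mutated = copy.deepcopy(payload)
    get_by_path(mutated, path[:-1])[path[-1]] = value
    return mutated


def _run(workload, message, outcome):
    outcome.append(workload(message))


def run_campaign(message, workload, faults=FAULTS, timeout=5.0):
    """Run the workload once cleanly, then once per injected fault."""
    results = [("baseline", workload(message))]
    for fault in faults:
        faulty = set_by_path(message, VCPUS_PATH, fault)
        outcome = []
        # A daemon thread stands in for running workload and faultload in parallel.
        worker = threading.Thread(target=_run, args=(workload, faulty, outcome),
                                  daemon=True)
        worker.start()
        worker.join(timeout)
        results.append((fault, outcome[0] if outcome else "no result"))
    return results


# Hypothetical stand-in for the intercepted message.
MESSAGE = {"method": "schedule_and_build_instances",
           "args": {"build_requests": [{"nova_object.data": {"instance": {
               "nova_object.data": {"flavor": {
                   "nova_object.data": {"vcpus": 1}}}}}}]}}

# Toy workload: report the vcpus value the message carries.
results = run_campaign(MESSAGE, lambda m: get_by_path(m, VCPUS_PATH))
print(results)
```

Because each fault is applied to a deep copy, the original message stays intact and every run starts from the same baseline, which matches the report's note that the workload is executed from the beginning for each fault.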