Public bug reported: Due to a mistake in the SDK patches, the wait_for_active loop checks a random node instead of the target node. This happens to work in the default job, which has only one node, but it fails in the multinode grenade job.
The failure [1] looks a bit cryptic:

Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall [-] Fixed interval looping call 'nova.virt.ironic.driver.IronicDriver._wait_for_active' failed: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall Traceback (most recent call last):
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 150, in _run_loop
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall   File "/opt/stack/nova/nova/virt/ironic/driver.py", line 542, in _wait_for_active
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall     raise exception.InstanceNotFound(instance_id=instance.uuid)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall
Sep 05 12:08:12.288474 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR nova.virt.ironic.driver [None req-fcb58bd7-803c-46ce-bc7e-3b0cfbac2a77 tempest-BaremetalBasicOps-911057241 tempest-BaremetalBasicOps-911057241] Error deploying instance 09957ef9-d006-4328-8342-03a712888996 on baremetal node 3040f681-72ff-4bcc-b0f6-129cfd146cf1.: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
This code path is usually hit when a node suddenly goes into AVAILABLE, which clearly shouldn't be the case here. The mystery was solved by looking at the ironic-api logs [2]:

[Thu Sep 5 12:08:11 2019] GET /baremetal/v1/nodes/detail => generated 41705 bytes in 350 msecs (HTTP/1.1 200) 7 headers in 286 bytes (1 switches on core 0)

This does not match the expected query: GET /baremetal/v1/nodes/?instance_uuid=<uuid>,fields=<blah-blah>. After looking at the code more carefully, I noticed that we're passing instance_id and fields inside a dict as a positional argument, whereas they are expected as regular keyword arguments. The dict ends up in the 'details' parameter, where it evaluates as True, so we're actually listing all nodes in detail and picking the first one.

[1] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-n-cpu.txt.gz
[2] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-ir-api.txt.gz

** Affects: nova
   Importance: Critical
   Assignee: Dmitry Tantsur (divius)
   Status: In Progress

** Tags: ironic
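The mix-up described above can be sketched in isolation. list_nodes below is a hypothetical stand-in, not the actual SDK function or signature; it only mirrors the relevant shape of the call (a 'details' flag followed by keyword filters):

```python
def list_nodes(details=False, **filters):
    """Return the query a call like this would issue (illustrative only)."""
    if details:
        # Truthy 'details' means a full listing of ALL nodes.
        return "GET /baremetal/v1/nodes/detail"
    params = ",".join("%s=%s" % (k, v) for k, v in sorted(filters.items()))
    return "GET /baremetal/v1/nodes/?%s" % params

# The buggy call: filters packed into a dict and passed positionally.
# The dict lands in 'details' and, being non-empty, is truthy, so every
# node is listed and the caller ends up inspecting the wrong one.
buggy = list_nodes({'instance_uuid': 'the-uuid', 'fields': 'provision_state'})

# The intended call: filters passed as real keyword arguments, producing
# a query scoped to the target instance.
fixed = list_nodes(instance_uuid='the-uuid', fields='provision_state')
```

With this shape, buggy produces the unscoped /nodes/detail query seen in the ironic-api log, while fixed produces a /nodes/?...instance_uuid=... query as expected.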
** Changed in: nova
   Assignee: (unassigned) => Dmitry Tantsur (divius)

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1843050

Title: Regression in the ironic driver: wait_for_active checks a random node

Status in OpenStack Compute (nova): In Progress
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1843050/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp