Public bug reported:

I've seen this a few times, but I'll start tracking it from here:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/console.html#_2017-02-16_22_25_43_645525

  tempest.api.compute.admin.test_live_migration.LiveBlockMigrationTestJSON.test_live_block_migration[id-1dce86b8-eb04-4c03-a9d8-9c1dc3ee0c7b]
  -------------------------------------------------------------------------------------------------------------------------------------------

  Captured traceback:
  ~~~~~~~~~~~~~~~~~~~
      Traceback (most recent call last):
        File "tempest/api/compute/admin/test_live_migration.py", line 122, in test_live_block_migration
          self._test_live_migration()
        File "tempest/api/compute/admin/test_live_migration.py", line 97, in _test_live_migration
          volume_backed=volume_backed)['id']
        File "tempest/api/compute/base.py", line 232, in create_test_server
          **kwargs)
        File "tempest/common/compute.py", line 182, in create_test_server
          server['id'])
        File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
          self.force_reraise()
        File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
          six.reraise(self.type_, self.value, self.tb)
        File "tempest/common/compute.py", line 164, in create_test_server
          clients.servers_client, server['id'], wait_until)
        File "tempest/common/waiters.py", line 96, in wait_for_server_status
          raise lib_exc.TimeoutException(message)
      tempest.lib.exceptions.TimeoutException: Request timed out
      Details: (LiveBlockMigrationTestJSON:test_live_block_migration) Server 0ee93807-d206-4ddf-878c-efd1dd2eab3c failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: BUILD. Current task state: spawning.

I was looking in the n-cpu logs on the subnode for 0ee93807-d206-4ddf-878c-efd1dd2eab3c and found that the last thing we see during the server create is here:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_143

  2017-02-16 22:22:11.143 14954 DEBUG nova.virt.libvirt.driver [req-025bf4d2-e5ba-4236-a334-f9eb98105ada tempest-LiveBlockMigrationTestJSON-793129345 tempest-LiveBlockMigrationTestJSON-793129345] [instance: 0ee93807-d206-4ddf-878c-efd1dd2eab3c] Instance is running spawn /opt/stack/new/nova/nova/virt/libvirt/driver.py:2689

That's after we've created the domain:

https://github.com/openstack/nova/blob/15.0.0.0rc1/nova/virt/libvirt/driver.py#L2689

After that, the driver is waiting for the power_state to go to RUNNING.
I see that shortly after that log message we get a libvirt event saying the instance is started:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_214

And then right after that it's paused:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_263

Looking in the QEMU log for that instance:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/libvirt/qemu/instance-00000001.txt.gz

I see this, which is odd:

  KVM: entry failed, hardware error 0x0

In the libvirtd logs I see:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/libvirt/libvirtd.txt.gz#_2017-02-16_22_22_10_879

  2017-02-16 22:22:10.879+0000: 8973: debug : qemuProcessLaunch:4876 : QEMU vm=0x7f68a40035a0 name=instance-00000001 running with pid=16410

And later:

  2017-02-16 22:22:11.139+0000: 8968: debug : qemuProcessHandleResume:765 : Transitioned guest instance-00000001 out of paused into resumed state
  ...
  2017-02-16 22:22:11.140+0000: 8968: debug : qemuProcessHandleStop:716 : Transitioned guest instance-00000001 to paused state

But I don't see any errors in the libvirt logs.
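When triaging recurrences of this, the quickest signal is that "KVM: entry failed" line in the per-instance QEMU log. A small helper for scanning a downloaded CI log, purely illustrative (the function name is made up here; it is not part of any OpenStack or libvirt tooling):

```python
def find_kvm_entry_failures(log_text):
    """Return the lines of a QEMU instance log that report a failed KVM entry.

    The marker string matches the "KVM: entry failed, hardware error 0x0"
    line quoted above; this is a sketch for grepping downloaded CI logs.
    """
    return [line for line in log_text.splitlines()
            if 'KVM: entry failed' in line]
```

Any non-empty result means the guest vCPU failed to enter, which matches the libvirt started-then-immediately-paused transition seen in the n-cpu log.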
Right around that time I see this in syslog:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/syslog.txt.gz#_Feb_16_22_22_11

  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: libvirt version: 1.3.1, package: 1ubuntu10.8 (Christian Ehrhardt <christian.ehrha...@canonical.com> Mon, 06 Feb 2017 14:30:46 +0100)
  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: hostname: ubuntu-xenial-2-node-ovh-bhs1-7345556-429938
  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: End of file while reading data: Input/output error

I guess the virtlogd I/O error is pretty normal, since it shows up all over the syslogs in our CI runs; maybe it's a result of the hardware error in the guest?

** Affects: nova
     Importance: High
         Status: Triaged

** Summary changed:

- Live migration tests sometimes timeout waiting for instance to be ACTIVE
+ Live migration tests sometimes timeout waiting for instance to be ACTIVE: "KVM: entry failed, hardware error 0x0"

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1665487

Title:
  Live migration tests sometimes timeout waiting for instance to be
  ACTIVE: "KVM: entry failed, hardware error 0x0"

Status in OpenStack Compute (nova):
  Triaged

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1665487/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp