Public bug reported:

I've seen this a few times, but I'll start tracking it from here:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/console.html#_2017-02-16_22_25_43_645525

  tempest.api.compute.admin.test_live_migration.LiveBlockMigrationTestJSON.test_live_block_migration[id-1dce86b8-eb04-4c03-a9d8-9c1dc3ee0c7b]
  -------------------------------------------------------------------------------------------------------------------------------------------

  Captured traceback:
  ~~~~~~~~~~~~~~~~~~~
      Traceback (most recent call last):
        File "tempest/api/compute/admin/test_live_migration.py", line 122, in test_live_block_migration
          self._test_live_migration()
        File "tempest/api/compute/admin/test_live_migration.py", line 97, in _test_live_migration
          volume_backed=volume_backed)['id']
        File "tempest/api/compute/base.py", line 232, in create_test_server
          **kwargs)
        File "tempest/common/compute.py", line 182, in create_test_server
          server['id'])
        File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
          self.force_reraise()
        File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
          six.reraise(self.type_, self.value, self.tb)
        File "tempest/common/compute.py", line 164, in create_test_server
          clients.servers_client, server['id'], wait_until)
        File "tempest/common/waiters.py", line 96, in wait_for_server_status
          raise lib_exc.TimeoutException(message)
      tempest.lib.exceptions.TimeoutException: Request timed out
      Details: (LiveBlockMigrationTestJSON:test_live_block_migration) Server 0ee93807-d206-4ddf-878c-efd1dd2eab3c failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: BUILD. Current task state: spawning.

I was looking in the n-cpu logs on the subnode for 0ee93807-d206-4ddf-878c-efd1dd2eab3c and found that the last thing we see during the server create is here:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_143

  2017-02-16 22:22:11.143 14954 DEBUG nova.virt.libvirt.driver [req-025bf4d2-e5ba-4236-a334-f9eb98105ada tempest-LiveBlockMigrationTestJSON-793129345 tempest-LiveBlockMigrationTestJSON-793129345] [instance: 0ee93807-d206-4ddf-878c-efd1dd2eab3c] Instance is running spawn /opt/stack/new/nova/nova/virt/libvirt/driver.py:2689

That's after we've created the domain:

https://github.com/openstack/nova/blob/15.0.0.0rc1/nova/virt/libvirt/driver.py#L2689

After that, the driver is waiting for the power_state to go to RUNNING.
I see that shortly after that log message we get a libvirt event saying the instance is started:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_214

And then right after that it's paused:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_263

Looking in the QEMU log for that instance:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/libvirt/qemu/instance-00000001.txt.gz

I see this, which is odd:

  KVM: entry failed, hardware error 0x0

In the libvirtd logs I see:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/libvirt/libvirtd.txt.gz#_2017-02-16_22_22_10_879

  2017-02-16 22:22:10.879+0000: 8973: debug : qemuProcessLaunch:4876 : QEMU vm=0x7f68a40035a0 name=instance-00000001 running with pid=16410

And later:

  2017-02-16 22:22:11.139+0000: 8968: debug : qemuProcessHandleResume:765 : Transitioned guest instance-00000001 out of paused into resumed state
  ...
  2017-02-16 22:22:11.140+0000: 8968: debug : qemuProcessHandleStop:716 : Transitioned guest instance-00000001 to paused state

But I don't see any errors in the libvirt logs.
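When triaging recurrences of this, the quickest signal is that "KVM: entry failed" line in the per-instance QEMU log. A small helper for scanning a downloaded CI log, purely illustrative (the function name is made up here; it is not part of any OpenStack or libvirt tooling):

```python
def find_kvm_entry_failures(log_text):
    """Return the lines of a QEMU instance log that report a failed KVM entry.

    The marker string matches the "KVM: entry failed, hardware error 0x0"
    line quoted above; this is a sketch for grepping downloaded CI logs.
    """
    return [line for line in log_text.splitlines()
            if 'KVM: entry failed' in line]
```

Any non-empty result means the guest vCPU failed to enter, which matches the libvirt started-then-immediately-paused transition seen in the n-cpu log.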
Right around that time I see this in syslog:

http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/syslog.txt.gz#_Feb_16_22_22_11

  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: libvirt version: 1.3.1, package: 1ubuntu10.8 (Christian Ehrhardt <christian.ehrha...@canonical.com> Mon, 06 Feb 2017 14:30:46 +0100)
  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: hostname: ubuntu-xenial-2-node-ovh-bhs1-7345556-429938
  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: End of file while reading data: Input/output error

I guess the virtlogd I/O error is pretty normal, since it shows up all over the syslogs in our CI runs; maybe it's a result of the hardware error in the guest?

** Affects: nova
     Importance: High
         Status: Triaged

** Summary changed:

- Live migration tests sometimes timeout waiting for instance to be ACTIVE
+ Live migration tests sometimes timeout waiting for instance to be ACTIVE: "KVM: entry failed, hardware error 0x0"

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1665487

Title:
  Live migration tests sometimes timeout waiting for instance to be
  ACTIVE: "KVM: entry failed, hardware error 0x0"

Status in OpenStack Compute (nova):
  Triaged

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1665487/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp