Public bug reported:
tl;dr The live migration monitoring thread writes migration stats to the
database every 5 seconds. If a live migration runs for long enough,
these writes can overwhelm the message queue, culminating in a
MessagingTimeout raised during the RPC call.
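The arithmetic behind the queue pressure can be sketched as follows. The poll interval and per-poll save cadence below are assumptions chosen only to model the reported 5-second write interval, not values read from nova's source:

```python
# Rough model of the write amplification: a monitor loop polling every
# poll_interval_s seconds and persisting migration stats to the
# database every save_every_polls polls.
def count_stat_saves(duration_s, poll_interval_s=0.5, save_every_polls=10):
    """Return how many database writes (conductor RPC round trips) the
    monitor would issue over a migration lasting duration_s seconds."""
    polls = int(duration_s / poll_interval_s)
    return polls // save_every_polls


# A one-hour migration at the 5-second cadence issues 720 writes;
# stretching the cadence to 30 seconds cuts that to 120.
print(count_stat_saves(3600))                       # 5 s cadence -> 720
print(count_stat_saves(3600, save_every_polls=60))  # 30 s cadence -> 120
```

Each of those writes is a synchronous RPC to the conductor, so a long-running migration turns into a steady stream of blocking round trips through the message queue.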
This issue was discovered while investigating a downstream bug [1] in
which a live migration was dirtying memory faster than it could be
transferred, and the live migration failed in nova-compute with a
MessagingTimeout, for example:
Live migration failed.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 425, in get
    return self._queues[msg_id].get(block=True, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
    return waiter.wait()
  File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
    return get_hub().switch()
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 7466, in _do_live_migration
    block_migration, migrate_data)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8858, in live_migration
    migrate_data)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
    finish_event, disk_paths)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
    info, remaining)
  File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
    migration.save()
  File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
    ctxt, self, fn.__name__, args, kwargs)
  File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 246, in object_action
    objmethod=objmethod, args=args, kwargs=kwargs)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 662, in _send
    call_monitor_timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 551, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 429, in get
    'to message ID %s' % msg_id)
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
The bug reporter was able to reproduce the problem in a lab
environment. After we tracked the issue down to the monitoring
thread's database writes as the likely culprit, he ran tests [2][3]
with a less frequent database write interval (once every 30 seconds,
for example), and the MessagingTimeout stopped occurring.
Note that the live migration still had other issues: it was dirtying
memory faster than the transfer and did not have
live_migration_timeout_action set to force_complete. With the slower
database write interval, however, there were no MessagingTimeout
errors from the message queue, and the migration failed as expected
because the live migration itself timed out, unable to get ahead of
the memory dirtying.
My thought on how to address this is to make the database write
interval in the live migration monitoring thread configurable, so that
operators can tune it to their deployment environments.
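The proposed fix could look something like the sketch below: throttle the monitor's migration.save() calls to at most one per configurable interval. The class and parameter names here are illustrative assumptions, not nova's actual API; in nova the interval would presumably come from a new config option:

```python
import time

# Hypothetical sketch: wrap the save callable and only let a write
# through when at least save_interval seconds have elapsed since the
# last one. The clock parameter is injectable for testing.
class ThrottledStatSaver:
    def __init__(self, save_fn, save_interval=5.0, clock=time.monotonic):
        self._save_fn = save_fn          # e.g. lambda m: m.save()
        self._interval = save_interval   # seconds between DB writes
        self._clock = clock
        self._last_save = None

    def maybe_save(self, migration):
        """Persist stats only if save_interval has elapsed; return
        whether a write was actually issued."""
        now = self._clock()
        if self._last_save is None or now - self._last_save >= self._interval:
            self._save_fn(migration)
            self._last_save = now
            return True
        return False
```

With save_interval raised to 30.0, the monitor loop could keep polling libvirt at its usual rate while issuing one-sixth as many conductor RPCs, which is the effect the lab tests approximated.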
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c23
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c24
** Affects: nova
Importance: Undecided
Status: New
** Tags: compute libvirt live-migration
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2091138
Title:
Rapid live migration monitor database writes can lead to
MessagingTimeout
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2091138/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp