From looking at the dstat output, the node in question is above a load average of 11 for nearly 2 hours; your error happens about an hour into that window.
Realistically, that's just too much work being asked of the node. We have found in the gate that once you get sustained load average over 10, things start to break down. There is no bug fix for this; it's just fallout of our architecture. Marking as Won't Fix, as I don't think there is anything actionable here. If you have performance improvements in your environment that make this better, that's great. However, there are bounds beyond which the nova compute worker just falls over, and there is not much to be done about it.

** Changed in: nova
       Status: New => Won't Fix

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1547544

Title:
  heat: MessagingTimeout: Timed out waiting for a reply to message ID

Status in OpenStack Compute (nova):
  Won't Fix
Status in oslo.messaging:
  New

Bug description:
  Setup:
  - Single controller [48 GB RAM, 16 vCPU, 120 GB disk]
  - 3 network nodes
  - 100 ESX hypervisors distributed across 10 nova-compute nodes

  Test:
  1. Create a /16 network
  2. Heat template which will launch 100 instances on the network created in step 1
  3. Create 10 stacks back to back so that we reach 1000 instances without waiting for the previous stack to complete (a client-side sketch of this appears after the logs below)

  Observation:
  Stack creations are failing while nova runs its periodic tasks (run_periodic_tasks), at different places such as _heal_instance_info_cache, _sync_scheduler_instance_info, _update_available_resource, etc.

  Have attached a sample heat template, the heat logs, and the nova compute log from one of the hosts.

  Logs:
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     return f(*args, **kwargs)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 553, in _update_available_resource
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     context, self.host, self.nodename)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 174, in wrapper
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     args, kwargs)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/opt/stack/nova/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     args=args, kwargs=kwargs)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     retry=self.retry)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     timeout=timeout, retry=retry)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 465, in send
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     retry=retry)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 454, in _send
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     result = self._waiter.wait(msg_id, timeout)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 337, in wait
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     message = self.waiters.get(msg_id, timeout=timeout)
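  What the traceback shows is the _update_available_resource periodic task blocking in oslo.messaging, waiting for a conductor reply that never arrives before the client timeout expires. A minimal sketch of that call path, with an assumed broker URL and an illustrative method invocation (not nova's actual conductor wiring):

      # Sketch only: reproduces the MessagingTimeout failure mode, assuming
      # a RabbitMQ broker at the (hypothetical) transport URL below and no
      # server consuming on the target topic.
      from oslo_config import cfg
      import oslo_messaging as messaging

      transport = messaging.get_transport(
          cfg.CONF, url='rabbit://guest:guest@controller:5672/')
      target = messaging.Target(topic='conductor')
      # timeout mirrors the rpc_response_timeout default of 60 seconds
      client = messaging.RPCClient(transport, target, timeout=60)

      try:
          # Illustrative call; nova issues object_class_action_versions
          # through nova.conductor.rpcapi with real arguments.
          client.call({}, 'object_class_action_versions', args=(), kwargs={})
      except messaging.MessagingTimeout as exc:
          # Raised when no reply with the matching msg_id arrives on the
          # reply queue before the deadline; the exception in the logs.
          print('RPC reply timed out: %s' % exc)

  Raising rpc_response_timeout can absorb occasionally slow replies, but per the comment at the top, with a sustained load average over 10 the replies fall behind faster than any timeout can cover.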
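  For reference, the back-to-back stack creation from the Test steps could be driven with python-heatclient roughly like this; the auth values, template file name, and network parameter are placeholders rather than details from the report:

      # Sketch of the load-generation side: 10 stacks of 100 instances,
      # created without waiting for the previous stack to complete.
      from keystoneauth1 import loading, session
      from heatclient import client as heat_client

      loader = loading.get_plugin_loader('password')
      auth = loader.load_from_options(
          auth_url='http://controller:5000/v3',  # placeholder
          username='admin', password='secret',   # placeholder
          project_name='admin',
          user_domain_name='Default', project_domain_name='Default')
      heat = heat_client.Client('1', session=session.Session(auth=auth))

      with open('100-instances.yaml') as f:  # template booting 100 servers
          template = f.read()

      for i in range(10):
          # No wait between create calls, per step 3 of the Test section.
          heat.stacks.create(
              stack_name='load-test-%d' % i,
              template=template,
              parameters={'network_id': 'NET_ID'})  # /16 net from step 1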
  2016-02-19 04:21:54.691 TRACE nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 239, in get
  2016-02-19 04:21:54.691 TRACE nova.compute.manager     'to message ID %s' % msg_id)
  2016-02-19 04:21:54.691 TRACE nova.compute.manager MessagingTimeout: Timed out waiting for a reply to message ID a87a7f358a0948efa3ab5beb0c8f45e7

  --

  stack@esx-compute-9:/opt/stack/nova$ git log -1
  commit d51c5670d8d26e989d92eb29658eed8113034c0f
  Merge: 4fade90 30d5d80
  Author: Jenkins <jenk...@review.openstack.org>
  Date:   Thu Feb 18 17:56:32 2016 +0000

      Merge "reset task_state after select_destinations failed."
  stack@esx-compute-9:/opt/stack/nova$

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1547544/+subscriptions