There is a known issue [0] in Oslo messaging, and it seems to be resolved in Kilo. But the UX of this one is very sad. For example, each time your AMQP cluster goes through a single-node failover and recovers, there is a chance that some OpenStack services, such as Nova Compute, may get stuck in a broken state, and only a restart will heal them.
The typical log pattern for a service in this broken state is "Timed out waiting for reply". Hence, it may be a good idea to implement monitoring filters based on that pattern and automatically set an alert status for the affected OpenStack services.

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
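
P.S. A minimal sketch of such a filter in Python, in case it helps. Only the "Timed out waiting for reply" string comes from the message above; the function names, the threshold parameter and the OK/CRITICAL status strings are my own assumptions for illustration and would need adapting to whatever monitoring system you run.

```python
import re

# The only part taken from the message above: the log pattern to look for.
TIMEOUT_PATTERN = re.compile(r"Timed out waiting for reply")


def count_timeouts(lines):
    """Return how many of the given log lines contain the timeout pattern."""
    return sum(1 for line in lines if TIMEOUT_PATTERN.search(line))


def alert_status(lines, threshold=1):
    """Map a batch of log lines to a status string.

    'threshold' is a hypothetical tuning knob, not a value from the bug report.
    """
    return "CRITICAL" if count_timeouts(lines) >= threshold else "OK"


def check_log(path, threshold=1):
    """Scan one log file (path chosen by the operator) and return its status."""
    with open(path, errors="replace") as log:
        return alert_status(log, threshold)
```

A check like this scans the whole file, so in practice you would probably feed it only the lines written since the last run, otherwise an old timeout keeps the alert raised after the service has been restarted.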