There is a known issue [0] in oslo.messaging which seems to be resolved
in Kilo, but the user experience it causes is poor. For example, each
time your AMQP cluster goes through a single-node failover and recovers,
there is a chance that some OpenStack services, such as Nova Compute,
get stuck in a broken state that only a restart can fix.

The typical log pattern for a service in this broken state is "Timed
out waiting for reply". Hence, it may be a good idea to implement
monitoring filters based on that pattern and automatically set an alert
status for affected OpenStack services.
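
As a rough illustration of such a filter, here is a minimal Python
sketch. It assumes you already collect the log lines of each service
somewhere; the function names, the threshold parameter and the sample
lines are hypothetical and not taken from any existing monitoring tool
or from real service logs. Only the matched message text comes from the
pattern quoted above.

```python
import re

# Message text quoted above; everything else in this sketch is assumed.
PATTERN = re.compile(r"Timed out waiting for reply")


def count_timeouts(lines):
    """Return how many of the given log lines contain the pattern."""
    return sum(1 for line in lines if PATTERN.search(line))


def alert_status(log_lines_by_service, threshold=1):
    """Map each service name to "ALERT" or "OK".

    threshold is a hypothetical tuning knob: the number of matching
    lines at which a service is considered affected.
    """
    return {
        service: "ALERT" if count_timeouts(lines) >= threshold else "OK"
        for service, lines in log_lines_by_service.items()
    }


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = {
        "nova-compute": ["ERROR ... Timed out waiting for reply ..."],
        "nova-scheduler": ["INFO ... periodic task finished"],
    }
    for service, status in sorted(alert_status(sample).items()):
        print(service, status)
```

In practice you would probably count matches only within a recent time
window, so that an old timeout does not keep a healthy service in alert
status forever.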

[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
