Nova can currently detect that a host is unreachable, but it cannot distinguish
between host isolation, a dead host, and the nova compute service being down.
When a host is reported as unreachable, users have to work out the exact state
themselves and then take the appropriate measures to recover. We would
therefore like to improve host detection in nova.
Currently the service group API factors out host detection into a set of
abstract internal APIs with a pluggable backend implementation. The backend we
have designed works as follows:
A central detection agent is introduced. When a member joins the service
group, the member host starts sending network heartbeats to the central agent
and periodically writes a timestamp to shared storage. When the central agent
stops receiving network heartbeats from a member, it pings the member and
checks the storage heartbeat before declaring that the host has failed.
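
To make the first step concrete, here is a rough sketch of how the central
agent could track network heartbeats and pick out the members that need the
follow-up ping and storage check. It is only an illustration: the
HeartbeatTracker class, its method names and the 10 second timeout are all
hypothetical and are not part of nova or of the service group API.

```python
import time


class HeartbeatTracker:
    """Hypothetical central-agent bookkeeping; not an existing nova class."""

    def __init__(self, timeout=10.0, clock=time.monotonic):
        # Seconds without a network heartbeat before a member is suspect
        # (the value is an assumption, not taken from the proposal).
        self.timeout = timeout
        self._clock = clock
        # Member host name -> time the last network heartbeat arrived.
        self._last_seen = {}

    def join(self, host):
        """Register a member when it joins the service group."""
        self._last_seen[host] = self._clock()

    def heartbeat(self, host):
        """Record a network heartbeat received from a member."""
        self._last_seen[host] = self._clock()

    def suspects(self):
        """Return members whose network heartbeat has timed out.

        These are the hosts the agent would then ping and whose storage
        heartbeat it would check before declaring a failure.
        """
        now = self._clock()
        return sorted(
            host
            for host, seen in self._last_seen.items()
            if now - seen > self.timeout
        )
```

The clock is injectable only so the timeout logic can be exercised without
sleeping; a real agent would also need the ping and the shared-storage
timestamp check, which are left out here.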
---------------------------------------------------------------------------------------------------------------------
network heartbeat | network ping | storage heartbeat | state               | reason
------------------|--------------|-------------------|---------------------|-----------------------------------------
OK                | -            | -                 | Running             | -
Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
Not OK            | OK           | Not OK            | Service unreachable | service process crashed
Not OK            | Not OK       | OK                | Isolated            | network unreachable
---------------------------------------------------------------------------------------------------------------------
Based on this state recognition table, nova can determine the exact host state
and assign a reason to it.
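
As a minimal sketch, the table lookup could be written as below. The names
HostState and classify_host are hypothetical, chosen for illustration only.
The table does not list the case where the network heartbeat is missing but
both the ping and the storage heartbeat succeed, so the sketch returns UNKNOWN
there; that value is my assumption and not part of the proposal.

```python
from enum import Enum


class HostState(Enum):
    RUNNING = "Running"
    DEAD = "Dead"
    SERVICE_UNREACHABLE = "Service unreachable"
    ISOLATED = "Isolated"
    UNKNOWN = "Unknown"  # assumption: combination not listed in the table


# (network ping OK, storage heartbeat OK) -> state, used only when the
# network heartbeat is missing.
_NO_HEARTBEAT_STATES = {
    (False, False): HostState.DEAD,                # hardware failure/abnormal shut down
    (True, False): HostState.SERVICE_UNREACHABLE,  # service process crashed
    (False, True): HostState.ISOLATED,             # network unreachable
}


def classify_host(heartbeat_ok, ping_ok, storage_ok):
    """Map the three check results to a host state per the table."""
    if heartbeat_ok:
        # With a live network heartbeat the other two checks are not consulted.
        return HostState.RUNNING
    return _NO_HEARTBEAT_STATES.get((ping_ok, storage_ok), HostState.UNKNOWN)
```

For example, classify_host(False, False, True) gives HostState.ISOLATED, which
matches the last row of the table.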
Thoughts?
Jenny
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev