I'm not sure we can rely on IPMI to tell us much about the status of the host itself. It's easy to use for checking basic things like power on/off and temperature, but it can't easily tell us whether something is wrong with the OS, the config, or anything else at the software level.
However, early in that thread I did voice support for sending an IPMI power-off, as a safety precaution, to any host that CloudStack has determined is down and has started migrating VMs away from.

On Wed, Aug 7, 2013 at 2:41 PM, Marcus Sorensen <shadow...@gmail.com> wrote:
> Does KVMInvestigator work on all shared primary storage, or just NFS?
> I'm only familiar with the NFS KVMHA directories.
>
> From this it seems like a clean stop of the KVM agent still shouldn't
> trigger any issues/HA, correct?
>
> On Wed, Aug 7, 2013 at 2:28 PM, Edison Su <edison...@citrix.com> wrote:
>> There is long time issue related to KVM HA, see bug: CLOUDSTACK-3535.
>> Basically, HA won't be triggered, if KVM agent is stopped either normally
>> nor abnormally, HA only be triggered if the network between mgt server and
>> kvm host is disconnected and the network between KVM hosts in the same
>> cluster is disconnected.
>> Here is how the KVM HA works after the fix for CLOUDSTACK-3535:
>> 1. If agent is stopped, agent will send a shutdown request to mgt server,
>> mgt server will mark the host as disconnected, while still maintain the host
>> in pingmap. Code is in AgentManagerImpl->AgentHandler->ProcessRequest->
>> disconnectWithoutInvestigation
>> 2. After ping.interval, mgt server will find the host is ping timeout, then
>> start HA investigation for the host. Code is in AgentMonitor->run->
>> disconnectWithInvestigation
>> 3. Mgt server will call all the available Investigators to investigate the
>> status of host.
>> The current investigators will be called for KVM host:
>> UserVmDomRInvestigator->isAgentAlive, will send PingTestCommand to
>> the host's neighbor. PingTestCommand will ping host's private ip address, if
>> ping is reachable, means host is up, otherwise, host's state is unknown. So
>> this investigator can only detect host is in up state.
>> KVMInvestigator, which is newly added, will send a
>> CheckOnHostCommand to host's neighbor. CheckOnHostCommand will check the
>> heartbeat of host(heartbeat is stored on shared primary storage). Ideally,
>> it will detect host is down or up.
>>
>> Combined with UserVmDomRInvestigator and KVMInvestigator, mgt server
>> should find out the status of host. But there is case, these two
>> investigators can report wrong status of host:
>> Host is in a network partition, while the KVM agent is down(thus
>> heartbeat is stopped)
>> 4. After investigator reports status of host, if host is down, then start HA
>> for VMs created on the host.
>>
>>
>> Improvement:
>> Per suggestion from Lennert den Teuling, we'd better use IPMI to
>> detect host status, which is more reliable than ping and heartbeat, as IPMI
>> has its own network, less likely has network partition.
>>
>>
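
To make steps 3 and 4 of the quoted description concrete, here is a rough Python sketch of how I read the investigation flow. It is only an illustration, not CloudStack's actual Java code: the names `ping_investigator`, `heartbeat_investigator`, `investigate_host`, and `handle_ping_timeout` are made up, the first-definite-answer-wins ordering is my assumption, and the `fence` callback stands in for the IPMI power-off precaution mentioned above.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class Status(Enum):
    UP = "Up"
    DOWN = "Down"


# An investigator returns a Status, or None when it cannot tell (unknown).
Investigator = Callable[[str], Optional[Status]]


def ping_investigator(reachable_hosts: set) -> Investigator:
    """Stand-in for the ping check: it can only ever prove a host is up."""
    def investigate(host: str) -> Optional[Status]:
        return Status.UP if host in reachable_hosts else None
    return investigate


def heartbeat_investigator(fresh_heartbeats: set) -> Investigator:
    """Stand-in for the heartbeat check on shared primary storage."""
    def investigate(host: str) -> Optional[Status]:
        return Status.UP if host in fresh_heartbeats else Status.DOWN
    return investigate


def investigate_host(host: str,
                     investigators: Iterable[Investigator]) -> Optional[Status]:
    """Ask each investigator in turn; the first definite answer wins."""
    for investigate in investigators:
        result = investigate(host)
        if result is not None:
            return result
    return None


def handle_ping_timeout(host: str,
                        investigators: Iterable[Investigator],
                        fence: Callable[[str], None],
                        start_ha: Callable[[str], None]) -> Optional[Status]:
    """Step 4: start HA only when the host is reported down."""
    status = investigate_host(host, investigators)
    if status is Status.DOWN:
        fence(host)     # hypothetical safety precaution, e.g. an IPMI power-off
        start_ha(host)  # restart the host's VMs elsewhere
    return status


if __name__ == "__main__":
    fenced, restarted = [], []
    checks = [ping_investigator({"kvm1"}), heartbeat_investigator({"kvm1"})]
    # kvm2 is unreachable and its heartbeat is stale, so it is treated as down.
    print(handle_ping_timeout("kvm2", checks, fenced.append, restarted.append))
    print(fenced, restarted)
```

Note that the sketch reproduces the problem case from the quoted mail: a host that is network-partitioned with its agent stopped looks exactly like `kvm2` here, which is why fencing it before starting HA seems like a sensible precaution.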