Does KVMInvestigator work on all shared primary storage, or just NFS? I'm only familiar with the NFS KVMHA directories.
From this it seems like a clean stop of the KVM agent still shouldn't trigger any issues/HA, correct?

On Wed, Aug 7, 2013 at 2:28 PM, Edison Su <[email protected]> wrote:
> There is a long-standing issue related to KVM HA; see bug CLOUDSTACK-3535.
> Basically, HA won't be triggered if the KVM agent is stopped, either normally
> or abnormally. HA is only triggered if the network between the mgt server and
> the KVM host is disconnected and the network between KVM hosts in the same
> cluster is disconnected.
>
> Here is how KVM HA works after the fix for CLOUDSTACK-3535:
>
> 1. If the agent is stopped, the agent sends a shutdown request to the mgt
>    server. The mgt server marks the host as disconnected, while still keeping
>    the host in pingmap. Code is in
>    AgentManagerImpl->AgentHandler->ProcessRequest->disconnectWithoutInvestigation
> 2. After ping.interval, the mgt server finds that the host has timed out on
>    ping, then starts HA investigation for the host. Code is in
>    AgentMonitor->run->disconnectWithInvestigation
> 3. The mgt server calls all the available Investigators to investigate the
>    status of the host. The investigators currently called for a KVM host are:
>
>      UserVmDomRInvestigator->isAgentAlive, which sends PingTestCommand to
>      the host's neighbor. PingTestCommand pings the host's private IP
>      address. If the ping succeeds, the host is up; otherwise, the host's
>      state is unknown. So this investigator can only detect that a host is
>      up.
>
>      KVMInvestigator, which is newly added, sends a CheckOnHostCommand to
>      the host's neighbor. CheckOnHostCommand checks the heartbeat of the
>      host (the heartbeat is stored on shared primary storage). Ideally, it
>      detects whether the host is down or up.
>
>    With UserVmDomRInvestigator and KVMInvestigator combined, the mgt server
>    should find out the status of the host. But there is a case where these
>    two investigators can report the wrong status: the host is in a network
>    partition while the KVM agent is down (thus the heartbeat is stopped).
> 4. After the investigators report the status of the host, if the host is
>    down, HA is started for the VMs created on the host.
>
> Improvement:
> Per the suggestion from Lennert den Teuling, we'd better use IPMI to detect
> host status, which is more reliable than ping and heartbeat, as IPMI has its
> own network and is less likely to suffer a network partition.
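
To check my reading of steps 3 and 4 above, here is a minimal sketch in Python. It is not the CloudStack code (which is Java). All names in it (Status, make_ping_investigator, make_heartbeat_investigator, investigate_host) are hypothetical, and the rule that the first definite answer wins is my assumption about how the results of the investigators are combined.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, Optional, Set


class Status(Enum):
    UP = "Up"
    DOWN = "Down"


# An investigator returns Status.UP, Status.DOWN, or None when it cannot tell.
Investigator = Callable[[str], Optional[Status]]


def make_ping_investigator(reachable: Set[str]) -> Investigator:
    """Ping-style check (hypothetical): can prove a host is up, never down."""
    def investigate(host: str) -> Optional[Status]:
        return Status.UP if host in reachable else None
    return investigate


def make_heartbeat_investigator(
    last_beat: Dict[str, float], now: float, timeout: float
) -> Investigator:
    """Heartbeat-style check (hypothetical): fresh beat is up, stale is down."""
    def investigate(host: str) -> Optional[Status]:
        beat = last_beat.get(host)
        if beat is None:
            return None  # no heartbeat record visible, so the state is unknown
        return Status.UP if now - beat <= timeout else Status.DOWN
    return investigate


def investigate_host(
    host: str, investigators: Iterable[Investigator]
) -> Optional[Status]:
    """Ask each investigator in turn; the first definite answer wins (assumed)."""
    for investigate in investigators:
        status = investigate(host)
        if status is not None:
            return status
    return None


if __name__ == "__main__":
    investigators = [
        make_ping_investigator(reachable={"host-a"}),
        make_heartbeat_investigator(
            last_beat={"host-a": 100.0, "host-b": 10.0}, now=120.0, timeout=60.0
        ),
    ]
    for host in ("host-a", "host-b", "host-c"):
        print(host, investigate_host(host, investigators))
```

In this sketch host-b is unreachable by ping and has a stale heartbeat, so it is reported as down. That is the same observation the investigators would make in the problem case from step 3: a host that is partitioned from the network with its agent stopped looks identical, even if its VMs are still running.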
