[ https://issues.apache.org/jira/browse/CLOUDSTACK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639882#comment-14639882 ]
ASF GitHub Bot commented on CLOUDSTACK-8666:
--------------------------------------------

Github user kishankavala commented on the pull request:

    https://github.com/apache/cloudstack/pull/621#issuecomment-124323020

    LGTM

> Put host in Alert state only after alert.wait timeout
> -----------------------------------------------------
>
>                 Key: CLOUDSTACK-8666
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8666
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.5.0, 4.6.0
>            Reporter: Koushik Das
>            Assignee: Koushik Das
>             Fix For: 4.6.0
>
>
> When there is a ping timeout on a host, investigators try to determine the
> state of the host. If none of the investigators is able to determine the
> host state, the process is repeated after some time. This works most of the
> time, except in some boundary scenarios. For example, if the last host or
> all hosts in an XS cluster are brought down, the investigators are not able
> to determine the host state and the investigation process never completes.
> In such scenarios the host state always remains Up.
>
> In order to fix these boundary scenarios, a fix was made (refer to commit
> 4a13f81485c0f0664c60acafe534946e7428f080) to immediately put the host in
> Alert state if the investigators are not able to determine the state after
> a ping timeout.
>
> The fix solved the boundary scenarios but introduced a new issue. Suppose
> there is an XS cluster with 2 hosts and the master host is brought down. In
> this case XS elects a new master for the cluster. Since the master is down,
> the investigators won't be able to determine the host state until a new
> master is elected. If this master election takes longer than the ping
> timeout to complete, the host is put in Alert based on the above fix. Once
> this happens, the host remains in Alert state and no actions are taken on
> the VMs on this host. In this case, if the investigators had been allowed
> to run 1 or 2 more times, the new master election would possibly have
> completed and the host state would have been determined correctly.
>
> In order to fix both these issues, instead of putting the host in Alert
> state immediately, the investigators should be allowed to run for some time
> based on the alert.wait global config. If the host state still cannot be
> determined at the end of this interval, then put the host in Alert.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
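
The retry policy proposed in the issue description (keep investigating until the alert.wait interval has elapsed, and only then move the host to Alert) can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class AlertWaitSketch, the Investigator interface, the HostState enum and the nextState method are all hypothetical names, and passing the interval in seconds is an assumption of this sketch.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the CloudStack code base.
class AlertWaitSketch {
    enum HostState { UP, DOWN, ALERT }

    // Hypothetical investigator: returns a state only when it can determine one.
    interface Investigator {
        Optional<HostState> investigate(String hostId);
    }

    /**
     * Decide the next state of a host whose ping has timed out.
     * An empty result means: leave the current state unchanged and
     * repeat the investigation later.
     */
    static Optional<HostState> nextState(String hostId,
                                         List<Investigator> investigators,
                                         long secondsSincePingTimeout,
                                         long alertWaitSeconds) {
        for (Investigator investigator : investigators) {
            Optional<HostState> determined = investigator.investigate(hostId);
            if (determined.isPresent()) {
                return determined; // an investigator gave a definite answer
            }
        }
        // No investigator could decide. Give up and mark the host as Alert
        // only once the assumed alert.wait interval has fully elapsed.
        if (secondsSincePingTimeout >= alertWaitSeconds) {
            return Optional.of(HostState.ALERT);
        }
        return Optional.empty();
    }
}
```

With an interval of 1800 seconds, a host whose state is still undetermined 60 seconds after the ping timeout gets an empty result and is investigated again, which leaves time for something like an XS master election to finish. The same host is moved to Alert once 1800 seconds have passed, which covers the case where all hosts in the cluster are down.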