[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640021#comment-14640021
 ] 

ASF GitHub Bot commented on CLOUDSTACK-8666:
--------------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/cloudstack/pull/621


> Put host in Alert state only after alert.wait timeout
> -----------------------------------------------------
>
>                 Key: CLOUDSTACK-8666
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8666
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.5.0, 4.6.0
>            Reporter: Koushik Das
>            Assignee: Koushik Das
>             Fix For: 4.6.0
>
>
> When there is a ping timeout on a host, investigators try to determine the 
> state of the host. If none of the investigators is able to determine the 
> host state, the process is repeated after some time. This works most of the 
> time, except in some boundary scenarios. For example, if the last host or all 
> hosts in an XS cluster are brought down, the investigators are not able to 
> determine the host state and the investigation process never completes. In 
> such scenarios the host state always remains Up.
> In order to fix these boundary scenarios, a fix was made (refer to commit 
> 4a13f81485c0f0664c60acafe534946e7428f080) to immediately put the host in 
> Alert state if investigators are not able to determine the state after ping 
> timeout.
> The fix solved the boundary scenarios but introduced a new issue. Suppose 
> there is an XS cluster with 2 hosts and the master host is brought down. In 
> this case XS elects a new master for the cluster. Since the master is down, 
> investigators won't be able to determine the host state until a new master 
> is elected. If this master election takes longer than the ping timeout to 
> complete, the host is put in Alert by the above fix. Once this happens, the 
> host remains in Alert state and no actions are taken on the VMs on this 
> host. In this case, if the investigators had been allowed to run 1 or 2 more 
> times, the new master election would possibly have completed and the host 
> state would have been correctly determined.
> In order to fix both these issues, instead of putting the host in Alert 
> state immediately, the investigators should be allowed to run for some time 
> based on the alert.wait global config. If the host state still cannot be 
> determined at the end of this interval, the host is put in Alert.
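
The proposed behaviour can be sketched roughly as follows. This is a minimal illustration only, not the code from the pull request: the class name, the Investigator interface, the decide method and the use of seconds for the interval are all assumptions made for the example. Only the alert.wait setting and the Up/Alert host states come from the issue text.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these names are taken from the CloudStack code base.
public class AlertWaitSketch {
    enum HostState { UP, DOWN, ALERT }

    /** Hypothetical investigator: returns a host state, or empty if it cannot tell. */
    interface Investigator {
        Optional<HostState> investigate();
    }

    /**
     * Decide what to do with a host after its ping timeout.
     * An empty result means "state unknown, investigate again later".
     * Both time arguments are assumed to be in seconds.
     */
    static Optional<HostState> decide(List<Investigator> investigators,
                                      long secondsSincePingTimeout,
                                      long alertWaitSeconds) {
        // The first investigator that can determine the state wins.
        for (Investigator inv : investigators) {
            Optional<HostState> state = inv.investigate();
            if (state.isPresent()) {
                return state;
            }
        }
        // Nobody could tell: go to Alert only once the alert.wait interval has elapsed.
        if (secondsSincePingTimeout >= alertWaitSeconds) {
            return Optional.of(HostState.ALERT);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // An investigator that can never determine the state, e.g. while
        // a new master is still being elected.
        Investigator undecided = () -> Optional.empty();
        List<Investigator> investigators = Collections.singletonList(undecided);

        // Shortly after the ping timeout: keep investigating.
        System.out.println(decide(investigators, 60, 1800));
        // After the full interval: put the host in Alert.
        System.out.println(decide(investigators, 1800, 1800));
    }
}
```

With this shape, the two-host master election case gets further investigation rounds before Alert, while the all-hosts-down case still ends in Alert once the interval expires instead of staying Up indefinitely.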



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
