[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643891#comment-14643891
 ] 

ASF subversion and git services commented on CLOUDSTACK-8666:
-------------------------------------------------------------

Commit 6c3c9ea915b486722c6d41491338531254335272 in cloudstack's branch 
refs/heads/master from [~koushikd]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=6c3c9ea ]

Unit tests for HA manager investigate method. Refer to CLOUDSTACK-8666 for the 
code changes


> Put host in Alert state only after alert.wait timeout
> -----------------------------------------------------
>
>                 Key: CLOUDSTACK-8666
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8666
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Management Server
>    Affects Versions: 4.5.0, 4.6.0
>            Reporter: Koushik Das
>            Assignee: Koushik Das
>             Fix For: 4.6.0
>
>
> When there is a ping timeout on a host, investigators try to determine the 
> state of the host. If none of the investigators is able to determine the 
> host state, the process is repeated after some time. This works most of the 
> time, except in some boundary scenarios. For example, if the last host or 
> all hosts in an XS cluster are brought down, the investigators are not able 
> to determine the host state and the investigation process never completes. 
> In such scenarios the host state always remains Up.
> To address these boundary scenarios, a fix was made (refer to commit 
> 4a13f81485c0f0664c60acafe534946e7428f080) to put the host in Alert state 
> immediately if the investigators are not able to determine the state after 
> the ping timeout.
> The fix solved the boundary scenarios but introduced a new issue. Suppose 
> there is an XS cluster with 2 hosts and the master host is brought down. In 
> this case XS elects a new master for the cluster. Since the master is down, 
> the investigators won't be able to determine the host state until a new 
> master is elected. If this master election takes longer than the ping 
> timeout to complete, the host is put in Alert based on the above fix. Once 
> this happens, the host remains in Alert state and no actions are taken on 
> the VMs on this host. In this case, if the investigators had been allowed to 
> run 1 or 2 more times, the new master election would possibly have completed 
> and the host state would have been determined correctly.
> To fix both of these issues, instead of putting the host in Alert state 
> immediately, the investigators should be allowed to run for some time based 
> on the alert.wait global config. If the host state still cannot be 
> determined at the end of this interval, the host is put in Alert.
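
The behaviour proposed in the description can be illustrated with a minimal
Java sketch. This is not the code from the commit: the class AlertWaitSketch,
the Investigator interface, the decide method and all parameter names are
hypothetical, and the alert.wait value is passed in as milliseconds purely
for the sake of the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; none of these names come from CloudStack itself.
class AlertWaitSketch {
    enum Status { UP, DOWN, ALERT }

    /** Assumed investigator contract: empty means "could not determine". */
    interface Investigator {
        Optional<Status> investigate(String hostId);
    }

    /**
     * Decide the host status after a ping timeout.
     * The host moves to ALERT only once the alert.wait interval has elapsed
     * without any investigator determining its state.
     */
    static Status decide(List<Investigator> investigators, String hostId,
                         long firstPingTimeoutMs, long nowMs,
                         long alertWaitMs, Status current) {
        for (Investigator inv : investigators) {
            Optional<Status> result = inv.investigate(hostId);
            if (result.isPresent()) {
                return result.get(); // an investigator determined the state
            }
        }
        if (nowMs - firstPingTimeoutMs >= alertWaitMs) {
            return Status.ALERT;     // still unknown after the alert.wait interval
        }
        return current;              // keep the current state, investigate again later
    }

    public static void main(String[] args) {
        Investigator undetermined = h -> Optional.empty();
        List<Investigator> investigators = Arrays.asList(undetermined);
        // Within the interval: the state is left unchanged.
        System.out.println(decide(investigators, "host-1", 0L, 1_000L, 5_000L, Status.UP));
        // Interval elapsed with no determination: the host goes to Alert.
        System.out.println(decide(investigators, "host-1", 0L, 5_000L, 5_000L, Status.UP));
    }
}
```

In this sketch a caller would invoke decide on each investigation round, so a
slow master election gets further rounds to finish before the host is marked
Alert, while a fully powered-off cluster still reaches Alert once the interval
expires.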



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
