[ https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383335#comment-16383335 ]

ASF GitHub Bot commented on CLOUDSTACK-10246:
---------------------------------------------

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171784246
 
 

 ##########
 File path: engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##########
 @@ -843,72 +846,103 @@ protected boolean handleDisconnectWithInvestigation(final AgentAttache attache,
                 s_logger.debug("Caught exception while getting agent's next status", ne);
             }
 
+            // For log and alert purposes later
+            final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+            final HostPodVO podVO = _podDao.findById(host.getPodId());
+            final String hostDesc = "[name: " + host.getName() + " (id:" + host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName() + "]";
+            final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";
+
+            final ResourceState resourceState = host.getResourceState();
+            if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) {
+                // If we are in this resourceState, no need to investigate or do anything.  AgentMonitor will handle when in these resourceStates
+                s_logger.info(hostShortDesc + " has disconnected with event " + event + ",  but is in Resource State of " + resourceState + ", so doing nothing");
+                return true;
+            }
+
             if (nextStatus == Status.Alert) {
-                /* OK, we are going to the bad status, let's see what happened */
-                s_logger.info("Investigating why host " + hostId + " has disconnected with event " + event);
+                /* Our next Agent transition state is Alert
+                 * Let's see if the host down or why we had this event
+                 */
+                s_logger.info("Investigating why host " + hostShortDesc + " has disconnected with event " + event);
 
                 Status determinedState = investigate(attache);
                 // if state cannot be determined do nothing and bail out
                 if (determinedState == null) {
                     if ((System.currentTimeMillis() >> 10) - host.getLastPinged() > AlertWait.value()) {
-                        s_logger.warn("Agent " + hostId + " state cannot be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
+                        s_logger.warn("State for " + hostShortDesc + " could not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
                         determinedState = Status.Alert;
                     } else {
-                        s_logger.warn("Agent " + hostId + " state cannot be determined, do nothing");
+                        s_logger.warn("State for " + hostShortDesc + " could not be determined, doing nothing");
                         return false;
                     }
                 }
 
                 final Status currentStatus = host.getStatus();
-                s_logger.info("The agent from host " + hostId + " state determined is " + determinedState);
+                s_logger.info("Status for " + hostShortDesc + " was " + currentStatus + ".  Investigation determined the current state is " + determinedState);
 
-                if (determinedState == Status.Down) {
-                    final String message = "Host is down: " + host.getId() + "-" + host.getName() + ". Starting HA on the VMs";
-                    s_logger.error(message);
-                    if (host.getType() != Host.Type.SecondaryStorage && host.getType() != Host.Type.ConsoleProxy) {
-                        _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(), host.getPodId(), "Host down, " + host.getId(), message);
-                    }
-                    event = Status.Event.HostDown;
-                } else if (determinedState == Status.Up) {
-                    /* Got ping response from host, bring it back */
-                    s_logger.info("Agent is determined to be up and running");
+                if (determinedState == Status.Up) {
+                    // Got ping response from host, bring it back
+                    s_logger.info(hostShortDesc + " is up again");
                     agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-                    return false;
                 } else if (determinedState == Status.Disconnected) {
-                    s_logger.warn("Agent is disconnected but the host is still up: " + host.getId() + "-" + host.getName());
 
 Review comment:
   Why must this statement be removed?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> VM HA issues
> ------------
>
>                 Key: CLOUDSTACK-10246
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10246
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Management Server
>    Affects Versions: 4.11.0.0
>         Environment: My setup is CentOS 7 Management server with 3 CentOS 7 
> KVM HVs, NFS as primary and secondary storages.
>            Reporter: Nux
>            Priority: Major
>
> VM HA fails to kick in when one of the hypervisors goes down.
> It even fails to restart the system VMs which remain down along with the 
> instances until the affected HV comes back online.
> When I crash or power off the HV the system marks it in the hosts list as 
> "Alert" or "Disconnected" respectively. It should get changed to "Down" after 
> that, but this never happens.
>  
> I have tried various combinations of setups (Adv, Basic), none succeeded.
>  
> My instances use HA enabled offerings.
> Management server DEBUG logs here:
> [http://tmp.nux.ro/CW4-vmhafail-411rc1.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
