[jira] [Commented] (CLOUDSTACK-9458) Some VMs are being stopped when agent is reconnecting

ASF GitHub Bot (JIRA) Thu, 18 Aug 2016 01:53:45 -0700

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426116#comment-15426116 ]


ASF GitHub Bot commented on CLOUDSTACK-9458:
--------------------------------------------

Github user marcaurele commented on the issue:

    https://github.com/apache/cloudstack/pull/1640
  
    I understand your point about the release, but we're not in an ideal world 
where everyone runs the latest version. I do my best to look at the current 
code in CS to find possible fixes for any bug or problem we encounter, or for 
changes we want to make in our version. I want us to get back to the master 
version, but that's not the topic here, nor is it going to happen in the next 
few weeks.
    
    Point 2 does not make sense to me. If the management server cannot 
determine the state of the VMs, it could mark them as stopped (*even though I 
don't think it should*). But it should not create a StopVM job, because that 
might trigger an actual stop of the VM if the agent has reconnected by the 
time the job is picked up by the async job workers.
    If the VM is really down because the host has crashed, then the command is 
pointless, and from a customer's point of view it would not make a difference. 
If the host is still up and fine, but we have a network glitch, then 
requesting a stop of the VM is really bad from a customer's point of view. By 
doing nothing, that is, not requesting a stop, we would end up in a better 
situation.
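    
    The two cases above can be laid out as a small decision table. What 
follows is only a sketch under stated assumptions: `HostReality`, `Policy` and 
`vmRunningAfterReconnect` are hypothetical names made up for illustration, not 
CloudStack classes, and the model assumes that a queued stop job is always 
delivered once the agent is back.

```java
/** Hypothetical model, not CloudStack code: fate of a VM once the agent is reachable again. */
final class StopJobDecisionSketch {

    /** What really happened on the host while the agent was unreachable (assumed cases). */
    enum HostReality { HOST_CRASHED, NETWORK_GLITCH }

    /** What the management server does while it cannot determine the VM state. */
    enum Policy { QUEUE_STOP_JOB, DO_NOTHING }

    /** Returns true if the VM is still running after any queued job has been delivered. */
    static boolean vmRunningAfterReconnect(HostReality reality, Policy policy) {
        if (reality == HostReality.HOST_CRASHED) {
            // The VM is already gone; a stop command changes nothing.
            return false;
        }
        // Network glitch: the VM kept running, so only a delivered stop job takes it down.
        return policy != Policy.QUEUE_STOP_JOB;
    }

    public static void main(String[] args) {
        for (HostReality reality : HostReality.values()) {
            for (Policy policy : Policy.values()) {
                System.out.println(reality + " + " + policy
                        + " -> running=" + vmRunningAfterReconnect(reality, policy));
            }
        }
    }
}
```

    In this model the policy only matters in the network-glitch case, and 
there queuing a stop job is the only choice that takes a healthy VM down.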
    
    Going back to which state should be set on the VM when the management 
server cannot determine it: assuming that the VM is stopped because the 
management server cannot reach the agent is just as incorrect as leaving the 
state as it is (running, migrating, creating...). I'd rather create a new 
state `UNKNOWN` for this special case, when the management server really does 
not know. From a management point of view, it would also be easier to see that 
there are *ghost* VMs somewhere for which the management server cannot 
determine the exact state, and that a proper (*manual*) investigation should 
be done if the state stays like this, including for the billing part.
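    
    To make the `UNKNOWN` proposal concrete, here is a minimal sketch. It 
is not CloudStack's actual state machine: the `VmState` and `PowerReport` 
enums, the `resolve` and `needsManualInvestigation` helpers, and the 
minute-based threshold are all assumptions introduced for illustration.

```java
import java.util.Optional;

/** Hypothetical model of the proposed UNKNOWN state; not CloudStack's state machine. */
final class UnknownStateSketch {

    enum VmState { RUNNING, STOPPED, MIGRATING, CREATING, UNKNOWN }

    /** Power status as reported by an agent that can be reached (assumed shape). */
    enum PowerReport { POWER_ON, POWER_OFF }

    /** Picks the state to record; an absent report means the agent is unreachable. */
    static VmState resolve(VmState lastKnown, Optional<PowerReport> report) {
        if (!report.isPresent()) {
            // Neither "stopped" nor the last known state can be trusted here.
            return VmState.UNKNOWN;
        }
        if (report.get() == PowerReport.POWER_OFF) {
            return VmState.STOPPED;
        }
        // Powered on: keep a transitional state, otherwise record the VM as running.
        boolean transitional = lastKnown == VmState.MIGRATING || lastKnown == VmState.CREATING;
        return transitional ? lastKnown : VmState.RUNNING;
    }

    /** Flags "ghost" VMs that stayed UNKNOWN for at least the given threshold. */
    static boolean needsManualInvestigation(VmState state, long minutesInState, long thresholdMinutes) {
        return state == VmState.UNKNOWN && minutesInState >= thresholdMinutes;
    }

    public static void main(String[] args) {
        VmState state = resolve(VmState.RUNNING, Optional.empty());
        System.out.println("agent unreachable -> " + state);
        System.out.println("investigate after 30 min? " + needsManualInvestigation(state, 30, 15));
        System.out.println("agent back, power on -> "
                + resolve(state, Optional.of(PowerReport.POWER_ON)));
    }
}
```

    The point of the sketch is that the recorded state only changes to 
something definite once a power report is available again.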


> Some VMs are being stopped when agent is reconnecting
> -----------------------------------------------------
>
>                 Key: CLOUDSTACK-9458
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-9458
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>            Reporter: Marc-Aurèle Brothier
>            Assignee: Marc-Aurèle Brothier
>
> If you lose the communication between the management server and one of the 
> agents for a few minutes, even though HA mode is not active, the 
> HighAvailabilityManager kicks in and starts to schedule VM restarts. Those 
> tasks are inserted as async jobs in the DB, and if the agent comes back 
> online while the jobs are still in the async table, they are pushed 
> to the agent and shut down the VMs. Then, since HA is not active, the VMs are 
> not restarted.
> The expected behavior, in my opinion, is that the VMs should not be restarted 
> at all if HA mode is not active on them, and that the agent should update the 
> VM state with the power report.
> The bug lies in 
> {{HighAvailabilityManagerImpl.scheduleRestartForVmsOnHost(final HostVO host, 
> boolean investigate)}}, PR will follow.
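
The expected behavior described in the issue above can be sketched as a 
simple filter. This is not the actual CloudStack code or the fix from the 
pull request: the `Vm` class, its `haEnabled` flag and the 
`vmsToScheduleForRestart` helper are hypothetical stand-ins used only to 
illustrate the idea of skipping VMs that do not have HA enabled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: only HA-enabled VMs are considered for a restart. */
final class HaRestartFilterSketch {

    /** Minimal stand-in for a VM record; not CloudStack's VM type. */
    static final class Vm {
        final String name;
        final boolean haEnabled;

        Vm(String name, boolean haEnabled) {
            this.name = name;
            this.haEnabled = haEnabled;
        }
    }

    /** Returns the names of the VMs on a disconnected host that may be scheduled for restart. */
    static List<String> vmsToScheduleForRestart(List<Vm> vmsOnHost) {
        List<String> scheduled = new ArrayList<>();
        for (Vm vm : vmsOnHost) {
            if (vm.haEnabled) {
                scheduled.add(vm.name);
            }
            // VMs without HA are left alone; their state is expected to come
            // from the agent's power report once it reconnects.
        }
        return scheduled;
    }

    public static void main(String[] args) {
        List<Vm> vms = Arrays.asList(new Vm("vm-a", false), new Vm("vm-b", true));
        System.out.println("scheduled for restart: " + vmsToScheduleForRestart(vms));
    }
}
```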



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
