[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073335#comment-14073335
 ] 

Tomasz Zieba commented on CLOUDSTACK-6036:
------------------------------------------

Hello,

After a little investigation, we have some facts. Error stopping the machine is 
always associated with an entry in the logs of the content:

2014-07-24 17:00:34,918 DEBUG [vm.dao.VMInstanceDaoImpl] (HA-Worker-2:work-15) 
Unable to update VM[User|Win-HD-tune]: DB Data={Host=null; State=Stopped; 
updated=55; time=Thu Jul 24 17:00:34 CEST 2014} New Data: {Host=2; 
State=Running; updated=55; time=Thu Jul 24 17:00:34 CEST 2014} Stale Data: 
{Host=2; State=Stopping; updated=54; time=Thu Jul 24 17:00:13 CEST 2014}

and then:

2014-07-24 17:00:34,919 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:work-15) Stop was unsuccessful.  Rescheduling
2014-07-24 17:00:34,919 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:work-15) Rescheduling HAWork[15-Stop-10-Stopping-Scheduled] to try 
again at Thu Jul 24 17:10:48 CEST 2014

After analyzing the logs of the database:

633 Query     UPDATE vm_instance SET vm_instance.update_time='2014-07-24 
15:00:34', vm_instance.update_count=vm_instance.update_count+1, 
vm_instance.last_host_id=2, vm_instance.state='Stopped', 
vm_instance.host_id=null, vm_instance.pod_id=1 WHERE vm_instance.id = 10  AND 
vm_instance.state = 'Stopping'  AND vm_instance.host_id = 2  AND 
vm_instance.update_count = 54
633 Query     UPDATE vm_instance SET vm_instance.update_time='2014-07-24 
15:00:34', vm_instance.update_count=vm_instance.update_count+1, 
vm_instance.state='Running', vm_instance.host_id=2, vm_instance.pod_id=1 WHERE 
vm_instance.id = 10  AND vm_instance.state = 'Stopping'  AND 
vm_instance.host_id = 2  AND vm_instance.update_count = 54

As can be seen ACS tries to update the entry twice but twice looking 
vm_instance.update_count field with a value of 54 
So the first update causes the counter vm_instance.update_count increased to 55 
so once the second SQL Update command executed incorrectly.

My guess is that the situation is related to a race condition in the HA-Worker 
threads, which by default is 5 

In the test environment, we try to change ha.workers value to 1 and we will see 
if the situation repeats itself.


>  CloudStack stops the machine for no reason
> -------------------------------------------
>
>                 Key: CLOUDSTACK-6036
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6036
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Management Server
>    Affects Versions: 4.2.1
>         Environment: ACS 4.2.1 after upgrade from 3.0.2
> ACS 4.2.1 clean install
> XCP 1.1
>            Reporter: Tomasz Zieba
>            Assignee: Koushik Das
>            Priority: Critical
>              Labels: 4.5ReviewNeeded
>             Fix For: 4.4.0
>
>         Attachments: management-server.log.2014-02-19.gz, 
> management-server.log.2014-02-20.gz, management-server.log.2014-02-24.txt
>
>
> After the upgrade from version 3.0.2 to 4.2.1 appeared very strange error 
> associated with self-stopping machine after changing the offering. 
> Steps to reproduce: 
> 1. Running instance of the machine 
> 2. Stop with the operating system 
> 3. Change offering of machine
> 4. Start the machine 
> 5. Waiting for 10 minutes 
> 6. CloudStack stops the machine for no reason
> 7. Restart the machine 
> In the logs you can find information:
> 2014-02-05 06:27:00,974 DEBUG [xen.resource.CitrixResourceBase] 
> (DirectAgent-316:null) 11. The VM i-41-824-VM is in Running state
> 2014-02-05 06:27:00,974 DEBUG [agent.transport.Request] 
> (DirectAgent-316:null) Seq 50-1756626952: Processing:  { Ans: , MgmtId: 
> 160544475005497, via: 50, Ver: v1, Flags: 10, 
> [{"com.cloud.agent.api.ClusterSyncAnswer":{"_clusterId":2,"_newStates":{"i-41-824-VM":{"t":"f32a4fee-b64e-4868-9c52-2a27e32d4c0e","u":"Running","v":"viridian:true;acpi:true;apic:true;pae:true;nx:false;timeoffset:0;cores-per-socket:4"}},"_isExecuted":false,"result":true,"wait":0}}]
>  }
> 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl] 
> (DirectAgent-316:null) VM i-41-824-VM: cs state = Running and realState = 
> Running
> 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl] 
> (DirectAgent-316:null) VM i-41-824-VM: cs state = Running and realState = 
> Running
> 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request] 
> (HA-Worker-1:work-1511) Seq 51-1374970375: Sending  { Cmd , MgmtId: 
> 160544475005497, via: 51, Ver: v1, Flags: 100111, 
> [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}]
>  }
> 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request] 
> (HA-Worker-1:work-1511) Seq 51-1374970375: Executing:  { Cmd , MgmtId: 
> 160544475005497, via: 51, Ver: v1, Flags: 100111, 
> [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}]
>  }
> 2014-02-05 06:36:01,383 DEBUG [xen.resource.CitrixResourceBase] 
> (DirectAgent-150:null) 9. The VM i-41-824-VM is in Stopping state
> 2014-02-05 06:36:27,625 DEBUG [xen.resource.CitrixResourceBase] 
> (DirectAgent-150:null) 10. The VM i-41-824-VM is in Stopped state
> You will notice that the stop of the machine corresponds to the component 
> HA-Worker. 
> Such operation after the upgrade is complicating work of users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to