[ https://issues.apache.org/jira/browse/CLOUDSTACK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073335#comment-14073335 ]
Tomasz Zieba commented on CLOUDSTACK-6036: ------------------------------------------ Hello, After a little investigation, we have some facts. Error stopping the machine is always associated with an entry in the logs of the content: 2014-07-24 17:00:34,918 DEBUG [vm.dao.VMInstanceDaoImpl] (HA-Worker-2:work-15) Unable to update VM[User|Win-HD-tune]: DB Data={Host=null; State=Stopped; updated=55; time=Thu Jul 24 17:00:34 CEST 2014} New Data: {Host=2; State=Running; updated=55; time=Thu Jul 24 17:00:34 CEST 2014} Stale Data: {Host=2; State=Stopping; updated=54; time=Thu Jul 24 17:00:13 CEST 2014} and then: 2014-07-24 17:00:34,919 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:work-15) Stop was unsuccessful. Rescheduling 2014-07-24 17:00:34,919 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:work-15) Rescheduling HAWork[15-Stop-10-Stopping-Scheduled] to try again at Thu Jul 24 17:10:48 CEST 2014 After analyzing the logs of the database: 633 Query UPDATE vm_instance SET vm_instance.update_time='2014-07-24 15:00:34', vm_instance.update_count=vm_instance.update_count+1, vm_instance.last_host_id=2, vm_instance.state='Stopped', vm_instance.host_id=null, vm_instance.pod_id=1 WHERE vm_instance.id = 10 AND vm_instance.state = 'Stopping' AND vm_instance.host_id = 2 AND vm_instance.update_count = 54 633 Query UPDATE vm_instance SET vm_instance.update_time='2014-07-24 15:00:34', vm_instance.update_count=vm_instance.update_count+1, vm_instance.state='Running', vm_instance.host_id=2, vm_instance.pod_id=1 WHERE vm_instance.id = 10 AND vm_instance.state = 'Stopping' AND vm_instance.host_id = 2 AND vm_instance.update_count = 54 As can be seen ACS tries to update the entry twice but twice looking vm_instance.update_count field with a value of 54 So the first update causes the counter vm_instance.update_count increased to 55 so once the second SQL Update command executed incorrectly. My guess is that the situation is related to a race condition in the HA-Worker threads, which by default is 5 In the test environment, we try to change ha.workers value to 1 and we will see if the situation repeats itself. > CloudStack stops the machine for no reason > ------------------------------------------- > > Key: CLOUDSTACK-6036 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6036 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Management Server > Affects Versions: 4.2.1 > Environment: ACS 4.2.1 after upgrade from 3.0.2 > ACS 4.2.1 clean install > XCP 1.1 > Reporter: Tomasz Zieba > Assignee: Koushik Das > Priority: Critical > Labels: 4.5ReviewNeeded > Fix For: 4.4.0 > > Attachments: management-server.log.2014-02-19.gz, > management-server.log.2014-02-20.gz, management-server.log.2014-02-24.txt > > > After the upgrade from version 3.0.2 to 4.2.1 appeared very strange error > associated with self-stopping machine after changing the offering. > Steps to reproduce: > 1. Running instance of the machine > 2. Stop with the operating system > 3. Change offering of machine > 4. Start the machine > 5. Waiting for 10 minutes > 6. CloudStack stops the machine for no reason > 7. Restart the machine > In the logs you can find information: > 2014-02-05 06:27:00,974 DEBUG [xen.resource.CitrixResourceBase] > (DirectAgent-316:null) 11. The VM i-41-824-VM is in Running state > 2014-02-05 06:27:00,974 DEBUG [agent.transport.Request] > (DirectAgent-316:null) Seq 50-1756626952: Processing: { Ans: , MgmtId: > 160544475005497, via: 50, Ver: v1, Flags: 10, > [{"com.cloud.agent.api.ClusterSyncAnswer":{"_clusterId":2,"_newStates":{"i-41-824-VM":{"t":"f32a4fee-b64e-4868-9c52-2a27e32d4c0e","u":"Running","v":"viridian:true;acpi:true;apic:true;pae:true;nx:false;timeoffset:0;cores-per-socket:4"}},"_isExecuted":false,"result":true,"wait":0}}] > } > 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl] > (DirectAgent-316:null) VM i-41-824-VM: cs state = Running and realState = > Running > 2014-02-05 06:27:00,981 DEBUG [cloud.vm.VirtualMachineManagerImpl] > (DirectAgent-316:null) VM i-41-824-VM: cs state = Running and realState = > Running > 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request] > (HA-Worker-1:work-1511) Seq 51-1374970375: Sending { Cmd , MgmtId: > 160544475005497, via: 51, Ver: v1, Flags: 100111, > [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}] > } > 2014-02-05 06:36:01,240 DEBUG [agent.transport.Request] > (HA-Worker-1:work-1511) Seq 51-1374970375: Executing: { Cmd , MgmtId: > 160544475005497, via: 51, Ver: v1, Flags: 100111, > [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":true,"vmName":"i-41-824-VM","wait":0}}] > } > 2014-02-05 06:36:01,383 DEBUG [xen.resource.CitrixResourceBase] > (DirectAgent-150:null) 9. The VM i-41-824-VM is in Stopping state > 2014-02-05 06:36:27,625 DEBUG [xen.resource.CitrixResourceBase] > (DirectAgent-150:null) 10. The VM i-41-824-VM is in Stopped state > You will notice that the stop of the machine corresponds to the component > HA-Worker. > Such operation after the upgrade is complicating work of users. -- This message was sent by Atlassian JIRA (v6.2#6252)