Re: [Questionable] Re: Timeout with live migration
Hi Ryan, we hit the same problem few days ago. After we changed parameters: migratewait, storage.pool.max.waitseconds, vm.op.cancel.interval and vm.op.cleanup.wait to 36000 (10h) we can migrate large volumes (500GB and more). We use xenserver 6.5 and ACS 4.5.1 Regards, Kuba W dniu 2015-10-13 o 05:15, Ryan Farrington pisze: Yes i can open JIRA tickets. What would you like for me to do? I'll be happy to change the "wait" parameter. Do I assume it should be 1/2 of the value i want it to be? From: Rafael Weingärtner [rafaelweingart...@gmail.com] Sent: Monday, October 12, 2015 10:12 PM To: users@cloudstack.apache.org Subject: [Questionable] Re: Timeout with live migration There is your problem, there are currently two distinct values conrolling those async jobs. Change that value and everything will work for u. Can you open a jira ticket? On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington <rfarring...@remitdata.com wrote: wait is currently configured to be 3600 From: Rafael Weingärtner [rafaelweingart...@gmail.com] Sent: Monday, October 12, 2015 9:46 PM To: users@cloudstack.apache.org Subject: [Questionable] Re: Timeout with live migration I found something odd, can you check the parameter called "wait", what value is it using ? On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington < rfarring...@remitdata.com wrote: Yes the parameter was set long ago and the management server has been restarted numerous time over the past few days as we played with other parameters to no effect. After looking at the log a little more does the "Failed to send command, due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to Host 38 timed out after 7200" mean that the migration start command is being sent in some kind of synchronous mode and not returning control back to the job manager? From: Rafael Weingärtner [rafaelweingart...@gmail.com] Sent: Monday, October 12, 2015 8:46 PM To: users@cloudstack.apache.org Subject: [Questionable] Re: Timeout with live migration I thought you using the command “migrateVirtualMachineWithVolume” but it seems that you are using “migrateVolume” command from ACS's API. For the code I debugged “migrateVirtualMachineWithVolume”, the parameter 3600, means 1 hour of timeout. For the “migrateVolume” is the same, they both end up in “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”, and in that method the parameter is the same. If your parameter is set to 36000 (10 hours) I do not see why you are getting the exception after 2 hours. Did you restart the management servers after you changed the parameter? On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington < rfarring...@remitdata.com wrote: Here is the full log, including the stack for the exception, that we get at the 2 hour mark. as for the migratewait it is set to 36000 which should be 10 hours. 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting some more time because this is the current command 2015-10-12 18:41:20,457 INFO [c.c.u.e.CSExceptionErrorCode] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions 2015-10-12 18:41:20,465 WARN [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out on Seq 38-996939857: { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7), Ver: v1, Flags: 100111, [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}] } 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling. 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more commands found 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to Host 38 timed out after 7200 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy] (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed com.cloud.utils.exception.CloudRuntimeException: Failed to send command, d
Re: [Questionable] Re: Timeout with live migration
I guess so, for some reason that I do not understand, the code is multiplying the value from that parameter by 2, something like 18000 should do the tricky On Tue, Oct 13, 2015 at 12:15 AM, Ryan Farrington <rfarring...@remitdata.com > wrote: > Yes i can open JIRA tickets. What would you like for me to do? > > I'll be happy to change the "wait" parameter. Do I assume it should be > 1/2 of the value i want it to be? > > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > Sent: Monday, October 12, 2015 10:12 PM > To: users@cloudstack.apache.org > Subject: [Questionable] Re: Timeout with live migration > > There is your problem, there are currently two distinct values conrolling > those async jobs. > Change that value and everything will work for u. > Can you open a jira ticket? > > On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington < > rfarring...@remitdata.com > > wrote: > > > wait is currently configured to be 3600 > > > > > > > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > Sent: Monday, October 12, 2015 9:46 PM > > To: users@cloudstack.apache.org > > Subject: [Questionable] Re: Timeout with live migration > > > > I found something odd, > > can you check the parameter called "wait", what value is it using ? > > > > On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington < > > rfarring...@remitdata.com > > > wrote: > > > > > Yes the parameter was set long ago and the management server has been > > > restarted numerous time over the past few days as we played with other > > > parameters to no effect. > > > > > > After looking at the log a little more does the "Failed to send > command, > > > due to Agent:38, com.cloud.exception.OperationTimedoutException: > Commands > > > 996939857 to Host 38 timed out after 7200" mean that the migration > start > > > command is being sent in some kind of synchronous mode and not > returning > > > control back to the job manager? > > > > > > > > > > > > > > > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > > Sent: Monday, October 12, 2015 8:46 PM > > > To: users@cloudstack.apache.org > > > Subject: [Questionable] Re: Timeout with live migration > > > > > > I thought you using the command “migrateVirtualMachineWithVolume” but > it > > > seems that you are using “migrateVolume” command from ACS's API. > > > > > > > > > For the code I debugged “migrateVirtualMachineWithVolume”, the > parameter > > > 3600, means 1 hour of timeout. > > > > > > For the “migrateVolume” is the same, they both end up in > > > > > > > > > “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”, > > > and in that method the parameter is the same. > > > > > > > > > If your parameter is set to 36000 (10 hours) I do not see why you are > > > getting the exception after 2 hours. > > > > > > Did you restart the management servers after you changed the parameter? > > > > > > On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington < > > > rfarring...@remitdata.com > > > > wrote: > > > > > > > Here is the full log, including the stack for the exception, that we > > get > > > > at the 2 hour mark. as for the migratewait it is set to 36000 which > > > should > > > > be 10 hours. > > > > > > > > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache] > > > > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request > > > > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache] > > > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting > > > some > > > > more time because this is the current command > > > > 2015-10-12 18:41:20,457 INFO [c.c.u.e.CSExceptionErrorCode] > > > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception: > > > > com.cloud.exception.OperationTimedoutException in error code list for > > > > exceptions > > > > 2015-10-12 18:41:20,465 WARN [c.c.a.m.AgentAttache] > > > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed > out > > > on > > > > Seq 38-996939857: { Cmd , MgmtId: 42756806312036, via:
RE: [Questionable] Re: Timeout with live migration
Hypervisor: XenServer We are moving a data volume from one storage onto another without shutting down the VM cause that would just be silly and a triplication of effort with the whole copying to secondary storage and then back off again. The volume is staying in the same cluster just moving to a different Primary storage (or SR in the XenServer vernacular) If you are familiar with ESX this is a "Storage VMotion" where as in XenServer it is called "Storage XenMotion". From: Rafael Weingärtner [rafaelweingart...@gmail.com] Sent: Monday, October 12, 2015 7:53 PM To: users@cloudstack.apache.org Subject: [Questionable] Re: Timeout with live migration what do you mean with livre migrating data volume ?! I understand a live migration of a VM, but volumes... do you mean live migrating a VM that has a volume attached? are you migrating that volume to a different cluster? or just a different storage in the same cluster? What hypervisor are you using ? On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington <rfarring...@remitdata.com> wrote: > Live migrating a data volume. We are purely on shared storage so no local > storage is involved. > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > Sent: Monday, October 12, 2015 7:37 PM > To: users@cloudstack.apache.org > Subject: [Questionable] Re: Timeout with live migration > > Are you live migrating a VM, or migrating a volume of a stopped VM to a > different primary storage? > > If it is a running VM, is the VM allocated in a shared storage or local > storage? > > On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington < > rfarring...@remitdata.com> > wrote: > > > The slow transfer is related to the storage we are trying to migrate off > > of. We are capable of getting about 350mbps off the disks but when we > are > > moving volumes that are greater than about 500GB we end up racing the > clock > > and hoping that the migration finishes before the job times out. It > would > > be awesome to be able to manage that timeout and I know there are a ton > of > > settings I just don't know about and am hoping someone might be able to > > point me in the right direction. > > > > > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > Sent: Monday, October 12, 2015 6:40 PM > > To: users@cloudstack.apache.org > > Subject: [Questionable] Re: Timeout with live migration > > > > I would first check your NICs' speed and load, the amount of RAM > allocated > > for the migrating VM and than check the hypervisor log files. > > > > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård < > > jan.arve.nyg...@gmail.com> > > wrote: > > > > > What version are you running? Check if the copy.volume.wait setting is > > set > > > to 7200 and increase it. If not you could also check > > > job.cancel.threshold.minutes and job.expire.minutes. > > > > > > -Jan-Arve > > > > > > 2015-10-13 0:46 GMT+02:00 Ryan Farrington <rfarring...@remitdata.com>: > > > > > > > We are experiencing a failure in cloudstack waiting for an async job > > > > performing a live migration of a volume to finish. I've copied the > > > relevant > > > > log entries below.We acknowledge that the migration will take a few > > hours > > > > based on the volume of the data and we are looking for a way to > > increase > > > > the timeout of 7200 seconds into something we know we can work with. > > > > > > > > > > > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint] > > > > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, > due > > > to > > > > Agent:27, com.cloud.exception.OperationTimedoutException: Commands > > > > 835325398 to Host 27 timed out after 7200 > > > > > > > > > > > > > > > > > > > > > > > -- > > Rafael Weingärtner > > > > > > -- > Rafael Weingärtner > -- Rafael Weingärtner
Re: [Questionable] Re: Timeout with live migration
Now I understand what you are doing, I am familiar with that concept (live migration of VM within a cluster, having the VHD being moved from one SR to another). I just got confused when I read live migration of volumes (a volume does not run by itself, so that why I asked a little for some more information). Looking at the source code this is the variable used to control the timeout: "long timeout = (_migratewait) * 1000L;" The value of "_migratewait" is taken from this parameter: value = (String) params.get("migratewait"); _migratewait = NumbersUtil.parseInt(value, 3600); Therefore, the name of the parameter to be configured is "migratewait", the default value is 3600. BTW1: I think that is a terrible parameter name. We should refactor that, could you open a Jira ticket for that? BTW2: that error message you posted does not seem to be related to the migration timeout; hence, in the code if the copy times out the message would be: "Async " + timeout/1000 + " seconds timeout for task " + task.toString()" Maybe because it throws a "Types.BadAsyncResult(msg)" and that might be translated into that message, or that might not be related to the problem itself, and you just thought that it was. Does it help you? On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington <rfarring...@remitdata.com > wrote: > Hypervisor: XenServer > > We are moving a data volume from one storage onto another without shutting > down the VM cause that would just be silly and a triplication of effort > with the whole copying to secondary storage and then back off again. The > volume is staying in the same cluster just moving to a different Primary > storage (or SR in the XenServer vernacular) > > If you are familiar with ESX this is a "Storage VMotion" where as in > XenServer it is called "Storage XenMotion". > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > Sent: Monday, October 12, 2015 7:53 PM > To: users@cloudstack.apache.org > Subject: [Questionable] Re: Timeout with live migration > > what do you mean with livre migrating data volume ?! > I understand a live migration of a VM, but volumes... > > do you mean live migrating a VM that has a volume attached? > are you migrating that volume to a different cluster? or just a different > storage in the same cluster? > What hypervisor are you using ? > > > On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington < > rfarring...@remitdata.com> > wrote: > > > Live migrating a data volume. We are purely on shared storage so no local > > storage is involved. > > > > ____ > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > Sent: Monday, October 12, 2015 7:37 PM > > To: users@cloudstack.apache.org > > Subject: [Questionable] Re: Timeout with live migration > > > > Are you live migrating a VM, or migrating a volume of a stopped VM to a > > different primary storage? > > > > If it is a running VM, is the VM allocated in a shared storage or local > > storage? > > > > On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington < > > rfarring...@remitdata.com> > > wrote: > > > > > The slow transfer is related to the storage we are trying to migrate > off > > > of. We are capable of getting about 350mbps off the disks but when we > > are > > > moving volumes that are greater than about 500GB we end up racing the > > clock > > > and hoping that the migration finishes before the job times out. It > > would > > > be awesome to be able to manage that timeout and I know there are a ton > > of > > > settings I just don't know about and am hoping someone might be able to > > > point me in the right direction. > > > > > > > > > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > > Sent: Monday, October 12, 2015 6:40 PM > > > To: users@cloudstack.apache.org > > > Subject: [Questionable] Re: Timeout with live migration > > > > > > I would first check your NICs' speed and load, the amount of RAM > > allocated > > > for the migrating VM and than check the hypervisor log files. > > > > > > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård < > > > jan.arve.nyg...@gmail.com> > > > wrote: > > > > > > > What version are you running? Check if the copy.volume.wait setting > is > > > set > > > > to 7200 and increase it. If not you could also check > > > > j
RE: [Questionable] Re: Timeout with live migration
Yes i can open JIRA tickets. What would you like for me to do? I'll be happy to change the "wait" parameter. Do I assume it should be 1/2 of the value i want it to be? From: Rafael Weingärtner [rafaelweingart...@gmail.com] Sent: Monday, October 12, 2015 10:12 PM To: users@cloudstack.apache.org Subject: [Questionable] Re: Timeout with live migration There is your problem, there are currently two distinct values conrolling those async jobs. Change that value and everything will work for u. Can you open a jira ticket? On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington <rfarring...@remitdata.com > wrote: > wait is currently configured to be 3600 > > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > Sent: Monday, October 12, 2015 9:46 PM > To: users@cloudstack.apache.org > Subject: [Questionable] Re: Timeout with live migration > > I found something odd, > can you check the parameter called "wait", what value is it using ? > > On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington < > rfarring...@remitdata.com > > wrote: > > > Yes the parameter was set long ago and the management server has been > > restarted numerous time over the past few days as we played with other > > parameters to no effect. > > > > After looking at the log a little more does the "Failed to send command, > > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands > > 996939857 to Host 38 timed out after 7200" mean that the migration start > > command is being sent in some kind of synchronous mode and not returning > > control back to the job manager? > > > > > > > > > > ________ > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > Sent: Monday, October 12, 2015 8:46 PM > > To: users@cloudstack.apache.org > > Subject: [Questionable] Re: Timeout with live migration > > > > I thought you using the command “migrateVirtualMachineWithVolume” but it > > seems that you are using “migrateVolume” command from ACS's API. > > > > > > For the code I debugged “migrateVirtualMachineWithVolume”, the parameter > > 3600, means 1 hour of timeout. > > > > For the “migrateVolume” is the same, they both end up in > > > > > “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”, > > and in that method the parameter is the same. > > > > > > If your parameter is set to 36000 (10 hours) I do not see why you are > > getting the exception after 2 hours. > > > > Did you restart the management servers after you changed the parameter? > > > > On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington < > > rfarring...@remitdata.com > > > wrote: > > > > > Here is the full log, including the stack for the exception, that we > get > > > at the 2 hour mark. as for the migratewait it is set to 36000 which > > should > > > be 10 hours. > > > > > > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache] > > > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request > > > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache] > > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting > > some > > > more time because this is the current command > > > 2015-10-12 18:41:20,457 INFO [c.c.u.e.CSExceptionErrorCode] > > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception: > > > com.cloud.exception.OperationTimedoutException in error code list for > > > exceptions > > > 2015-10-12 18:41:20,465 WARN [c.c.a.m.AgentAttache] > > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out > > on > > > Seq 38-996939857: { Cmd , MgmtId: 42756806312036, via: > 38(xen-nc-bc2b7), > > > Ver: v1, Flags: 100111, > > > > > > [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}] > > > } > > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] > > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: > Cancelling. > > > 2015-