I found something odd, can you check the parameter called "wait", what value is it using ?
On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <rfarring...@remitdata.com > wrote: > Yes the parameter was set long ago and the management server has been > restarted numerous time over the past few days as we played with other > parameters to no effect. > > After looking at the log a little more does the "Failed to send command, > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands > 996939857 to Host 38 timed out after 7200" mean that the migration start > command is being sent in some kind of synchronous mode and not returning > control back to the job manager? > > > > > ________________________________________ > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > Sent: Monday, October 12, 2015 8:46 PM > To: users@cloudstack.apache.org > Subject: [Questionable] Re: Timeout with live migration > > I thought you using the command “migrateVirtualMachineWithVolume” but it > seems that you are using “migrateVolume” command from ACS's API. > > > For the code I debugged “migrateVirtualMachineWithVolume”, the parameter > 3600, means 1 hour of timeout. > > For the “migrateVolume” is the same, they both end up in > > “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”, > and in that method the parameter is the same. > > > If your parameter is set to 36000 (10 hours) I do not see why you are > getting the exception after 2 hours. > > Did you restart the management servers after you changed the parameter? > > On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington < > rfarring...@remitdata.com > > wrote: > > > Here is the full log, including the stack for the exception, that we get > > at the 2 hour mark. as for the migratewait it is set to 36000 which > should > > be 10 hours. > > > > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache] > > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request > > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting > some > > more time because this is the current command > > 2015-10-12 18:41:20,457 INFO [c.c.u.e.CSExceptionErrorCode] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception: > > com.cloud.exception.OperationTimedoutException in error code list for > > exceptions > > 2015-10-12 18:41:20,465 WARN [c.c.a.m.AgentAttache] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out > on > > Seq 38-996939857: { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7), > > Ver: v1, Flags: 100111, > > > [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}] > > } > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling. > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more > > commands found > > 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due > to > > Agent:38, com.cloud.exception.OperationTimedoutException: Commands > > 996939857 to Host 38 timed out after 7200 > > 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed > > com.cloud.utils.exception.CloudRuntimeException: Failed to send command, > > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands > > 996939857 to Host 38 timed out after 7200 > > at > > > org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116) > > at > > > org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382) > > at > > > org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421) > > at > > > org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70) > > at > > > org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931) > > at > > > com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680) > > at > > > com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666) > > at > > > com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622) > > at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown Source) > > at > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:622) > > at > > > org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317) > > at > > > org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183) > > at > > > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150) > > at > > > org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91) > > at > > > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) > > at > > > org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204) > > at com.sun.proxy.$Proxy196.migrateVolume(Unknown Source) > > at > > > org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd.execute(MigrateVolumeCmd.java:103) > > at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161) > > at > > > com.cloud.api.ApiAsyncJobDispatcher.runJobInContext(ApiAsyncJobDispatcher.java:109) > > at > > com.cloud.api.ApiAsyncJobDispatcher$1.run(ApiAsyncJobDispatcher.java:66) > > at > > > org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56) > > at > > > org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103) > > at > > > org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53) > > at > > com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:63) > > at > > > org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:509) > > at > > > org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) > > at > > > org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56) > > at > > > org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103) > > at > > > org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53) > > at > > > org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) > > at > > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > > at > > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) > > at > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > > at java.lang.Thread.run(Thread.java:701) > > 2015-10-12 18:41:20,479 WARN [o.a.c.s.d.ObjectInDataStoreManagerImpl] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Unsupported data object > > (VOLUME, > > org.apache.cloudstack.storage.datastore.PrimaryDataStoreImpl@4fa7a45f), > > no need to delete from object in store ref table > > 2015-10-12 18:41:20,479 DEBUG [c.c.s.VolumeApiServiceImpl] > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) migrate volume > > failed:com.cloud.utils.exception.CloudRuntimeException: Failed to send > > command, due to Agent:38, com.cloud.exception.OperationTimedoutException: > > Commands 996939857 to Host 38 timed out after 7200 > > 2015-10-12 18:41:20,480 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] > > (Job-Executor-63:ctx-f7b6817d) Complete async job-5257, jobStatus: > FAILED, > > resultCode: 530, result: > > > org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed > > to migrate volume"} > > 2015-10-12 18:41:20,486 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] > > (Job-Executor-63:ctx-f7b6817d) Done executing > > org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd for > job-5257 > > 2015-10-12 18:41:20,489 INFO [o.a.c.f.j.i.AsyncJobMonitor] > > (Job-Executor-63:ctx-f7b6817d) Remove job-5257 from job monitoring > > > > > > > > > > > > > > > > > > > > > > > > ________________________________________ > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > Sent: Monday, October 12, 2015 8:24 PM > > To: users@cloudstack.apache.org > > Subject: [Questionable] Re: [Questionable] Re: Timeout with live > migration > > > > Now I understand what you are doing, I am familiar with that concept > (live > > migration of VM within a cluster, having the VHD being moved from one SR > to > > another). > > > > I just got confused when I read live migration of volumes (a volume does > > not run by itself, so that why I asked a little for some more > information). > > > > Looking at the source code this is the variable used to control the > > timeout: > > "long timeout = (_migratewait) * 1000L;" > > > > The value of "_migratewait" is taken from this parameter: > > value = (String) params.get("migratewait"); > > _migratewait = NumbersUtil.parseInt(value, 3600); > > > > Therefore, the name of the parameter to be configured is "migratewait", > the > > default value is 3600. > > > > > > BTW1: I think that is a terrible parameter name. We should refactor that, > > could you open a Jira ticket for that? > > > > BTW2: that error message you posted does not seem to be related to the > > migration timeout; hence, in the code if the copy times out the message > > would be: > > "Async " + timeout/1000 + " seconds timeout for task " + task.toString()" > > > > Maybe because it throws a "Types.BadAsyncResult(msg)" and that might be > > translated into that message, or that might not be related to the problem > > itself, and you just thought that it was. > > > > > > Does it help you? > > > > > > On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington < > > rfarring...@remitdata.com > > > wrote: > > > > > Hypervisor: XenServer > > > > > > We are moving a data volume from one storage onto another without > > shutting > > > down the VM cause that would just be silly and a triplication of effort > > > with the whole copying to secondary storage and then back off again. > The > > > volume is staying in the same cluster just moving to a different > Primary > > > storage (or SR in the XenServer vernacular) > > > > > > If you are familiar with ESX this is a "Storage VMotion" where as in > > > XenServer it is called "Storage XenMotion". > > > > > > ________________________________________ > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > > Sent: Monday, October 12, 2015 7:53 PM > > > To: users@cloudstack.apache.org > > > Subject: [Questionable] Re: Timeout with live migration > > > > > > what do you mean with livre migrating data volume ?! > > > I understand a live migration of a VM, but volumes... > > > > > > do you mean live migrating a VM that has a volume attached? > > > are you migrating that volume to a different cluster? or just a > different > > > storage in the same cluster? > > > What hypervisor are you using ? > > > > > > > > > On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington < > > > rfarring...@remitdata.com> > > > wrote: > > > > > > > Live migrating a data volume. We are purely on shared storage so no > > local > > > > storage is involved. > > > > > > > > ________________________________________ > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > > > Sent: Monday, October 12, 2015 7:37 PM > > > > To: users@cloudstack.apache.org > > > > Subject: [Questionable] Re: Timeout with live migration > > > > > > > > Are you live migrating a VM, or migrating a volume of a stopped VM > to a > > > > different primary storage? > > > > > > > > If it is a running VM, is the VM allocated in a shared storage or > local > > > > storage? > > > > > > > > On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington < > > > > rfarring...@remitdata.com> > > > > wrote: > > > > > > > > > The slow transfer is related to the storage we are trying to > migrate > > > off > > > > > of. We are capable of getting about 350mbps off the disks but when > > we > > > > are > > > > > moving volumes that are greater than about 500GB we end up racing > the > > > > clock > > > > > and hoping that the migration finishes before the job times out. > It > > > > would > > > > > be awesome to be able to manage that timeout and I know there are a > > ton > > > > of > > > > > settings I just don't know about and am hoping someone might be > able > > to > > > > > point me in the right direction. > > > > > > > > > > > > > > > ________________________________________ > > > > > From: Rafael Weingärtner [rafaelweingart...@gmail.com] > > > > > Sent: Monday, October 12, 2015 6:40 PM > > > > > To: users@cloudstack.apache.org > > > > > Subject: [Questionable] Re: Timeout with live migration > > > > > > > > > > I would first check your NICs' speed and load, the amount of RAM > > > > allocated > > > > > for the migrating VM and than check the hypervisor log files. > > > > > > > > > > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård < > > > > > jan.arve.nyg...@gmail.com> > > > > > wrote: > > > > > > > > > > > What version are you running? Check if the copy.volume.wait > setting > > > is > > > > > set > > > > > > to 7200 and increase it. If not you could also check > > > > > > job.cancel.threshold.minutes and job.expire.minutes. > > > > > > > > > > > > -Jan-Arve > > > > > > > > > > > > 2015-10-13 0:46 GMT+02:00 Ryan Farrington < > > rfarring...@remitdata.com > > > >: > > > > > > > > > > > > > We are experiencing a failure in cloudstack waiting for an > async > > > job > > > > > > > performing a live migration of a volume to finish. I've copied > > the > > > > > > relevant > > > > > > > log entries below.We acknowledge that the migration will take a > > few > > > > > hours > > > > > > > based on the volume of the data and we are looking for a way to > > > > > increase > > > > > > > the timeout of 7200 seconds into something we know we can work > > > with. > > > > > > > > > > > > > > > > > > > > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint] > > > > > > > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send > > command, > > > > due > > > > > > to > > > > > > > Agent:27, com.cloud.exception.OperationTimedoutException: > > Commands > > > > > > > 835325398 to Host 27 timed out after 7200 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Rafael Weingärtner > > > > > > > > > > > > > > > > > > > > > -- > > > > Rafael Weingärtner > > > > > > > > > > > > > > > > -- > > > Rafael Weingärtner > > > > > > > > > > > -- > > Rafael Weingärtner > > > > > > -- > Rafael Weingärtner > -- Rafael Weingärtner