Re: Timeout with live migration

Rafael Weingärtner Mon, 12 Oct 2015 19:48:12 -0700

I found something odd,
can you check the parameter called "wait", what value is it using ?


On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <[email protected]
> wrote:

> Yes the parameter was set long ago and the management server has been
> restarted numerous time over the past few days as we played with other
> parameters to no effect.
>
> After looking at the log a little more does the "Failed to send command,
> due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> 996939857 to Host 38 timed out after 7200" mean that the migration start
> command is being sent in some kind of synchronous mode and not returning
> control back to the job manager?
>
>
>
>
> ________________________________________
> From: Rafael Weingärtner [[email protected]]
> Sent: Monday, October 12, 2015 8:46 PM
> To: [email protected]
> Subject: [Questionable]  Re: Timeout with live migration
>
> I thought you using the command  “migrateVirtualMachineWithVolume” but it
> seems that you are using “migrateVolume” command from ACS's API.
>
>
> For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
> 3600, means 1 hour of timeout.
>
> For the “migrateVolume” is the same, they both end up in
>
> “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
> and in that method the parameter is the same.
>
>
> If your parameter is set to 36000 (10 hours) I do not see why you are
> getting the exception after 2 hours.
>
> Did you restart the management servers after you changed the parameter?
>
> On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
> [email protected]
> > wrote:
>
> > Here is the full log, including the stack for the exception, that we get
> > at the 2 hour mark. as for the migratewait it is set to 36000 which
> should
> > be 10 hours.
> >
> > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
> > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
> > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting
> some
> > more time because this is the current command
> > 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
> > com.cloud.exception.OperationTimedoutException in error code list for
> > exceptions
> > 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out
> on
> > Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7),
> > Ver: v1, Flags: 100111,
> >
> [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
> > }
> > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling.
> > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
> > commands found
> > 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due
> to
> > Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200
> > 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
> > com.cloud.utils.exception.CloudRuntimeException: Failed to send command,
> > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200
> >         at
> >
> org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
> >         at
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
> >         at
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
> >         at
> >
> org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
> >         at
> >
> org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
> >         at
> >
> com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
> >         at
> >
> com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
> >         at
> >
> com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)
> >         at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown Source)
> >         at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >         at java.lang.reflect.Method.invoke(Method.java:622)
> >         at
> >
> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
> >         at
> >
> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
> >         at
> >
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
> >         at
> >
> org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
> >         at
> >
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
> >         at
> >
> org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
> >         at com.sun.proxy.$Proxy196.migrateVolume(Unknown Source)
> >         at
> >
> org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd.execute(MigrateVolumeCmd.java:103)
> >         at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
> >         at
> >
> com.cloud.api.ApiAsyncJobDispatcher.runJobInContext(ApiAsyncJobDispatcher.java:109)
> >         at
> > com.cloud.api.ApiAsyncJobDispatcher$1.run(ApiAsyncJobDispatcher.java:66)
> >         at
> >
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> >         at
> >
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> >         at
> >
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> >         at
> > com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:63)
> >         at
> >
> org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:509)
> >         at
> >
> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
> >         at
> >
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> >         at
> >
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> >         at
> >
> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> >         at
> >
> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         at
> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:701)
> > 2015-10-12 18:41:20,479 WARN  [o.a.c.s.d.ObjectInDataStoreManagerImpl]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Unsupported data object
> > (VOLUME,
> > org.apache.cloudstack.storage.datastore.PrimaryDataStoreImpl@4fa7a45f),
> > no need to delete from object in store ref table
> > 2015-10-12 18:41:20,479 DEBUG [c.c.s.VolumeApiServiceImpl]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) migrate volume
> > failed:com.cloud.utils.exception.CloudRuntimeException: Failed to send
> > command, due to Agent:38, com.cloud.exception.OperationTimedoutException:
> > Commands 996939857 to Host 38 timed out after 7200
> > 2015-10-12 18:41:20,480 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
> > (Job-Executor-63:ctx-f7b6817d) Complete async job-5257, jobStatus:
> FAILED,
> > resultCode: 530, result:
> >
> org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed
> > to migrate volume"}
> > 2015-10-12 18:41:20,486 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
> > (Job-Executor-63:ctx-f7b6817d) Done executing
> > org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd for
> job-5257
> > 2015-10-12 18:41:20,489 INFO  [o.a.c.f.j.i.AsyncJobMonitor]
> > (Job-Executor-63:ctx-f7b6817d) Remove job-5257 from job monitoring
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Rafael Weingärtner [[email protected]]
> > Sent: Monday, October 12, 2015 8:24 PM
> > To: [email protected]
> > Subject: [Questionable]  Re: [Questionable] Re: Timeout with live
> migration
> >
> > Now I understand what you are doing, I am familiar with that concept
> (live
> > migration of VM within a cluster, having the VHD being moved from one SR
> to
> > another).
> >
> > I just got confused when I read live migration of volumes (a volume does
> > not run by itself, so that why I asked a little for some more
> information).
> >
> > Looking at the source code this is the variable used to control the
> > timeout:
> > "long timeout = (_migratewait) * 1000L;"
> >
> > The value of "_migratewait" is taken from this parameter:
> > value = (String) params.get("migratewait");
> > _migratewait = NumbersUtil.parseInt(value, 3600);
> >
> > Therefore, the name of the parameter to be configured is "migratewait",
> the
> > default value is 3600.
> >
> >
> > BTW1: I think that is a terrible parameter name. We should refactor that,
> > could you open a Jira ticket for that?
> >
> > BTW2: that error message you posted does not seem to be related to the
> > migration timeout; hence, in the code if the copy times out the message
> > would be:
> > "Async " + timeout/1000 + " seconds timeout for task " + task.toString()"
> >
> > Maybe because it throws a "Types.BadAsyncResult(msg)" and that might be
> > translated into that message, or that might not be related to the problem
> > itself, and you just thought that it was.
> >
> >
> > Does it help you?
> >
> >
> > On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington <
> > [email protected]
> > > wrote:
> >
> > > Hypervisor:  XenServer
> > >
> > > We are moving a data volume from one storage onto another without
> > shutting
> > > down the VM cause that would just be silly and a triplication of effort
> > > with the whole copying to secondary storage and then back off again.
> The
> > > volume is staying in the same cluster just moving to a different
> Primary
> > > storage (or SR in the XenServer vernacular)
> > >
> > > If you are familiar with ESX this is a "Storage VMotion" where as in
> > > XenServer it is called "Storage XenMotion".
> > >
> > > ________________________________________
> > > From: Rafael Weingärtner [[email protected]]
> > > Sent: Monday, October 12, 2015 7:53 PM
> > > To: [email protected]
> > > Subject: [Questionable]  Re: Timeout with live migration
> > >
> > > what do you mean with livre migrating data volume ?!
> > > I understand a live migration of a VM, but volumes...
> > >
> > > do you mean live migrating a VM that has a volume attached?
> > > are you migrating that volume to a different cluster? or just a
> different
> > > storage in the same cluster?
> > > What hypervisor are you using ?
> > >
> > >
> > > On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington <
> > > [email protected]>
> > > wrote:
> > >
> > > > Live migrating a data volume. We are purely on shared storage so no
> > local
> > > > storage is involved.
> > > >
> > > > ________________________________________
> > > > From: Rafael Weingärtner [[email protected]]
> > > > Sent: Monday, October 12, 2015 7:37 PM
> > > > To: [email protected]
> > > > Subject: [Questionable]  Re: Timeout with live migration
> > > >
> > > > Are you live migrating a VM, or migrating a volume of a stopped VM
> to a
> > > > different primary storage?
> > > >
> > > > If it is a running VM, is the VM allocated in a shared storage or
> local
> > > > storage?
> > > >
> > > > On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <
> > > > [email protected]>
> > > > wrote:
> > > >
> > > > > The slow transfer is related to the storage we are trying to
> migrate
> > > off
> > > > > of.  We are capable of getting about 350mbps off the disks but when
> > we
> > > > are
> > > > > moving volumes that are greater than about 500GB we end up racing
> the
> > > > clock
> > > > > and hoping that the migration finishes before the job times out.
>  It
> > > > would
> > > > > be awesome to be able to manage that timeout and I know there are a
> > ton
> > > > of
> > > > > settings I just don't know about and am hoping someone might be
> able
> > to
> > > > > point me in the right direction.
> > > > >
> > > > >
> > > > > ________________________________________
> > > > > From: Rafael Weingärtner [[email protected]]
> > > > > Sent: Monday, October 12, 2015 6:40 PM
> > > > > To: [email protected]
> > > > > Subject: [Questionable]  Re: Timeout with live migration
> > > > >
> > > > > I would first check your NICs' speed and load, the amount of RAM
> > > > allocated
> > > > > for the migrating VM and than check the hypervisor log files.
> > > > >
> > > > > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
> > > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > What version are you running? Check if the copy.volume.wait
> setting
> > > is
> > > > > set
> > > > > > to 7200 and increase it. If not you could also check
> > > > > > job.cancel.threshold.minutes and job.expire.minutes.
> > > > > >
> > > > > > -Jan-Arve
> > > > > >
> > > > > > 2015-10-13 0:46 GMT+02:00 Ryan Farrington <
> > [email protected]
> > > >:
> > > > > >
> > > > > > > We are experiencing a failure in cloudstack waiting for an
> async
> > > job
> > > > > > > performing a live migration of a volume to finish. I've copied
> > the
> > > > > > relevant
> > > > > > > log entries below.We acknowledge that the migration will take a
> > few
> > > > > hours
> > > > > > > based on the volume of the data and we are looking for a way to
> > > > > increase
> > > > > > > the timeout of 7200 seconds into something we know we can work
> > > with.
> > > > > > >
> > > > > > >
> > > > > > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > > > > > > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send
> > command,
> > > > due
> > > > > > to
> > > > > > > Agent:27, com.cloud.exception.OperationTimedoutException:
> > Commands
> > > > > > > 835325398 to Host 27 timed out after 7200
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Rafael Weingärtner
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Rafael Weingärtner
> > > >
> > >
> > >
> > >
> > > --
> > > Rafael Weingärtner
> > >
> >
> >
> >
> > --
> > Rafael Weingärtner
> >
>
>
>
> --
> Rafael Weingärtner
>



-- 
Rafael Weingärtner

Re: Timeout with live migration

Reply via email to