Re: [Questionable] Re: Timeout with live migration

Jakub Kublik Tue, 13 Oct 2015 01:46:46 -0700

Hi Ryan,

we hit the same problem few days ago. After we changed parameters:migratewait, storage.pool.max.waitseconds, vm.op.cancel.interval andvm.op.cleanup.wait to 36000 (10h) we can migrate large volumes (500GBand more).

We use xenserver 6.5 and ACS 4.5.1
Regards,
Kuba


W dniu 2015-10-13 o 05:15, Ryan Farrington pisze:

Yes i can open JIRA tickets. What would you like for me to do?

I'll be happy to change the "wait" parameter.  Do I assume it should be 1/2 of 
the value i want it to be?



________________________________________
From: Rafael Weingärtner [[email protected]]
Sent: Monday, October 12, 2015 10:12 PM
To: [email protected]
Subject: [Questionable]  Re: Timeout with live migration

There is your problem, there are currently two distinct values conrolling
those async jobs.
Change that value and everything will work for u.
Can you open a jira ticket?

On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington <[email protected]

wrote:
wait is currently configured to be 3600



________________________________________
From: Rafael Weingärtner [[email protected]]
Sent: Monday, October 12, 2015 9:46 PM
To: [email protected]
Subject: [Questionable]  Re: Timeout with live migration

I found something odd,
can you check the parameter called "wait", what value is it using ?

On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <
[email protected]

wrote:
Yes the parameter was set long ago and the management server has been
restarted numerous time over the past few days as we played with other
parameters to no effect.

After looking at the log a little more does the "Failed to send command,
due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
996939857 to Host 38 timed out after 7200" mean that the migration start
command is being sent in some kind of synchronous mode and not returning
control back to the job manager?




________________________________________
From: Rafael Weingärtner [[email protected]]
Sent: Monday, October 12, 2015 8:46 PM
To: [email protected]
Subject: [Questionable]  Re: Timeout with live migration

I thought you using the command  “migrateVirtualMachineWithVolume” but it
seems that you are using “migrateVolume” command from ACS's API.


For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
3600, means 1 hour of timeout.

For the “migrateVolume” is the same, they both end up in

“com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,

and in that method the parameter is the same.


If your parameter is set to 36000 (10 hours) I do not see why you are
getting the exception after 2 hours.

Did you restart the management servers after you changed the parameter?

On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
[email protected]

wrote:
Here is the full log, including the stack for the exception, that we

get

at the 2 hour mark. as for the migratewait it is set to 36000 which

should

be 10 hours.

2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
(DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting

some

more time because this is the current command
2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for
exceptions
2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out

on

Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via:

38(xen-nc-bc2b7),

Ver: v1, Flags: 100111,

[{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]

}
2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857:

Cancelling.

2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
commands found
2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due

to

Agent:38, com.cloud.exception.OperationTimedoutException: Commands
996939857 to Host 38 timed out after 7200
2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
com.cloud.utils.exception.CloudRuntimeException: Failed to send

command,

due to Agent:38, com.cloud.exception.OperationTimedoutException:

Commands

996939857 to Host 38 timed out after 7200
         at

org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)

at

org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)

at

org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)

at

org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)

at

org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)

at

com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)

at

com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)

at

com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)

         at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown

Source)

at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

         at java.lang.reflect.Method.invoke(Method.java:622)
         at

org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)

at

org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)

at

org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)

at

org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)

at

org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)

at

org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)

         at com.sun.proxy.$Proxy196.migrateVolume(Unknown Source)
         at

org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd.execute(MigrateVolumeCmd.java:103)

         at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
         at

com.cloud.api.ApiAsyncJobDispatcher.runJobInContext(ApiAsyncJobDispatcher.java:109)

at

com.cloud.api.ApiAsyncJobDispatcher$1.run(ApiAsyncJobDispatcher.java:66)

at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)

at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)

at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)

at

com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:63)

at

org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:509)

at

org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)

at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)

at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)

at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)

at

org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)

         at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
         at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
         at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)

at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

         at java.lang.Thread.run(Thread.java:701)
2015-10-12 18:41:20,479 WARN  [o.a.c.s.d.ObjectInDataStoreManagerImpl]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Unsupported data object
(VOLUME,
org.apache.cloudstack.storage.datastore.PrimaryDataStoreImpl@4fa7a45f

),

no need to delete from object in store ref table
2015-10-12 18:41:20,479 DEBUG [c.c.s.VolumeApiServiceImpl]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) migrate volume
failed:com.cloud.utils.exception.CloudRuntimeException: Failed to send
command, due to Agent:38,

com.cloud.exception.OperationTimedoutException:

Commands 996939857 to Host 38 timed out after 7200
2015-10-12 18:41:20,480 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(Job-Executor-63:ctx-f7b6817d) Complete async job-5257, jobStatus:

FAILED,

resultCode: 530, result:

org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed

to migrate volume"}
2015-10-12 18:41:20,486 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(Job-Executor-63:ctx-f7b6817d) Done executing
org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd for

job-5257

2015-10-12 18:41:20,489 INFO  [o.a.c.f.j.i.AsyncJobMonitor]
(Job-Executor-63:ctx-f7b6817d) Remove job-5257 from job monitoring











________________________________________
From: Rafael Weingärtner [[email protected]]
Sent: Monday, October 12, 2015 8:24 PM
To: [email protected]
Subject: [Questionable]  Re: [Questionable] Re: Timeout with live

migration

Now I understand what you are doing, I am familiar with that concept

(live

migration of VM within a cluster, having the VHD being moved from one

SR

to

another).

I just got confused when I read live migration of volumes (a volume

does

not run by itself, so that why I asked a little for some more

information).

Looking at the source code this is the variable used to control the
timeout:
"long timeout = (_migratewait) * 1000L;"

The value of "_migratewait" is taken from this parameter:
value = (String) params.get("migratewait");
_migratewait = NumbersUtil.parseInt(value, 3600);

Therefore, the name of the parameter to be configured is "migratewait",

the

default value is 3600.


BTW1: I think that is a terrible parameter name. We should refactor

that,

could you open a Jira ticket for that?

BTW2: that error message you posted does not seem to be related to the
migration timeout; hence, in the code if the copy times out the message
would be:
"Async " + timeout/1000 + " seconds timeout for task " +

task.toString()"

Maybe because it throws a "Types.BadAsyncResult(msg)" and that might be
translated into that message, or that might not be related to the

problem

itself, and you just thought that it was.


Does it help you?


On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington <
[email protected]

wrote:
Hypervisor:  XenServer

We are moving a data volume from one storage onto another without

shutting

down the VM cause that would just be silly and a triplication of

effort

with the whole copying to secondary storage and then back off again.

The

volume is staying in the same cluster just moving to a different

Primary

storage (or SR in the XenServer vernacular)

If you are familiar with ESX this is a "Storage VMotion" where as in
XenServer it is called "Storage XenMotion".

________________________________________
From: Rafael Weingärtner [[email protected]]
Sent: Monday, October 12, 2015 7:53 PM
To: [email protected]
Subject: [Questionable]  Re: Timeout with live migration

what do you mean with livre migrating data volume ?!
I understand a live migration of a VM, but volumes...

do you mean live migrating a VM that has a volume attached?
are you migrating that volume to a different cluster? or just a

different

storage in the same cluster?
What hypervisor are you using ?


On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington <
[email protected]>
wrote:

Live migrating a data volume. We are purely on shared storage so no

local

storage is involved.

________________________________________
From: Rafael Weingärtner [[email protected]]
Sent: Monday, October 12, 2015 7:37 PM
To: [email protected]
Subject: [Questionable]  Re: Timeout with live migration

Are you live migrating a VM, or migrating a volume of a stopped VM

to a

different primary storage?

If it is a running VM, is the VM allocated in a shared storage or

local

storage?

On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <
[email protected]>
wrote:

The slow transfer is related to the storage we are trying to

migrate

off

of.  We are capable of getting about 350mbps off the disks but

when

we

are

moving volumes that are greater than about 500GB we end up racing

the

clock

and hoping that the migration finishes before the job times out.

It

would

be awesome to be able to manage that timeout and I know there

are a

ton

of

settings I just don't know about and am hoping someone might be

able

to

point me in the right direction.


________________________________________
From: Rafael Weingärtner [[email protected]]
Sent: Monday, October 12, 2015 6:40 PM
To: [email protected]
Subject: [Questionable]  Re: Timeout with live migration

I would first check your NICs' speed and load, the amount of RAM

allocated

for the migrating VM and than check the hypervisor log files.

On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
[email protected]>
wrote:

What version are you running? Check if the copy.volume.wait

setting

is

set

to 7200 and increase it. If not you could also check
job.cancel.threshold.minutes and job.expire.minutes.

-Jan-Arve

2015-10-13 0:46 GMT+02:00 Ryan Farrington <

[email protected]

We are experiencing a failure in cloudstack waiting for an

async

job

performing a live migration of a volume to finish. I've

copied

the

relevant

log entries below.We acknowledge that the migration will

take a

few

hours

based on the volume of the data and we are looking for a way

to

increase

the timeout of 7200 seconds into something we know we can

work

with.


2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
(Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send

command,

due

to

Agent:27, com.cloud.exception.OperationTimedoutException:

Commands

835325398 to Host 27 timed out after 7200



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner

Re: [Questionable] Re: Timeout with live migration

Reply via email to