Hi Ryan,
we hit the same problem few days ago. After we changed parameters: migratewait, storage.pool.max.waitseconds, vm.op.cancel.interval and vm.op.cleanup.wait to 36000 (10h) we can migrate large volumes (500GB and more).
We use xenserver 6.5 and ACS 4.5.1
Regards,
Kuba

W dniu 2015-10-13 o 05:15, Ryan Farrington pisze:
Yes i can open JIRA tickets. What would you like for me to do?

I'll be happy to change the "wait" parameter.  Do I assume it should be 1/2 of 
the value i want it to be?



________________________________________
From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 10:12 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

There is your problem, there are currently two distinct values conrolling
those async jobs.
Change that value and everything will work for u.
Can you open a jira ticket?

On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington <rfarring...@remitdata.com
wrote:
wait is currently configured to be 3600



________________________________________
From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 9:46 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

I found something odd,
can you check the parameter called "wait", what value is it using ?

On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <
rfarring...@remitdata.com
wrote:
Yes the parameter was set long ago and the management server has been
restarted numerous time over the past few days as we played with other
parameters to no effect.

After looking at the log a little more does the "Failed to send command,
due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
996939857 to Host 38 timed out after 7200" mean that the migration start
command is being sent in some kind of synchronous mode and not returning
control back to the job manager?




________________________________________
From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 8:46 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

I thought you using the command  “migrateVirtualMachineWithVolume” but it
seems that you are using “migrateVolume” command from ACS's API.


For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
3600, means 1 hour of timeout.

For the “migrateVolume” is the same, they both end up in


“com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
and in that method the parameter is the same.


If your parameter is set to 36000 (10 hours) I do not see why you are
getting the exception after 2 hours.

Did you restart the management servers after you changed the parameter?

On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
rfarring...@remitdata.com
wrote:
Here is the full log, including the stack for the exception, that we
get
at the 2 hour mark. as for the migratewait it is set to 36000 which
should
be 10 hours.

2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
(DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting
some
more time because this is the current command
2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for
exceptions
2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out
on
Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via:
38(xen-nc-bc2b7),
Ver: v1, Flags: 100111,

[{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
}
2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857:
Cancelling.
2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
commands found
2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due
to
Agent:38, com.cloud.exception.OperationTimedoutException: Commands
996939857 to Host 38 timed out after 7200
2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
com.cloud.utils.exception.CloudRuntimeException: Failed to send
command,
due to Agent:38, com.cloud.exception.OperationTimedoutException:
Commands
996939857 to Host 38 timed out after 7200
         at

org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
         at

org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
         at

org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
         at

org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
         at

org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
         at

com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
         at

com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
         at

com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)
         at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown
Source)
         at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:622)
         at

org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
         at

org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
         at

org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
         at

org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
         at

org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
         at

org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
         at com.sun.proxy.$Proxy196.migrateVolume(Unknown Source)
         at

org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd.execute(MigrateVolumeCmd.java:103)
         at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
         at

com.cloud.api.ApiAsyncJobDispatcher.runJobInContext(ApiAsyncJobDispatcher.java:109)
         at

com.cloud.api.ApiAsyncJobDispatcher$1.run(ApiAsyncJobDispatcher.java:66)
         at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
         at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
         at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
         at

com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:63)
         at

org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:509)
         at

org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
         at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
         at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
         at

org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
         at

org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
         at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
         at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
         at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
         at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
         at java.lang.Thread.run(Thread.java:701)
2015-10-12 18:41:20,479 WARN  [o.a.c.s.d.ObjectInDataStoreManagerImpl]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Unsupported data object
(VOLUME,
org.apache.cloudstack.storage.datastore.PrimaryDataStoreImpl@4fa7a45f
),
no need to delete from object in store ref table
2015-10-12 18:41:20,479 DEBUG [c.c.s.VolumeApiServiceImpl]
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) migrate volume
failed:com.cloud.utils.exception.CloudRuntimeException: Failed to send
command, due to Agent:38,
com.cloud.exception.OperationTimedoutException:
Commands 996939857 to Host 38 timed out after 7200
2015-10-12 18:41:20,480 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(Job-Executor-63:ctx-f7b6817d) Complete async job-5257, jobStatus:
FAILED,
resultCode: 530, result:

org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed
to migrate volume"}
2015-10-12 18:41:20,486 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(Job-Executor-63:ctx-f7b6817d) Done executing
org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd for
job-5257
2015-10-12 18:41:20,489 INFO  [o.a.c.f.j.i.AsyncJobMonitor]
(Job-Executor-63:ctx-f7b6817d) Remove job-5257 from job monitoring











________________________________________
From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 8:24 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: [Questionable] Re: Timeout with live
migration
Now I understand what you are doing, I am familiar with that concept
(live
migration of VM within a cluster, having the VHD being moved from one
SR
to
another).

I just got confused when I read live migration of volumes (a volume
does
not run by itself, so that why I asked a little for some more
information).
Looking at the source code this is the variable used to control the
timeout:
"long timeout = (_migratewait) * 1000L;"

The value of "_migratewait" is taken from this parameter:
value = (String) params.get("migratewait");
_migratewait = NumbersUtil.parseInt(value, 3600);

Therefore, the name of the parameter to be configured is "migratewait",
the
default value is 3600.


BTW1: I think that is a terrible parameter name. We should refactor
that,
could you open a Jira ticket for that?

BTW2: that error message you posted does not seem to be related to the
migration timeout; hence, in the code if the copy times out the message
would be:
"Async " + timeout/1000 + " seconds timeout for task " +
task.toString()"
Maybe because it throws a "Types.BadAsyncResult(msg)" and that might be
translated into that message, or that might not be related to the
problem
itself, and you just thought that it was.


Does it help you?


On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington <
rfarring...@remitdata.com
wrote:
Hypervisor:  XenServer

We are moving a data volume from one storage onto another without
shutting
down the VM cause that would just be silly and a triplication of
effort
with the whole copying to secondary storage and then back off again.
The
volume is staying in the same cluster just moving to a different
Primary
storage (or SR in the XenServer vernacular)

If you are familiar with ESX this is a "Storage VMotion" where as in
XenServer it is called "Storage XenMotion".

________________________________________
From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 7:53 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

what do you mean with livre migrating data volume ?!
I understand a live migration of a VM, but volumes...

do you mean live migrating a VM that has a volume attached?
are you migrating that volume to a different cluster? or just a
different
storage in the same cluster?
What hypervisor are you using ?


On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington <
rfarring...@remitdata.com>
wrote:

Live migrating a data volume. We are purely on shared storage so no
local
storage is involved.

________________________________________
From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 7:37 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

Are you live migrating a VM, or migrating a volume of a stopped VM
to a
different primary storage?

If it is a running VM, is the VM allocated in a shared storage or
local
storage?

On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <
rfarring...@remitdata.com>
wrote:

The slow transfer is related to the storage we are trying to
migrate
off
of.  We are capable of getting about 350mbps off the disks but
when
we
are
moving volumes that are greater than about 500GB we end up racing
the
clock
and hoping that the migration finishes before the job times out.
  It
would
be awesome to be able to manage that timeout and I know there
are a
ton
of
settings I just don't know about and am hoping someone might be
able
to
point me in the right direction.


________________________________________
From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 6:40 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

I would first check your NICs' speed and load, the amount of RAM
allocated
for the migrating VM and than check the hypervisor log files.

On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
jan.arve.nyg...@gmail.com>
wrote:

What version are you running? Check if the copy.volume.wait
setting
is
set
to 7200 and increase it. If not you could also check
job.cancel.threshold.minutes and job.expire.minutes.

-Jan-Arve

2015-10-13 0:46 GMT+02:00 Ryan Farrington <
rfarring...@remitdata.com
:
We are experiencing a failure in cloudstack waiting for an
async
job
performing a live migration of a volume to finish. I've
copied
the
relevant
log entries below.We acknowledge that the migration will
take a
few
hours
based on the volume of the data and we are looking for a way
to
increase
the timeout of 7200 seconds into something we know we can
work
with.

2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
(Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send
command,
due
to
Agent:27, com.cloud.exception.OperationTimedoutException:
Commands
835325398 to Host 27 timed out after 7200





--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner



--
Rafael Weingärtner

Reply via email to