For completeness, here's what I did:
## Show current timeout value.
[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:36:05 EST 2017
Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61 target-role=Stopped
Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
migrate-to interval=0s *timeout=1200*
(zs95kjg110061_res-migrate-to-interval-0s)
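## (Sketch) To double-check the timeout value and units exactly as stored in
## the CIB (including whether a unit suffix made it in), the raw XML can be
## grepped; the pattern below is just illustrative:
pcs cluster cib | grep -B1 -A1 migrate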
## Change the migrate-to timeout value from 1200 (ms) to 360s ...
[root@zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to
timeout="360s"
[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:38:10 EST 2017
Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61 target-role=Stopped
Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
migrate-to interval=0s *timeout=360s*
(zs95kjg110061_res-migrate-to-interval-0s)
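## (Sketch) Since the lrmd keeps reporting 20000ms, it may also be worth
## checking whether a cluster-wide operation default is supplying that value;
## run with no arguments, this just lists any configured op defaults:
pcs resource op defaults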
[root@zs95kj ~]# date;pcs resource enable zs95kjg110061_res
Tue Jan 17 13:40:55 EST 2017
[root@zs95kj ~]#
[root@zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res
Tue Jan 17 13:41:16 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
## Started the I/O-intensive 'blast' workload on the guest.
## Initiated the LGM via the resource move CLI ...
[root@zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1
Tue Jan 17 13:54:50 EST 2017
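## (Sketch) While the migration is in flight, the qemu migration job can be
## watched from the source host in a second terminal, e.g.:
virsh --connect=qemu:///system domjobinfo zs95kjg110061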
## The system log shows the following errors:
Jan 17 13:54:53 zs95kj crmd[27555]: notice: Operation
zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0,
cib-update=450, confirmed=true)
Jan 17 13:54:53 zs95kj attrd[27553]: notice: Removing all zs95kjg110061
attributes for zs95kjpcs1
Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO:
zs95kjg110061: *Starting live migration to zs90kppcs1 (using: virsh
--connect=qemu:///system --quiet migrate --live zs95kjg110061
qemu+ssh://zs90kppcs1/system ).*
Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out
Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
Jan 17 13:55:14 zs95kj crmd[27555]: error: Operation
zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941,
timeout=20000ms)
Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO:
Issuing graceful shutdown request for domain zs95kjg110061.
Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061
terminated.
Jan 17 13:55:26 zs95kj crmd[27555]: notice: Operation
zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0,
cib-update=459, confirmed=true)
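## (Sketch) To see where the guest actually ended up after the failed
## migrate_to, virsh can be checked on each host directly, e.g.:
virsh --connect=qemu:///system list --all | grep zs95kjg110061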
This is consistent with my original symptom: the "internal" timeout value
of 20000ms seems to override the migrate-to timeout configured on the
resource, assuming the migrate-to operation attribute is in fact what
controls the migration timeout in this case.
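For what it's worth, the operation definition as pacemaker itself stores it
can be dumped with crm_resource (a sketch; the migrate-to <op> entry there
should carry the timeout I set above):
crm_resource --resource zs95kjg110061_res --query-xml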
I happened to have full pacemaker debug enabled, so here is roughly what I
see in /var/log/cluster/pacemaker.log for the above migration and for the
resources associated with zs95kjg110061. There doesn't seem to be much
logged prior to the migration timeout, which occurred at 13:55:14:
Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2005 ) debug:
do_lrm_rsc_op: Stopped 0 recurring operations in preparation for
zs95kjg110061_res_migrate_to_0
Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2009 ) info: do_lrm_rsc_op:
Performing key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd
op=zs95kjg110061_res_migrate_to_0
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:817 ) trace:
lrmd_send_command: sending lrmd_rsc_exec op to lrmd
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:488 ) trace:
lrmd_create_op: Sending call options: 00000004, 4
Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1644 ) trace:
process_lrmd_message: Processing lrmd_rsc_exec operation from
a2a90d95-a766-42ba-84a0-54574c10adee
Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:347 ) trace:
schedule_lrmd_cmd: Scheduling migrate_to on zs95kjg110061_res
Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1684 ) debug:
process_lrmd_message: Processed lrmd_rsc_exec operation from
a2a90d95-a766-42ba-84a0-54574c10adee: rc=941, reply=1, notify=0,
exit=-2147471352
Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:131 ) info: log_execute:
executing - rsc:zs95kjg110061_res action:migrate_to call_id:941
Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1184 ) trace:
lrmd_rsc_execute_service_lib: Creating action,
resource:zs95kjg110061_res action:migrate_to class:ocf
provider:heartbeat agent:VirtualDomain
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:846 ) trace:
lrmd_send_command: lrmd_rsc_exec op reply received
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:852 ) trace:
lrmd_send_command: Reply <lrmd_reply lrmd_origin="send_reply"
lrmd_rc="941" lrmd_callid="941"/>
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:300 ) trace:
lrmd_dispatch_internal: op lrmd_rsc_exec notify event received
Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2385 ) info:
process_lrm_event: Operation zs95kjg110061_res_monitor_30000: Cancelled
(node=zs95kjpcs1, call=936, confirmed=true)
Jan 17 *13:54:54* [27555] zs95kj crmd: ( lrm.c:196 ) debug:
update_history_cache: Updating history for 'zs95kjg110061_res' with
monitor op
VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54 DEBUG:
Virtual domain zs95kjg110061 is currently running.
VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54
INFO:*zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh
--connect=qemu:///system --quiet migrate --live zs95kjg110061
qemu+ssh://zs90kppcs1/system ).*
Jan 17 13:54:54 [27550] zs95kj cib: ( xml.c:1728 ) info: cib_perform_op:
+ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='zs95kjg110061_res']/lrm_rsc_op[@id='zs95kjg110061_res_last_0']:
@operation_key=zs95kjg110061_res_migrate_to_0, @operation=migrate_to,
@crm-debug-origin=cib_action_update,
@transition-key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd,
@transition-magic=-1:193;337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd,
@call-id=-1, @rc-code=193, @op-s
## 20 seconds after the live migration was initiated ... we get the timeout:
Jan 17 13:55:14 [27552] zs95kj lrmd: ( mainloop.c:949 ) warning:
child_timeout_callback: zs95kjg110061_res_migrate_to_0 process (PID
135045) timed out
Jan 17 *13:55:14* [27552] zs95kj lrmd: (services_lin:314 ) warning:
operation_finished: zs95kjg110061_res_migrate_to_0:135045 - timed out
after 20000ms
Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:332 ) debug:
operation_finished: zs95kjg110061_res_migrate_to_0:135045:stderr [ --
empty -- ]
Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:336 ) debug:
operation_finished: zs95kjg110061_res_migrate_to_0:135045:stdout [ --
empty -- ]
Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:592 ) trace: cmd_finalize:
Resource operation rsc:zs95kjg110061_res action:migrate_to completed
(0xa5f1edd0 0xa5f1edd0)
Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:113 ) info: log_finished:
finished - rsc:zs95kjg110061_res action:migrate_to call_id:941
pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms
Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:1292 ) trace:
lrmd_rsc_execute: Nothing further to do for zs95kjg110061_res
Jan 17 13:55:14 [27555] zs95kj crmd: ( utils.c:1942 ) debug:
create_operation_update: do_update_resource: Updating resource
zs95kjg110061_res after migrate_to op Timed Out (interval=0)
Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:2397 ) error:
process_lrm_event: Operation zs95kjg110061_res_migrate_to_0: Timed Out
(node=zs95kjpcs1, call=941, timeout=20000ms)
Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:196 ) debug:
update_history_cache: Updating history for 'zs95kjg110061_res' with
migrate_to op
Any ideas?
Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
From: Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date: 01/17/2017 11:41 AM
Subject: Re: [ClusterLabs] Live Guest Migration timeouts for
VirtualDomain resources
------------------------------------------------------------------------
On 01/17/2017 10:19 AM, Scott Greenlese wrote:
Hi..
I've been testing live guest migration (LGM) with VirtualDomain
resources, which are guests running on Linux KVM / System Z
managed by pacemaker.
I'm looking for documentation that explains how to configure my
VirtualDomain resources such that they will not time out prematurely when
there is a heavy I/O workload running on the guest.
If I perform the LGM with an unmanaged guest (resource disabled), it
takes anywhere from 2 - 5 minutes to complete the LGM.
Example:
# Migrate guest, specify a timeout value of 600s
[root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
--persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
qemu+ssh://zs90kppcs1/system
Mon Jan 16 16:35:32 EST 2017
Migration: [100 %]
[root@zs95kj VD]# date
Mon Jan 16 16:40:01 EST 2017
[root@zs95kj VD]#
Start: 16:35:32
End: 16:40:01
Total: *4 min 29 sec*
In comparison, when the guest is managed by pacemaker, and enabled for
LGM ... I get this:
[root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
Mon Jan 16 15:13:33 EST 2017
Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61
Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
*migrate-to* interval=0s *timeout=1200*
(zs95kjg110061_res-migrate-to-interval-0s)
NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
to 1200. Is this seconds? If so, that's 20 minutes,
ample time to complete a 5 minute migration.
Not sure where the default of 1200 comes from, but I believe the default
is milliseconds if no unit is specified. Normally you'd specify
something like "timeout=1200s".
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:27:01 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]#
[root@zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
Mon Jan 16 14:45:39 EST 2017
You have new mail in /var/spool/mail/root
Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
--connect=qemu:///system --quiet migrate --live zs95kjg110061
qemu+ssh://zs95kjpcs1/system ).*
Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
timeout=20000ms)
Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
unexpectedly failed
[root@zs90KP VD]#
So, the migration timed out after 20000ms. Assuming ms is milliseconds,
that's only 20 seconds. So, it seems that LGM timeout has
nothing to do with *migrate-to* on the resource definition.
Yes, ms is milliseconds. Pacemaker internally represents all times in
milliseconds, even though in most actual usage, it has 1-second granularity.
If your specified timeout is 1200ms, I'm not sure why it's using
20000ms. There may be a minimum enforced somewhere.
Also, what is the expected behavior when the migration times out? I
watched the VirtualDomain resource state during the migration process...
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:45:57 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:02 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:06 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:08 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:10 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:12 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Stopped
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:14 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:17 EST 2017
zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]#
So, it seems as if the guest migration actually did succeed, at least
the guest is running
on the target node (KVM host). However... I checked the
Failure handling is configurable, but by default, if a live migration
fails, the cluster will do a full restart (= full stop then start). So
basically, it turns from a live migration to a cold migration.
"blast" IO workload (writes to external, virtual storage accessible to
all cluster
hosts)
I can experiment with different *migrate-to* timeout value settings, but
would really
prefer to have a good understanding of timeout configuration and
recovery behavior first.
Thanks!
Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie,
N.Y.
INTERNET: swgre...@us.ibm.com