Hi. I've been testing live guest migration (LGM) with VirtualDomain resources, which are guests running on Linux KVM / IBM System z, managed by Pacemaker.
I'm looking for documentation that explains how to configure my VirtualDomain resources so that they will not time out prematurely when there is a heavy I/O workload running on the guest.

If I perform the LGM with an unmanaged guest (resource disabled), it takes anywhere from 2 to 5 minutes to complete. Example:

# Migrate guest, specify a timeout value of 600s
[root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live --persistent --undefinesource --timeout 600 --verbose zs95kjg110061 qemu+ssh://zs90kppcs1/system
Mon Jan 16 16:35:32 EST 2017
Migration: [100 %]
[root@zs95kj VD]# date
Mon Jan 16 16:40:01 EST 2017
[root@zs95kj VD]#

Start: 16:35:32
End:   16:40:01
Total: 4 min 29 sec

In comparison, when the guest is managed by pacemaker and enabled for LGM, I get this:

[root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
Mon Jan 16 15:13:33 EST 2017
 Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61
  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)
              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=1200 (zs95kjg110061_res-migrate-to-interval-0s)

NOTE: I didn't specify any timeout value for migrate-to, so it defaulted to 1200. Is this seconds? If so, that's 20 minutes, which is ample time to complete a 5-minute migration.

[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:27:01 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]#

[root@zs95kj VD]# date;pcs resource move zs95kjg110061_res zs95kjpcs1
Mon Jan 16 14:45:39 EST 2017

Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO: zs95kjg110061: Starting live migration to zs95kjpcs1 (using: virsh --connect=qemu:///system --quiet migrate --live zs95kjg110061 qemu+ssh://zs95kjpcs1/system ).
Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978, timeout=20000ms)
Jan 16 14:45:58 zs90kp journal: operation failed: migration job: unexpectedly failed
[root@zs90KP VD]#

So, the migration timed out after 20000ms. Assuming ms is milliseconds, that's only 20 seconds. It seems, then, that the LGM timeout has nothing to do with the migrate-to timeout on the resource definition.
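If it turns out to be just a units issue, my next step was going to be pinning the migration timeouts with an explicit "s" suffix and then confirming what actually lands in the CIB. Roughly this (untested; I'm assuming the pcs "resource update ... op" syntax replaces the existing operation definitions, and I'm using the migrate_to/migrate_from operation names the way lrmd logs them):

  # redefine the migration operations with explicit units (seconds)
  pcs resource update zs95kjg110061_res \
      op migrate_to interval=0s timeout=1200s \
      op migrate_from interval=0s timeout=1200s

  # dump the CIB and check the timeout values pacemaker actually sees
  pcs cluster cib | grep migrate

Before just bumping numbers, though, I'd like to understand why the configured 1200 apparently wasn't honored.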
Also, what is the expected behavior when the migration times out? I watched the VirtualDomain resource state during the migration process:

[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:45:57 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:02 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:06 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:08 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:10 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:12 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Stopped
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:14 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:17 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]#

So, it seems as if the guest migration actually did succeed; at least, the guest is running on the target node (KVM host). However... I checked the "blast" I/O workload (writes to external virtual storage accessible to all cluster hosts).

I can experiment with different migrate-to timeout settings, but I would really prefer to have a good understanding of the timeout configuration and the recovery behavior first.
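On the recovery side, my plan for the next run is to capture what pacemaker recorded immediately after the timed-out migrate_to, confirm where the guest really ended up, and then clear the failure before retrying. Something like this (a sketch; I'm assuming crm_mon's -r/-f flags and the pcs failcount/cleanup subcommands behave the way I remember):

  # one-shot cluster status, including inactive resources and fail counts
  crm_mon -1 -rf

  # fail count recorded for just this resource
  pcs resource failcount show zs95kjg110061_res

  # ask libvirt on the target host where the guest actually is
  virsh -c qemu+ssh://zs95kjpcs1/system list --all

  # clear the failure record so pacemaker manages the resource normally again
  pcs resource cleanup zs95kjg110061_res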
Thanks!

Scott Greenlese ...
IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org