On 01/17/2017 10:19 AM, Scott Greenlese wrote:
> Hi..
>
> I've been testing live guest migration (LGM) with VirtualDomain
> resources, which are guests running on Linux KVM / System Z
> managed by pacemaker.
>
> I'm looking for documentation that explains how to configure my
> VirtualDomain resources such that they will not timeout
> prematurely when there is a heavy I/O workload running on the guest.
>
> If I perform the LGM with an unmanaged guest (resource disabled), it
> takes anywhere from 2 - 5 minutes to complete the LGM.
> Example:
>
> # Migrate guest, specify a timeout value of 600s
>
> [root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
> --persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
> qemu+ssh://zs90kppcs1/system
> Mon Jan 16 16:35:32 EST 2017
>
> Migration: [100 %]
>
> [root@zs95kj VD]# date
> Mon Jan 16 16:40:01 EST 2017
> [root@zs95kj VD]#
>
> Start: 16:35:32
> End: 16:40:01
> Total: *4 min 29 sec*
>
>
> In comparison, when the guest is managed by pacemaker, and enabled for
> LGM ... I get this:
>
> [root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
> Mon Jan 16 15:13:33 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> *migrate-to* interval=0s *timeout=1200*
> (zs95kjg110061_res-migrate-to-interval-0s)
>
> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
> to 1200. Is this seconds? If so, that's 20 minutes,
> ample time to complete a 5 minute migration.

Not sure where the default of 1200 comes from, but I believe the default
is milliseconds if no unit is specified. Normally you'd specify something
like "timeout=1200s".
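If you want to make the unit explicit, something along these lines should
do it (just a sketch, reusing the resource name from your output; exact pcs
syntax and behavior can vary between versions, and note that the
VirtualDomain agent's actions are spelled migrate_to / migrate_from):

  # set the live-migration timeouts with an explicit unit
  pcs resource update zs95kjg110061_res \
      op migrate_to timeout=1200s \
      op migrate_from timeout=1200s

  # then check what actually ended up in the configuration
  pcs resource show zs95kjg110061_res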
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:27:01 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root@zs95kj VD]#
>
>
> [root@zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
> Mon Jan 16 14:45:39 EST 2017
> You have new mail in /var/spool/mail/root
>
>
> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
> --connect=qemu:///system --quiet migrate --live zs95kjg110061
> qemu+ssh://zs95kjpcs1/system ).*
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
> timeout=20000ms)
> Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
> unexpectedly failed
> [root@zs90KP VD]#
>
> So, the migration timed out after 20000ms. Assuming ms is milliseconds,
> that's only 20 seconds. So, it seems that LGM timeout has
> nothing to do with *migrate-to* on the resource definition.

Yes, ms is milliseconds. Pacemaker internally represents all times in
milliseconds, even though in most actual usage, it has 1-second
granularity. If your specified timeout is 1200ms, I'm not sure why it's
using 20000ms. There may be a minimum enforced somewhere.

> Also, what is the expected behavior when the migration times out? I
> watched the VirtualDomain resource state during the migration process...
>
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:45:57 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:02 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:06 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:08 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:10 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:12 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Stopped
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:14 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:17 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> [root@zs95kj VD]#
>
>
> So, it seems as if the guest migration actually did succeed, at least
> the guest is running
> on the target node (KVM host). However... I checked the

Failure handling is configurable, but by default, if a live migration
fails, the cluster will do a full restart (= full stop then start). So
basically, it turns from a live migration to a cold migration.
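To make that concrete (just a sketch, reusing the resource name from this
thread; the output will of course look different on your cluster): the
failed migrate_to shows up as a failed action with a fail count, and once
you've confirmed the guest is healthy on the target node you can clear it:

  # show failed actions and per-resource fail counts
  crm_mon -1 --failcounts

  # forget the failure so the cluster no longer treats the resource as failed
  pcs resource cleanup zs95kjg110061_res

The recovery itself hangs off the operation's on-fail setting, which for
migrate_to defaults to "restart", i.e. the full stop then start described
above.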
> "blast" IO workload (writes to external, virtual storage accessible to both all cluster
> hosts)
>
> I can experiment with different *migrate-to* timeout value settings, but
> would really
> prefer to have a good understanding of timeout configuration and
> recovery behavior first.
>
> Thanks!
>
>
> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com