Ken and Co.,

Thanks for the useful information.
I bumped the migrate-to timeout value from 1200(ms) to 360s, which should be more than enough time to successfully migrate the resource (i.e. the KVM guest). The migration was again interrupted by a timeout at the 20000ms (20 second) mark, which stops/fails the resource (shutting down the guest), reassigns the resource to the target node, and then cold-starts the resource and guest on the target.

Does anybody know where this 20000ms timeout value gets set? 20 seconds is clearly inadequate for a virtual machine running any kind of I/O- or memory-intensive workload. I need a way to change that timeout value; if not via the migrate-to timeout on the resource, then how? Is it coded internally in the class=ocf provider=heartbeat type=VirtualDomain resource agent?

For completeness, here's what I did:

## Show current timeout value.

[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:36:05 EST 2017
 Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61 target-role=Stopped
  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)
              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=1200 (zs95kjg110061_res-migrate-to-interval-0s)

## Change timeout value from 1200(ms) to 360s ...
[root@zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to timeout="360s"

[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:38:10 EST 2017
 Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61 target-role=Stopped
  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)
              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=360s (zs95kjg110061_res-migrate-to-interval-0s)

[root@zs95kj ~]# date;pcs resource enable zs95kjg110061_res
Tue Jan 17 13:40:55 EST 2017

[root@zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res
Tue Jan 17 13:41:16 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1

## Started the I/O intensive 'blast' workload on the guest.

## Initiate the LGM via resource move CLI ..

[root@zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1
Tue Jan 17 13:54:50 EST 2017

## System log shows the following error thrown:

Jan 17 13:54:53 zs95kj crmd[27555]: notice: Operation zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0, cib-update=450, confirmed=true)
Jan 17 13:54:53 zs95kj attrd[27553]: notice: Removing all zs95kjg110061 attributes for zs95kjpcs1
Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO: zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh --connect=qemu:///system --quiet migrate --live zs95kjg110061 qemu+ssh://zs90kppcs1/system ).
Jan 17 13:55:14 zs95kj lrmd[27552]: warning: zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out
Jan 17 13:55:14 zs95kj lrmd[27552]: warning: zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
Jan 17 13:55:14 zs95kj crmd[27555]: error: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941, timeout=20000ms)
Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO: Issuing graceful shutdown request for domain zs95kjg110061.
Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061 terminated.
Jan 17 13:55:26 zs95kj crmd[27555]: notice: Operation zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0, cib-update=459, confirmed=true)

>> This is consistent with my original symptom... the "internal" timeout value of 20000ms seems to override the migrate-to timeout set on the resource, if in fact the migrate-to operation attribute is what controls the migration timeout in this case.

I happened to have full pacemaker debug enabled, so here's generally what I see in /var/log/cluster/pacemaker.log for the above migration and the resources associated with zs95kjg110061.
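## Side note: the operation definition as actually stored in the CIB can be
## dumped directly, to confirm what timeout the cluster has on record (a
## sketch; the xpath is illustrative, using the resource id from this thread):

```shell
## Query the CIB for the <operations> section of this resource; the
## timeout attribute shown here is what the cluster will actually use.
cibadmin --query --xpath "//primitive[@id='zs95kjg110061_res']/operations"
```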
There doesn't seem to be much logged prior to the migration timeout, which occurred at 13:55:14:

Jan 17 13:54:54 [27555] zs95kj crmd: (lrm.c:2005) debug: do_lrm_rsc_op: Stopped 0 recurring operations in preparation for zs95kjg110061_res_migrate_to_0
Jan 17 13:54:54 [27555] zs95kj crmd: (lrm.c:2009) info: do_lrm_rsc_op: Performing key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd op=zs95kjg110061_res_migrate_to_0
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:817) trace: lrmd_send_command: sending lrmd_rsc_exec op to lrmd
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:488) trace: lrmd_create_op: Sending call options: 00000004, 4
Jan 17 13:54:54 [27552] zs95kj lrmd: (lrmd.c:1644) trace: process_lrmd_message: Processing lrmd_rsc_exec operation from a2a90d95-a766-42ba-84a0-54574c10adee
Jan 17 13:54:54 [27552] zs95kj lrmd: (lrmd.c:347) trace: schedule_lrmd_cmd: Scheduling migrate_to on zs95kjg110061_res
Jan 17 13:54:54 [27552] zs95kj lrmd: (lrmd.c:1684) debug: process_lrmd_message: Processed lrmd_rsc_exec operation from a2a90d95-a766-42ba-84a0-54574c10adee: rc=941, reply=1, notify=0, exit=-2147471352
Jan 17 13:54:54 [27552] zs95kj lrmd: (lrmd.c:131) info: log_execute: executing - rsc:zs95kjg110061_res action:migrate_to call_id:941
Jan 17 13:54:54 [27552] zs95kj lrmd: (lrmd.c:1184) trace: lrmd_rsc_execute_service_lib: Creating action, resource:zs95kjg110061_res action:migrate_to class:ocf provider:heartbeat agent:VirtualDomain
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:846) trace: lrmd_send_command: lrmd_rsc_exec op reply received
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:852) trace: lrmd_send_command: Reply <lrmd_reply lrmd_origin="send_reply" lrmd_rc="941" lrmd_callid="941"/>
Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:300) trace: lrmd_dispatch_internal: op lrmd_rsc_exec notify event received
Jan 17 13:54:54 [27555] zs95kj crmd: (lrm.c:2385) info: process_lrm_event: Operation zs95kjg110061_res_monitor_30000: Cancelled (node=zs95kjpcs1, call=936, confirmed=true)
Jan 17 13:54:54 [27555] zs95kj crmd: (lrm.c:196) debug: update_history_cache: Updating history for 'zs95kjg110061_res' with monitor op
VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54 DEBUG: Virtual domain zs95kjg110061 is currently running.
VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54 INFO: zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh --connect=qemu:///system --quiet migrate --live zs95kjg110061 qemu+ssh://zs90kppcs1/system ).
Jan 17 13:54:54 [27550] zs95kj cib: (xml.c:1728) info: cib_perform_op: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='zs95kjg110061_res']/lrm_rsc_op[@id='zs95kjg110061_res_last_0']: @operation_key=zs95kjg110061_res_migrate_to_0, @operation=migrate_to, @crm-debug-origin=cib_action_update, @transition-key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd, @transition-magic=-1:193;337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd, @call-id=-1, @rc-code=193, @op-s

### 20 seconds after the live migration was initiated... we get the timeout:

Jan 17 13:55:14 [27552] zs95kj lrmd: (mainloop.c:949) warning: child_timeout_callback: zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out
Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:314) warning: operation_finished: zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:332) debug: operation_finished: zs95kjg110061_res_migrate_to_0:135045:stderr [ -- empty -- ]
Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:336) debug: operation_finished: zs95kjg110061_res_migrate_to_0:135045:stdout [ -- empty -- ]
Jan 17 13:55:14 [27552] zs95kj lrmd: (lrmd.c:592) trace: cmd_finalize: Resource operation rsc:zs95kjg110061_res action:migrate_to completed (0xa5f1edd0 0xa5f1edd0)
Jan 17 13:55:14 [27552] zs95kj lrmd: (lrmd.c:113) info: log_finished: finished - rsc:zs95kjg110061_res action:migrate_to call_id:941 pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms
Jan 17 13:55:14 [27552] zs95kj lrmd: (lrmd.c:1292) trace: lrmd_rsc_execute: Nothing further to do for zs95kjg110061_res
Jan 17 13:55:14 [27555] zs95kj crmd: (utils.c:1942) debug: create_operation_update: do_update_resource: Updating resource zs95kjg110061_res after migrate_to op Timed Out (interval=0)
Jan 17 13:55:14 [27555] zs95kj crmd: (lrm.c:2397) error: process_lrm_event: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941, timeout=20000ms)
Jan 17 13:55:14 [27555] zs95kj crmd: (lrm.c:196) debug: update_history_cache: Updating history for 'zs95kjg110061_res' with migrate_to op

Any ideas?

Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com


From: Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date: 01/17/2017 11:41 AM
Subject: Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

On 01/17/2017 10:19 AM, Scott Greenlese wrote:
> Hi..
>
> I've been testing live guest migration (LGM) with VirtualDomain
> resources, which are guests running on Linux KVM / System Z,
> managed by pacemaker.
>
> I'm looking for documentation that explains how to configure my
> VirtualDomain resources such that they will not time out
> prematurely when there is a heavy I/O workload running on the guest.
>
> If I perform the LGM with an unmanaged guest (resource disabled), it
> takes anywhere from 2 - 5 minutes to complete the LGM.
> Example:
>
> # Migrate guest, specify a timeout value of 600s
>
> [root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
> --persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
> qemu+ssh://zs90kppcs1/system
> Mon Jan 16 16:35:32 EST 2017
>
> Migration: [100 %]
>
> [root@zs95kj VD]# date
> Mon Jan 16 16:40:01 EST 2017
>
> Start: 16:35:32
> End:   16:40:01
> Total: *4 min 29 sec*
>
> In comparison, when the guest is managed by pacemaker, and enabled for
> LGM, I get this:
>
> [root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
> Mon Jan 16 15:13:33 EST 2017
>  Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)
>   Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh
>   Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61
>   Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)
>               stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
>               monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
>               migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)
>               *migrate-to* interval=0s *timeout=1200* (zs95kjg110061_res-migrate-to-interval-0s)
>
> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
> to 1200. Is this seconds? If so, that's 20 minutes,
> ample time to complete a 5 minute migration.
Not sure where the default of 1200 comes from, but I believe the default is milliseconds if no unit is specified. Normally you'd specify something like "timeout=1200s".

> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:27:01 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
>
> [root@zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
> Mon Jan 16 14:45:39 EST 2017
> You have new mail in /var/spool/mail/root
>
> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO: zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh --connect=qemu:///system --quiet migrate --live zs95kjg110061 qemu+ssh://zs95kjpcs1/system ).*
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978, timeout=20000ms)
> Jan 16 14:45:58 zs90kp journal: operation failed: migration job: unexpectedly failed
> [root@zs90KP VD]#
>
> So, the migration timed out after 20000ms. Assuming ms is milliseconds,
> that's only 20 seconds. So, it seems that LGM timeout has
> nothing to do with *migrate-to* on the resource definition.

Yes, ms is milliseconds. Pacemaker internally represents all times in milliseconds, even though in most actual usage it has 1-second granularity.

If your specified timeout is 1200ms, I'm not sure why it's using 20000ms. There may be a minimum enforced somewhere.

> Also, what is the expected behavior when the migration times out? I
> watched the VirtualDomain resource state during the migration process...
>
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:45:57 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:02 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:06 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:08 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:10 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:12 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Stopped
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:14 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:17 EST 2017
>  zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
>
> So, it seems as if the guest migration actually did succeed, at least
> the guest is running on the target node (KVM host). However... I checked the

Failure handling is configurable, but by default, if a live migration fails, the cluster will do a full restart (= full stop, then start). So basically, it turns from a live migration into a cold migration.
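A practical aside on that recovery path: after a failed migrate_to like this, the cluster records a failure for the resource. It can be inspected and cleared with pcs before retrying (a sketch, using the resource name from this thread):

```shell
## Show how many times the resource has failed, per node
pcs resource failcount show zs95kjg110061_res

## Clear the failure history so a retried migration starts clean
pcs resource cleanup zs95kjg110061_res
```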
> "blast" IO workload (writes to external, virtual storage accessible to > both all cluster > hosts) > > I can experiment with different *migrate-to* timeout value settings, but > would really > prefer to have a good understanding of timeout configuration and > recovery behavior first. > > Thanks! > > > Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y. > INTERNET: swgre...@us.ibm.com _______________________________________________ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org