I have discovered that sometimes when migrating a VM, the migration itself 
succeeds, but the migrate_from call on the target node fails, apparently 
because the domain's status hasn't settled yet.  This is more likely to 
happen when stopping pacemaker on a node, which causes all of its VMs to 
migrate away.  The migration succeeds, but then (sometimes) the status call 
in migrate_from fails, and the VM is unnecessarily stopped and restarted.  
Note that it is NOT a timeout problem: the migrate_from operation (which 
only checks status) takes less than a second.

I noticed that the VirtualDomain RA polls the status in a loop rather than 
checking it just once as the Xen RA does, so I patched similar logic into 
the Xen RA, and that solved my problem.  I don't know what the appropriate 
wait time is, so I copied the "2/3 of the timeout" logic from the stop 
function.  Perhaps the full timeout would be more appropriate?

Patch:
--- Xen.orig    2011-08-11 12:54:13.574795799 -0500
+++ Xen 2011-08-11 15:18:58.352851276 -0500
@@ -384,6 +384,20 @@
 }
 
 Xen_Migrate_From() {
+  if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
+      # Allow 2/3 of the action timeout for the status to stabilize
+      # (the timeout is in ms; /1500 combines ms->s with the 2/3 factor)
+      timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
+  else
+      timeout=10               # should be plenty
+  fi
+
+  while ! Xen_Status ${DOMAIN_NAME} && [ $timeout -gt 0 ]; do
+    ocf_log debug "$DOMAIN_NAME: Not yet active locally, waiting (timeout: ${timeout}s)"
+    timeout=$((timeout-1))
+    sleep 1
+  done
+
   if Xen_Status ${DOMAIN_NAME}; then
     Xen_Adjust_Memory 0
     ocf_log info "$DOMAIN_NAME: Active locally, migration successful"
=== end patch ===
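As a standalone sanity check on the arithmetic (this sketch is not part of 
the patch): dividing the millisecond timeout by 1500 folds the ms-to-seconds 
conversion together with the 2/3 factor, which is why the 30000 ms 
migrate_from timeout in the later log yields a 20-second wait.

```shell
# Standalone check of the patch's timeout math (not part of the patch).
# OCF_RESKEY_CRM_meta_timeout is in milliseconds; dividing by 1500
# combines ms->s (/1000) with taking 2/3 of the action timeout.
OCF_RESKEY_CRM_meta_timeout=30000        # e.g. a 30 s migrate_from timeout

timeout=$((OCF_RESKEY_CRM_meta_timeout / 1500))
echo "$timeout"                          # 30000/1500 = 20 (seconds)
```

With the 180 s migrate_to timeout from the resource definition below, the 
same formula gives 120 seconds.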


Example resource definition:
primitive vm-build11 ocf:heartbeat:Xen \
        params xmfile="/mnt/xen/vm/build11" \
        meta allow-migrate="true" target-role="Started" is-managed="true" \
        utilization cores="1" mem="512" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op migrate_to interval="0" timeout="180" \
        op monitor interval="30" timeout="30" start-delay="60"

Debug log excerpts (grep build11):

source node (app-02)
====
Aug 11 12:08:33 app-02 lrmd: [9952]: debug: on_msg_perform_op: add an operation 
operation migrate_to[50] on ocf::Xen::vm-build11 for client 9955, its 
parameters: CRM_meta_migrate_source=[app-02] CRM_meta_migrate_target=[app-04] 
CRM_meta_record_pending=[false] CRM_meta_name=[migrate_to] 
CRM_meta_timeout=[180000] crm_feature_set=[3.0.5] xmfile=[/mnt/xen/vm/build11]  
to the operation list.
Aug 11 12:08:33 app-02 lrmd: [9952]: info: rsc:vm-build11:50: migrate_to
Aug 11 12:08:33 app-02 crmd: [9955]: info: process_lrm_event: LRM operation 
vm-build11_monitor_30000 (call=49, status=1, cib-update=0, confirmed=true) 
Cancelled
Aug 11 12:08:33 app-02 Xen[29909]: [29984]: INFO: build11: Starting xm migrate 
to app-04
Aug 11 12:08:43 app-02 Xen[29909]: [30228]: INFO: build11: xm migrate to app-04 
succeeded.
Aug 11 12:08:43 app-02 lrmd: [9952]: info: Managed vm-build11:migrate_to 
process 29909 exited with return code 0.
Aug 11 12:08:43 app-02 crmd: [9955]: debug: create_operation_update: 
do_update_resource: Updating resouce vm-build11 after complete migrate_to op 
(interval=0)
Aug 11 12:08:43 app-02 crmd: [9955]: info: process_lrm_event: LRM operation 
vm-build11_migrate_to_0 (call=50, rc=0, cib-update=75, confirmed=true) ok
Aug 11 12:08:43 app-02 crmd: [9955]: info: do_lrm_rsc_op: Performing 
key=141:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_stop_0 )
Aug 11 12:08:43 app-02 lrmd: [9952]: debug: on_msg_perform_op: add an operation 
operation stop[52] on ocf::Xen::vm-build11 for client 9955, its parameters: 
crm_feature_set=[3.0.5]  to the operation list.
Aug 11 12:08:43 app-02 lrmd: [9952]: info: rsc:vm-build11:52: stop
Aug 11 12:08:44 app-02 Xen[30381]: [30469]: INFO: Xen domain build11 already 
stopped.
Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr) 
Error:
Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr)  
Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr) 
Domain 'build11' does not exist.
Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr) 
Aug 11 12:08:44 app-02 lrmd: [9952]: info: Managed vm-build11:stop process 
30381 exited with return code 0.
Aug 11 12:08:44 app-02 crmd: [9955]: debug: create_operation_update: 
do_update_resource: Updating resouce vm-build11 after complete stop op 
(interval=0)
Aug 11 12:08:44 app-02 crmd: [9955]: info: process_lrm_event: LRM operation 
vm-build11_stop_0 (call=52, rc=0, cib-update=78, confirmed=true) ok
====
(I'm guessing the "already stopped" after migrate_to succeeds is not a problem.)


target node (app-04):
====
Aug 11 12:08:43 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing 
key=200:68:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_migrate_from_0 )
Aug 11 12:08:43 app-04 lrmd: [9940]: info: rsc:vm-build11:88: migrate_from
Aug 11 12:08:43 app-04 Xen[1574]: [1619]: ERROR: build11: Not active locally, 
migration failed!
Aug 11 12:08:43 app-04 crmd: [9943]: info: process_lrm_event: LRM operation 
vm-build11_migrate_from_0 (call=88, rc=1, cib-update=144, confirmed=true) 
unknown error
Aug 11 12:08:43 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing 
key=16:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_stop_0 )
Aug 11 12:08:43 app-04 lrmd: [9940]: info: rsc:vm-build11:89: stop
Aug 11 12:08:44 app-04 Xen[1744]: [1852]: INFO: Xen domain build11 will be 
stopped (timeout: 20s)
Aug 11 12:08:51 app-04 Xen[1744]: [2146]: INFO: Xen domain build11 stopped.
Aug 11 12:08:51 app-04 crmd: [9943]: info: process_lrm_event: LRM operation 
vm-build11_stop_0 (call=89, rc=0, cib-update=145, confirmed=true) ok
Aug 11 12:09:06 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing 
key=142:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_start_0 )
Aug 11 12:09:06 app-04 lrmd: [9940]: info: rsc:vm-build11:91: start
Aug 11 12:09:10 app-04 lrmd: [9940]: info: RA output: (vm-build11:start:stdout) 
Using config file "/mnt/xen/vm/build11".#012Started domain build11 (id=18)
Aug 11 12:09:11 app-04 crmd: [9943]: info: process_lrm_event: LRM operation 
vm-build11_start_0 (call=91, rc=0, cib-update=147, confirmed=true) ok
Aug 11 12:09:11 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing 
key=143:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_monitor_30000 )
Aug 11 12:10:11 app-04 lrmd: [9940]: info: rsc:vm-build11:93: monitor
Aug 11 12:10:11 app-04 crmd: [9943]: info: process_lrm_event: LRM operation 
vm-build11_monitor_30000 (call=93, rc=0, cib-update=149, confirmed=false) ok
====
Note the "Not active locally, migration failed!" error and subsequent 
stop/start.

Aside: why is the stop timeout 20 sec?  Looking at the Xen RA man page and 
source, it seems it should use the shutdown_timeout parameter, or if that 
is not set, 2/3 of the stop timeout.  I don't have shutdown_timeout set for 
this resource, and the stop timeout is 60, yet somehow it's getting 20 
instead of 40.
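For reference, here is my reading of how the stop timeout ought to be 
derived (a sketch only; the actual Xen RA code may differ, and the variable 
names are just the usual OCF conventions):

```shell
# Sketch of the stop-timeout derivation I expected; an assumption,
# not the actual Xen RA source.
OCF_RESKEY_shutdown_timeout=""           # not set for this resource
OCF_RESKEY_CRM_meta_timeout=60000        # op stop timeout="60" -> 60000 ms

if [ -n "$OCF_RESKEY_shutdown_timeout" ]; then
    timeout=$OCF_RESKEY_shutdown_timeout
else
    # 2/3 of the stop timeout, converted from ms to seconds
    timeout=$((OCF_RESKEY_CRM_meta_timeout / 1500))
fi

echo "$timeout"                          # 40 by this reading, yet the log says 20s
```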


With my patch in place, everything is fine after a couple of seconds:
(app-02 is the migration target in this case)
====
Aug 11 15:50:08 app-02 lrmd: [22253]: debug: on_msg_perform_op: add an 
operation operation migrate_from[40] on ocf::Xen::vm-build11 for client 22256, 
its parameters: CRM_meta_migrate_source=[app-04] 
CRM_meta_migrate_target=[app-02] CRM_meta_record_pending=[false] 
CRM_meta_timeout=[30000] crm_feature_set=[3.0.5] xmfile=[/mnt/xen/vm/build11]  
to the operation list.
Aug 11 15:50:08 app-02 Xen[24984]: [25055]: DEBUG: build11: Not yet active 
locally, waiting (timeout: 20s)
Aug 11 15:50:09 app-02 Xen[24984]: [25163]: INFO: build11: Active locally, 
migration successful
====

Incidentally, in a test where I stopped pacemaker on a node running 5 VMs, 
only one of them required this wait; the other four immediately returned 
"Active locally, migration successful".  It's a race condition mitigated by 
the wait loop, I guess.

Software is SLES 11 HAE SP1 + latest updates (Xen RA belongs to 
resource-agents-1.0.3-0.10.1).


Thanks,

Andrew Daugherity
Systems Analyst
Division of Research, Texas A&M University

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
