Hello Andrew,
Thank you for your response.
We have a two-node cluster, and we need to upgrade pacemaker on both nodes.
We ended up applying the patch [1] locally; it sends an explicit ACK for the
cancel when the requester's CRM feature set matches the older pacemaker version.
Thanks,
Alex.
[1]
--- a/pacemaker/pacemaker-1.1.13/crmd/lrm.c
+++ b/pacemaker/pacemaker-1.1.13/crmd/lrm.c
@@ -1541,20 +1541,44 @@ do_lrm_invoke(long long action,
             op->rc = PCMK_OCF_OK;
             op->op_status = PCMK_LRM_OP_DONE;
             send_direct_ack(from_host, from_sys, rsc, op, rsc->id);
             lrmd_free_event(op);

             /* needed?? surely not otherwise the cancel_op_(_key) wouldn't
              * have failed in the first place
              */
             g_hash_table_remove(lrm_state->pending_ops, op_key);
         }
+        else {
+            const char *feature_set = NULL;
+            gboolean need_direct_ack = FALSE;
+
+            /*
+             * For upgrading from older versions, we need to send an explicit ACK.
+             * See:
+             * https://github.com/ClusterLabs/pacemaker/commit/abcdaa8893d6071574986af6abc85ae558473735
+             * http://clusterlabs.org/pipermail/developers/2016-May/000219.html
+             */
+            feature_set = crm_element_value(params, XML_ATTR_CRM_VERSION);
+            need_direct_ack = safe_str_eq(feature_set, "3.0.5");
+            crm_notice("PE requested op %s (call=%s) be cancelled in_progress==TRUE feature_set=%s need_direct_ack=%d",
+                       op_key, call_id ? call_id : "NA", feature_set, need_direct_ack);
+            if (need_direct_ack) {
+                lrmd_event_data_t *op = construct_op(lrm_state, input->xml, rsc->id, op_task);
+
+                CRM_ASSERT(op != NULL);
+                op->rc = PCMK_OCF_OK;
+                op->op_status = PCMK_LRM_OP_DONE;
+                send_direct_ack(from_host, from_sys, rsc, op, rsc->id);
+                lrmd_free_event(op);
+            }
+        }
         free(op_key);

     } else if (rsc != NULL && safe_str_eq(operation, CRMD_ACTION_DELETE)) {
         gboolean unregister = TRUE;

 #if ENABLE_ACL
         int cib_rc = delete_rsc_status(lrm_state, rsc->id, cib_dryrun | cib_sync_call, user_name);

         if (cib_rc != pcmk_ok) {
             lrmd_event_data_t *op = NULL;
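A note on the check above: safe_str_eq(feature_set, "3.0.5") only matches that one
feature set. If the older nodes advertise a different (or more than one) feature set,
the check could be generalised with pacemaker's compare_version() helper. A minimal
sketch, assuming "3.0.6" as the cutoff below which the direct ACK is still needed
(the cutoff and the helper name peer_needs_direct_ack are illustrative assumptions,
not part of the patch we applied):

#include <glib.h>               /* gboolean, TRUE */
#include <crm/common/util.h>    /* compare_version() */

/* Sketch only: send the explicit ACK to any peer whose advertised CRM
 * feature set is older than the assumed "3.0.6" cutoff, instead of
 * matching a single version string exactly.
 */
static gboolean
peer_needs_direct_ack(const char *feature_set)
{
    if (feature_set == NULL) {
        /* No feature set in the request: be conservative and send the ACK. */
        return TRUE;
    }
    return compare_version(feature_set, "3.0.6") < 0;
}

With that, need_direct_ack = safe_str_eq(feature_set, "3.0.5"); in the patch would
become need_direct_ack = peer_needs_direct_ack(feature_set);.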
From: Andrew Beekhof
Sent: Friday, June 03, 2016 3:36 AM
To: Alex Lyakas
Cc: [email protected] ; Yair Hershko ; Shyam Kaushik ; Yaron Presente
; Lev Vainblat
Subject: Re: commit abcdaa8 breaks compatibility with older pacemaker
On 24 May 2016, at 2:23 AM, Alex Lyakas <[email protected]> wrote:
Hello Andrew,
We have a system in the field running a MASTER-SLAVE resource on two nodes.
We are trying to upgrade the pacemaker on these two nodes. First we upgrade the
SLAVE node. Then we move the resource to be MASTER on the upgraded SLAVE node
(“crm node standby” on the old MASTER). This move involves cancelling a monitor
operation on the SLAVE node.
With commit
https://github.com/ClusterLabs/pacemaker/commit/abcdaa8893d6071574986af6abc85ae558473735
there is a change in how the “cancel” action is confirmed.
Previously, send_direct_ack was always used to confirm the cancel action. Now
the cancel action is confirmed not by a direct ACK but by parsing the XML.
Oh, and you’re mixing pacemaker versions.
I can see how that would be a problem.
Are you seeing this in the process of upgrading the entire cluster, or is the
plan just to update one node?
So the new node receives the cancel action, but doesn’t call send_direct_ack.
As a result on the old node, it sends the cancel action:
May 23 18:05:49 vsa-000001be-vc-0 crmd: [3089]: info: te_rsc_command:
Initiating action 4: cancel VAM:1_monitor_5000 on vsa-000001be-vc-1
And after 3 minutes only it moves forward due to timeout
May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: WARN: action_timer_callback:
Timer popped (timeout=120000, abort_level=1000000, complete=false)
May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: ERROR: print_elem: Aborting
transition, action lost: [Action 4]: In-flight (id: VAM:1_monitor_5000, loc:
vsa-000001be-vc-1, priority: 0)
May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: info: abort_transition_graph:
action_timer_callback:512 - Triggered transition abort (complete=0) : Action
lost
However, the 3-minute timeout is unacceptable for our customers.
What would you recommend to fix this backward compatibility issue?
Unfortunately, you might need to resort to the detach+upgrade
everything+reattach method of upgrading as described here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/_disconnect_and_reattach.html
As a test only, I called send_direct_ack in the “in_progress==TRUE” case as well.
This fixed the problem, as the older node received the needed ACK. But I don’t
know what this change might break.
It would probably be fine as a transition plan.
I.e., first do a rolling update to the patched version, then another to the
unpatched version.
Thanks,
Alex.
_______________________________________________
Developers mailing list
[email protected]
http://clusterlabs.org/mailman/listinfo/developers