> On 24 May 2016, at 2:23 AM, Alex Lyakas <[email protected]> wrote:
> 
> Hello Andrew,
> 
> We have a system in the field running a MASTER-SLAVE resource on two nodes. 
> We are trying to upgrade Pacemaker on these two nodes. First we upgrade 
> the SLAVE node. Then we move the resource to be MASTER on the upgraded SLAVE 
> node (“crm node standby” on the old MASTER). This move involves cancelling a 
> monitor operation on the SLAVE node.
> 
> With commit
> https://github.com/ClusterLabs/pacemaker/commit/abcdaa8893d6071574986af6abc85ae558473735
> there is a change in how the “cancel” action is confirmed.
> 
> Previously, send_direct_ack was always used to confirm the cancel action. But 
> now, the cancel action is being confirmed not by direct ACK but by parsing 
> the XML.

Oh, and you’re mixing pacemaker versions.
I can see how that would be a problem.

Are you seeing this in the process of upgrading the entire cluster, or is the 
plan just to update one node?

> 
> So the new node receives the cancel action, but doesn’t call send_direct_ack. 
> As a result, the old node sends the cancel action and never receives the ACK:
> May 23 18:05:49 vsa-000001be-vc-0 crmd: [3089]: info: te_rsc_command: 
> Initiating action 4: cancel VAM:1_monitor_5000 on vsa-000001be-vc-1
> 
> Only after 3 minutes does it move forward, due to a timeout:
> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: WARN: action_timer_callback: 
> Timer popped (timeout=120000, abort_level=1000000, complete=false)
> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: ERROR: print_elem: Aborting 
> transition, action lost: [Action 4]: In-flight (id: VAM:1_monitor_5000, loc: 
> vsa-000001be-vc-1, priority: 0)
> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: info: abort_transition_graph: 
> action_timer_callback:512 - Triggered transition abort (complete=0) : Action 
> lost
> 
> However, the 3-minute timeout is unacceptable for our customers.
> 
> What would you recommend to fix this backward compatibility issue?

Unfortunately, you might need to resort to the detach + upgrade everything + 
reattach method of upgrading, as described here:

     
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/_disconnect_and_reattach.html
> 
> Only as a test, I called send_direct_ack in case “in_progress==TRUE” also. 
> This fixed the problem, as the older node received the needed ACK. But I 
> don’t know what this change might break.

It would probably be fine as a transition plan.
I.e. first do a rolling update to the patched version, then another to the 
unpatched version.
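
To make the reasoning behind that transition plan concrete, here is a toy model in C of the behaviour described in this thread. It is only a sketch: every name in it (the enum, `sends_direct_ack`, `old_node_outcome`, `cancel_from_old_node`) is invented for illustration and is not Pacemaker's actual code. It also assumes, based on the wording "in case in_progress==TRUE also", that the unpatched new version already sends the direct ACK when in_progress is FALSE.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are invented for illustration and are not
 * Pacemaker's real types or functions. */
enum old_node_outcome {
    OLD_NODE_PROCEEDS,        /* got the direct ACK it expects */
    OLD_NODE_WAITS_FOR_TIMER  /* no ACK: stalls until the action timer pops */
};

/* Does the upgraded node send a direct ACK for a cancel request?
 * Assumption (inferred from "in case in_progress==TRUE also"): the
 * unpatched new code already ACKs when in_progress is false. */
bool sends_direct_ack(bool in_progress, bool transitional_patch)
{
    if (!in_progress) {
        return true;
    }
    /* in_progress == true: only the transitional patch sends the ACK. */
    return transitional_patch;
}

/* How an old-version node, which only understands the direct ACK, reacts. */
enum old_node_outcome old_node_outcome(bool ack_received)
{
    return ack_received ? OLD_NODE_PROCEEDS : OLD_NODE_WAITS_FOR_TIMER;
}

/* Outcome of a cancel sent by an old node to an upgraded node. */
enum old_node_outcome cancel_from_old_node(bool in_progress,
                                           bool transitional_patch)
{
    return old_node_outcome(sends_direct_ack(in_progress, transitional_patch));
}

/* Self-check of the cases discussed in the thread. */
void check_model(void)
{
    /* Mixed versions, no patch: the old node is left waiting. */
    assert(cancel_from_old_node(true, false) == OLD_NODE_WAITS_FOR_TIMER);
    /* Mixed versions with the transitional patch: the ACK arrives. */
    assert(cancel_from_old_node(true, true) == OLD_NODE_PROCEEDS);
}
```

In this model the patch matters only while an old node is still waiting for direct ACKs. Once every node has been upgraded, nothing depends on the extra ACK any more, which is why a second rolling update to the unpatched version should be safe.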

> 
> Thanks,
> Alex.

_______________________________________________
Developers mailing list
[email protected]
http://clusterlabs.org/mailman/listinfo/developers
