> On 24 May 2016, at 2:23 AM, Alex Lyakas <[email protected]> wrote:
>
> Hello Andrew,
>
> We have a system in the field running a MASTER-SLAVE resource on two nodes.
> We are trying to upgrade the pacemaker on these two nodes. First we upgrade
> the SLAVE node. Then we move the resource to be MASTER on the upgraded SLAVE
> node (“crm node standby” on the old MASTER). This move involves cancelling a
> monitor operation on the SLAVE node.
>
> With commit
> https://github.com/ClusterLabs/pacemaker/commit/abcdaa8893d6071574986af6abc85ae558473735
> there is a change in how the “cancel” action is confirmed.
>
> Previously, send_direct_ack was always used to confirm the cancel action. But
> now, the cancel action is being confirmed not by direct ACK but by parsing
> the XML.
Oh, and you’re mixing pacemaker versions.
I can see how that would be a problem.
Are you seeing this in the process of upgrading the entire cluster, or is the plan
just to update one node?
>
> So the new node receives the cancel action, but doesn’t call send_direct_ack.
> As a result on the old node, it sends the cancel action:
> May 23 18:05:49 vsa-000001be-vc-0 crmd: [3089]: info: te_rsc_command:
> Initiating action 4: cancel VAM:1_monitor_5000 on vsa-000001be-vc-1
>
> And only after 3 minutes does it move forward, due to the timeout:
> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: WARN: action_timer_callback:
> Timer popped (timeout=120000, abort_level=1000000, complete=false)
> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: ERROR: print_elem: Aborting
> transition, action lost: [Action 4]: In-flight (id: VAM:1_monitor_5000, loc:
> vsa-000001be-vc-1, priority: 0)
> May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: info: abort_transition_graph:
> action_timer_callback:512 - Triggered transition abort (complete=0) : Action
> lost
>
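For reference, the length of the stall can be read straight off the two quoted log lines. A minimal Python sketch, assuming standard syslog timestamps and the year 2016 (taken from the message date, since syslog lines carry no year); the helper names syslog_time and stall_seconds are hypothetical:

```python
from datetime import datetime

# The two log lines quoted above, joined back onto single lines.
LOG_LINES = [
    "May 23 18:05:49 vsa-000001be-vc-0 crmd: [3089]: info: te_rsc_command: "
    "Initiating action 4: cancel VAM:1_monitor_5000 on vsa-000001be-vc-1",
    "May 23 18:08:49 vsa-000001be-vc-0 crmd: [3089]: WARN: action_timer_callback: "
    "Timer popped (timeout=120000, abort_level=1000000, complete=false)",
]


def syslog_time(line, year=2016):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is an assumption."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def stall_seconds(start_line, end_line):
    """Seconds elapsed between two syslog lines."""
    return (syslog_time(end_line) - syslog_time(start_line)).total_seconds()


stall = stall_seconds(LOG_LINES[0], LOG_LINES[1])
print(stall)  # 180.0
```

The gap is 180 seconds, which is longer than the timeout=120000 (milliseconds) printed by action_timer_callback. The sketch only measures the gap and does not explain the difference.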
> However, the 3-minute timeout is unacceptable for our customers.
>
> What would you recommend to fix this backward compatibility issue?
Unfortunately, you might need to resort to the detach+upgrade
everything+reattach method of upgrading as described here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Pacemaker_Explained/_disconnect_and_reattach.html
>
> Only as a test, I called send_direct_ack in case “in_progress==TRUE” also.
> This fixed the problem, as the older node received the needed ACK. But I
> don’t know what this change might break.
It would probably be fine as a transition plan.
I.e., first do a rolling update to the patched version, then another to the
unpatched version.
>
> Thanks,
> Alex.
>
_______________________________________________
Developers mailing list
[email protected]
http://clusterlabs.org/mailman/listinfo/developers