Description of the problem. Imagine the following: we get the CRM command to migrate 'vm:100' from A to B. When the migration fails, the CRM would normally place the service in the started state on the source node A once it processes our result. But if the CRM has not processed our result before we start a new 'manage_resources' round (we do that roughly every 5 seconds), the LRM may start another migration try without the CRM knowing anything about it. Worse, the CRM may process the result of the failed migration try at the same time and place the service in started on node A, while the LRM has now successfully migrated the service to B with the second (hidden) try. Now the state is out of sync:
CRM has the service marked as started on node A, but it runs on node B. We (currently) have no way to fix up a wrong node location of a _running_ service, thus the LRM on node A errors out with EWRONG_NODE and the CRM places the service in the error state.

To fix that, we _never_ execute two exactly identical migrate commands directly after each other. 'Exactly identical' means here that the SID and the target are the same while the command is either 'migrate' or 'relocate'.

Signed-off-by: Thomas Lamprecht <t.lampre...@proxmox.com>
---

changes since v1:
* check for migrate and relocate; relocate is normally less error prone (simply because it is more likely to succeed) but dangerous nonetheless
* improved commit message and comment in code

 src/PVE/HA/LRM.pm | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index f53f26d..e81ae3e 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -457,6 +457,15 @@ sub queue_resource_command {
 
     if (my $w = $self->{workers}->{$sid}) {
         return if $w->{pid}; # already started
+        if (($state eq 'migrate' || $state eq 'relocate') &&
+            $w->{state} eq $state && $w->{target} eq $target) {
+            # ignore two identical migration/relocation tries directly after each
+            # other as this means that the CRM didn't got (and processed) our
+            # result yet and a second migration try can be dangerous (EWRONG_NODE)!
+            $self->{haenv}->log('notice', "Ignore second identical migration call," .
+                                " CRM didn't processed our last result yet.");
+            return;
+        }
         # else, delete and overwrite queue entry with new command
         delete $self->{workers}->{$sid};
     }
-- 
2.1.4
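
For reviewers who want to try the guard in isolation, here is a minimal standalone sketch of the same check. It is not the real PVE::HA::LRM code: the `queue_command` helper, the `$workers` hash, the `@log` array and the string return values are all hypothetical stand-ins. It also assumes that a finished or not yet started try leaves its queue entry in place without a `pid`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for $self->{workers}; not the real LRM state.
my $workers = {};
my @log;

# Returns 'queued', 'running' or 'ignored' (illustrative values only;
# the patched function simply returns).
sub queue_command {
    my ($sid, $state, $target) = @_;

    if (my $w = $workers->{$sid}) {
        return 'running' if $w->{pid};    # worker already started

        # Same guard as in the patch: same command, same SID, same target.
        # The defined() checks are an addition of this sketch to avoid
        # warnings when no target is set.
        if (($state eq 'migrate' || $state eq 'relocate')
            && $w->{state} eq $state
            && defined($w->{target}) && defined($target)
            && $w->{target} eq $target) {
            push @log, "ignored second identical $state of $sid to $target";
            return 'ignored';
        }

        # else, delete and overwrite queue entry with new command
        delete $workers->{$sid};
    }

    $workers->{$sid} = { sid => $sid, state => $state, target => $target };
    return 'queued';
}

print queue_command('vm:100', 'migrate', 'B'), "\n";   # first try is queued
print queue_command('vm:100', 'migrate', 'B'), "\n";   # identical retry is ignored
print queue_command('vm:100', 'migrate', 'C'), "\n";   # different target overwrites
print queue_command('vm:100', 'started', undef), "\n"; # other commands overwrite
```

Only the exact repeat is dropped. A request with a different target or a different command still replaces the queued entry, as before the patch.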