Am 23.02.26 um 11:56 AM schrieb Dominik Csapak: > There are two ways a cleanup can be triggered: > > * When a guest is stopped/shutdown via the API, 'vm_stop' calls > 'vm_stop_cleanup'. > * When the guest process disconnects from qmeventd, 'qm cleanup' is > called, which in turn also tries to call 'vm_stop_cleanup'. > > Both of these happen under a qemu config lock, so there is no direct > race condition that it will be called out of order, but it could happen > that the 'qm cleanup' call happened in addition so cleanup was called > twice. Which could be a problem when the shutdown was called with > 'keepActive' which 'qm cleanup' would simply know nothing of and ignore. > > Also the post-stop hook might not be triggered in case e.g. a stop-mode > backup was done, since that was only happening via qm cleanup and this > would detect the now again running guest and abort. > > To improve the situation we move the exec_hookscript call at the end > of vm_stop_cleanup. At this point we know the vm is stopped and we still > have the config lock. > > On _do_vm_stop (and in the one case for migration) a 'cleanup-flag' is > created that marks the vm is cleaned up by the api, so 'qm cleanup' > should not do it again. > > On vm start, this flag is cleared. > > There is still a tiny race possible: > > a guest is stopped from within (or crashes) and the vm is started again > via the api before 'qm cleanup' can run > > This should be a very rare case though, and all operation via the API > (reboot, shutdown+start, stop-mode backup, etc.) should work as intended.
How difficult is it to trigger the race with an HA-managed VM? > Signed-off-by: Dominik Csapak <[email protected]> > --- > I'm not sure how we could possibly eliminate the race i mentioned: > * we can't block on start because i don't think we can sensibly decide > between: > - vm was crashing/powered off > - vm was never started > > We could maybe leave a 'started' flag somewhere too and clear the > cleanup flag also in 'qm cleanup', then we would start the vm > only when the cleanup flag is cleared > (or have the cleanup flags have multiple states, like 'started', > 'finished') > > Though this has some regression potential i think... > > src/PVE/CLI/qm.pm | 6 +++--- > src/PVE/QemuServer.pm | 6 ++++++ > src/PVE/QemuServer/Helpers.pm | 23 +++++++++++++++++++++++ > src/test/MigrationTest/QmMock.pm | 9 +++++++++ > 4 files changed, 41 insertions(+), 3 deletions(-) > > diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm > index bdae9641..f31d2ee0 100755 > --- a/src/PVE/CLI/qm.pm > +++ b/src/PVE/CLI/qm.pm > @@ -1120,13 +1120,13 @@ __PACKAGE__->register_method({ > } > } > > - if (!$clean || $guest) { > + if (PVE::QemuServer::Helpers::cleanup_flag_exists($vmid)) { > + warn "Cleanup already done for $vmid, skipping...\n"; > + } else { > # vm was shutdown from inside the guest or crashed, > doing api cleanup > PVE::QemuServer::vm_stop_cleanup($storecfg, $vmid, > $conf, 0, 0, 1); > } > > - PVE::GuestHelpers::exec_hookscript($conf, $vmid, > 'post-stop'); > - > $restart = eval { > PVE::QemuServer::clear_reboot_request($vmid) }; > warn $@ if $@; > }, > diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm > index 545758dc..1253b601 100644 > --- a/src/PVE/QemuServer.pm > +++ b/src/PVE/QemuServer.pm > @@ -5439,6 +5439,8 @@ sub vm_start { > return PVE::QemuConfig->lock_config( > $vmid, > sub { > + PVE::QemuServer::Helpers::clear_cleanup_flag($vmid); > + > my $conf = PVE::QemuConfig->load_config($vmid, > $migrate_opts->{migratedfrom}); > > die "you can't start a vm if it's a template\n" > @@ -6158,6 +6160,7 @@ sub vm_stop_cleanup { > die $err if !$noerr; > warn $err; > } > + PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'post-stop'); > } > > # call only in locked context > @@ -6167,6 +6170,8 @@ sub _do_vm_stop { > my $pid = check_running($vmid, $nocheck); > return if !$pid; > > + PVE::QemuServer::Helpers::create_cleanup_flag($vmid); We don't do the cleanup if $nohcheck=1, so we may not create the flag then. > + > my $conf; > if (!$nocheck) { > $conf = PVE::QemuConfig->load_config($vmid); > @@ -6261,6 +6266,7 @@ sub vm_stop { > $force = 1 if !defined($force) && !$shutdown; > > if ($migratedfrom) { > + PVE::QemuServer::Helpers::create_cleanup_flag($vmid); > my $pid = check_running($vmid, $nocheck, $migratedfrom); > kill 15, $pid if $pid; > my $conf = PVE::QemuConfig->load_config($vmid, $migratedfrom); > diff --git a/src/PVE/QemuServer/Helpers.pm b/src/PVE/QemuServer/Helpers.pm > index 35c00754..79277dd6 100644 > --- a/src/PVE/QemuServer/Helpers.pm > +++ b/src/PVE/QemuServer/Helpers.pm > @@ -6,8 +6,10 @@ use warnings; > use File::stat; > use IO::File; > use JSON; > +use POSIX qw(); > > use PVE::Cluster; > +use PVE::File; > use PVE::INotify; > use PVE::ProcFSTools; > use PVE::Tools qw(get_host_arch); > @@ -363,4 +365,25 @@ sub get_iscsi_initiator_name { > return $initiator; > } > > +my sub get_cleanup_path { > + my ($vmid) = @_; > + return "/var/run/qemu-server/$vmid.cleanup"; > +} > + > +sub create_cleanup_flag { > + my ($vmid) = @_; > + PVE::File::file_set_contents(get_cleanup_path($vmid), time()); > +} > + > +sub clear_cleanup_flag { > + my ($vmid) = @_; > + my $path = get_cleanup_path($vmid); > + unlink $path or $! == POSIX::ENOENT or die "removing cleanup flag for > $vmid failed: $!\n"; > +} > + > +sub cleanup_flag_exists { > + my ($vmid) = @_; > + return -f get_cleanup_path($vmid); > +} > + Maybe those helpers are better placed in the RunState module rather than the generic Helper module? > 1; > diff --git a/src/test/MigrationTest/QmMock.pm > b/src/test/MigrationTest/QmMock.pm > index 78be47d3..7d514644 100644 > --- a/src/test/MigrationTest/QmMock.pm > +++ b/src/test/MigrationTest/QmMock.pm > @@ -75,6 +75,15 @@ $qemu_server_helpers_module->mock( > vm_running_locally => sub { > return $kvm_exectued; > }, > + create_cleanup_flag => sub { > + return undef; > + }, > + clear_cleanup_flag => sub { > + return undef; > + }, > + cleanup_flag_exists => sub { > + return ''; # what -f returns if a file does not exist > + }, > ); > > our $qemu_server_machine_module = > Test::MockModule->new("PVE::QemuServer::Machine");
