If the lock directory is not removed when the process holding it is
interrupted by a signal, the lock cannot be acquired again until the
120-second timeout that pmxcfs imposes on the lock has expired. In
production, this can easily hit a second, unrelated task and is quite
surprising.

Install a signal handler that releases the lock if it was already
acquired. If a previous handler was set, it is invoked; otherwise, the
default disposition is restored and the signal is raised again. Just
using 'die' would change the execution flow compared to before the
change.

Signed-off-by: Fiona Ebner <[email protected]>
---
 src/PVE/Cluster.pm | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/PVE/Cluster.pm b/src/PVE/Cluster.pm
index bdb465f..7165d1c 100644
--- a/src/PVE/Cluster.pm
+++ b/src/PVE/Cluster.pm
@@ -615,6 +615,22 @@ my $cfs_lock = sub {
     my $is_code_err = 0;
     eval {
+        # catch signals to release the lock - further defer to old handler if one was set
+        my $old_sig;
+        $old_sig->{$_} = $SIG{$_} for qw(INT TERM QUIT HUP PIPE);
+
+        local $SIG{INT} = local $SIG{TERM} = local $SIG{QUIT} = local $SIG{HUP} =
+            local $SIG{PIPE} = sub {
+                my $signame = $_[0];
+                rmdir $filename if $got_lock; # if we held the lock always unlock again
+                if ($old_sig->{$signame}) {
+                    $old_sig->{$signame}->(@_);
+                } else {
+                    $SIG{$signame} = 'DEFAULT';
+                    POSIX::raise($signame);
+                }
+                die "interrupted by signal\n";
+            };
         mkdir $lockdir;
-- 
2.47.3
