On 18.02.26 at 7:33 PM, Thomas Lamprecht wrote:
> On 18.02.26 at 16:45, Fiona Ebner wrote:
>> If the lock directory is not removed after failing because of a
>> signal, it won't be possible to acquire the lock anymore before the
>> 120 second timeout imposed on the lock by pmxcfs. This can easily
>> happen by a second, unrelated task in production and is quite
>> surprising. Install a signal handler that releases the lock if it was
>> already acquired. If an old handler is defined, it is invoked,
>> otherwise the signal is raised again. Just using 'die' would change
>> the execution flow compared to before the change.
>>
>> Signed-off-by: Fiona Ebner <[email protected]>
>> ---
>> src/PVE/Cluster.pm | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>>
>> diff --git a/src/PVE/Cluster.pm b/src/PVE/Cluster.pm
>> index bdb465f..7165d1c 100644
>> --- a/src/PVE/Cluster.pm
>> +++ b/src/PVE/Cluster.pm
>> @@ -615,6 +615,22 @@ my $cfs_lock = sub {
>>
>> my $is_code_err = 0;
>> eval {
>> + # catch signals to release the lock - further defer to old handler if one was set
>> + my $old_sig;
>> + $old_sig->{$_} = $SIG{$_} for qw(INT TERM QUIT HUP PIPE);
>
> really a non-issue in practice and basically the same thing under the hood,
> but this could probably just be a map, something like (untested):
>
> my $old_sig = { map { $_ => $SIG{$_} } qw(INT TERM QUIT HUP PIPE) };
Will do!
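
For reference, a minimal standalone sketch of the map form, with the inner
block closed before the list. The @signals array is only there for the
illustration; it is not part of the patch.

```perl
use strict;
use warnings;

# Signals cfs_lock() wants to catch (list taken from the patch).
my @signals = qw(INT TERM QUIT HUP PIPE);

# Snapshot the currently installed handlers in one expression.
# A signal without a handler is stored as undef.
my $old_sig = { map { $_ => $SIG{$_} } @signals };

printf "%s => %s\n", $_, $old_sig->{$_} // 'undef' for @signals;
```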
>> +
>> + local $SIG{INT} = local $SIG{TERM} = local $SIG{QUIT} = local $SIG{HUP} =
>> + local $SIG{PIPE} = sub {
>> + my $signame = $_[0];
>> + rmdir $filename if $got_lock; # if we held the lock always unlock again
>
> Could be nice to output a warning if above rmdir fails?
Good point! Will also add it to the original line I copied this from.
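
Roughly what I have in mind for the warning, as an untested sketch.
release_lock_dir() is a hypothetical helper used only for this illustration,
and the message wording is an assumption; in the patch the check would most
likely stay inline.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper, for illustration only: remove the lock directory and
# warn (rather than die) if that fails, so the caller can carry on.
sub release_lock_dir {
    my ($filename) = @_;
    return 1 if rmdir $filename;
    warn "unable to remove lock directory '$filename' - $!\n";
    return 0;
}

# Stand-in for the real lock directory below /etc/pve.
my $dir = tempdir(CLEANUP => 1);
my $lockdir = "$dir/lock";
mkdir $lockdir or die "mkdir failed - $!\n";

release_lock_dir($lockdir); # removes the directory
release_lock_dir($lockdir); # directory is already gone, so this only warns
```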
>> + if ($old_sig->{$signame}) {
>> + $old_sig->{$signame}->(@_);
>> + } else {
>> + $SIG{$signame} = 'DEFAULT';
>> + POSIX::raise($signame);
>
> hmm, this reads alright, but then I'm wondering if it should be added
> elsewhere? I found not a single "POSIX::raise" or "raise\(" instance in our
> Perl code inside the /usr/share/perl5/{PVE,Proxmox} directories on a recent
> PVE 9 system, but we have quite a few signal overrides, and while I did not
> check those, I seem to remember that some of them fall back to the handler
> defined by the calling site.
The only ones I found that do invoke the previous handler are in
PVE::Daemon. They also do not use raise, but terminate the server.
For some other ones it's most likely intentional to convert the signal
to a simple die. For example PVE::VZDump::QemuServer, where it makes
sense to just catch the signal and proceed with aborting the backup
rather than raise it again.
Compared to those, cfs_lock() is quite low in the call chains and there
are callers that just warn about an error from cfs_lock(). So while it
is essential to not convert a signal to a simple die in cfs_lock(), it
might not be for other current signal overrides.
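
To illustrate the difference, here is a small standalone sketch, not the
actual cfs_lock() code. run_child() is a hypothetical helper: it forks a
child that plays the role of a caller which only warns about errors, sends
itself SIGTERM, and reports how the child ended. With the die-only handler I
would expect the child to carry on and exit normally, while with the
re-raise handler it should be terminated by the signal.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run a warn-only caller in a child process with the
# given TERM handler and describe how the child terminated.
sub run_child {
    my ($handler) = @_;
    my $pid = fork() // die "fork failed - $!\n";
    if ($pid == 0) {
        local $SIG{TERM} = $handler;
        # Like some cfs_lock() callers: catch the error and only warn.
        eval { kill('TERM', $$); sleep(1); };
        warn "caller only warns: $@" if $@;
        POSIX::_exit(0); # only reached if the signal did not kill the child
    }
    waitpid($pid, 0);
    my $sig = $? & 127;
    return $sig ? "killed by signal $sig" : "exited with " . ($? >> 8);
}

# Variant 1: the signal is converted to a simple die.
my $die_only = sub { die "interrupted by signal\n" };

# Variant 2: reset to the default disposition and raise the signal again.
my $reraise = sub {
    my $signame = $_[0];
    $SIG{$signame} = 'DEFAULT';
    POSIX::raise($signame);
};

print "die only: ", run_child($die_only), "\n";
print "re-raise: ", run_child($reraise), "\n";
```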
> Describing how exactly the code flow changes would be nice in any case.
Do you mean expanding on the sentence mentioning "code flow" in the
commit message or something else?
>> + }
>> + die "interrupted by signal\n";
>> + };
>>
>> mkdir $lockdir;
>>