Our cluster filesystem is offline for a short time while its package
(pve-cluster) gets updated.

If the LRM or CRM calls get_protected_ha_*_lock during such a time, it
runs into this check and dies. As a result we assume that we lost our
lock and change into the 'lost_agent_lock' state. Watchdog updates are
then stopped to allow self-fencing. This fencing is completely
unnecessary, so retry instead, up to 5 times. Only after that do we
assume that pmxcfs is dead and switch into the 'lost_agent_lock' state.

Signed-off-by: Thomas Lamprecht <t.lampre...@proxmox.com>
---
changes since v1:
* set $retry instead of a senseless return, as we are in an eval

 src/PVE/HA/Env/PVE2.pm | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index 6b8802e..3141e1e 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -227,7 +227,10 @@ sub get_pve_lock {
     mkdir $lockdir;
 
     # pve cluster filesystem not online
-    die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;
+    if (! -d $lockdir) {
+	$retry = 1;
+	die "can't create '$lockdir' (pmxcfs not mounted?)\n";
+    }
 
     if ($last && (($ctime - $last) < $retry_timeout)) {
 	# send cfs lock update request (utime)
-- 
2.1.4

_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
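For context, a minimal self-contained sketch of the caller-side retry idea described above: retry the lock attempt up to 5 times when pmxcfs is briefly unavailable, and only then conclude it is dead. All names here (get_pve_lock_mock, the reference-passed $retry flag, the sleep-free loop) are illustrative assumptions, not the actual Proxmox LRM/CRM code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simulate pmxcfs being offline for the first 3 attempts (e.g. during a
# pve-cluster package update), then back online.
my $attempts_until_mounted = 3;

# Stand-in for get_pve_lock(): while "unmounted" it sets the retry flag
# and dies, mirroring the behavior added by this patch.
sub get_pve_lock_mock {
    my ($retry_ref) = @_;
    if ($attempts_until_mounted > 0) {
        $attempts_until_mounted--;
        $$retry_ref = 1;
        die "can't create lockdir (pmxcfs not mounted?)\n";
    }
    return 1;    # lock acquired
}

my $got_lock = 0;
for my $try (1 .. 5) {
    my $retry = 0;
    eval { $got_lock = get_pve_lock_mock(\$retry); };
    if ($@) {
        # only retry the transient "pmxcfs not mounted" failure
        die $@ if !$retry;
        next;
    }
    last;
}

if ($got_lock) {
    print "lock acquired\n";
} else {
    # all 5 tries failed: assume pmxcfs is dead -> 'lost_agent_lock'
    print "lost_agent_lock\n";
}
```

With pmxcfs "returning" on the 4th attempt, the loop succeeds instead of fencing; only five consecutive failures would lead to the 'lost_agent_lock' state.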
If them LRM or CRM call the get_protected_ha_*_lock during such a time it run into this check and died. As a result we assumed that we lost our lock and change in the 'lost_agent_lock' state. Then the watchdog updates were stopped to allow selfencing. This fencing is completely unnecessary, so retry again instead, up to 5 times. After that we can assume that pmxcfs is dead and swicth in the 'lost_agent_lock' state. Signed-off-by: Thomas Lamprecht <t.lampre...@proxmox.com> --- changes since v1: * set $retry instead of a sensless return as we are in a eval... src/PVE/HA/Env/PVE2.pm | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm index 6b8802e..3141e1e 100644 --- a/src/PVE/HA/Env/PVE2.pm +++ b/src/PVE/HA/Env/PVE2.pm @@ -227,7 +227,10 @@ sub get_pve_lock { mkdir $lockdir; # pve cluster filesystem not online - die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir; + if (! -d $lockdir) { + $retry = 1; + die "can't create '$lockdir' (pmxcfs not mounted?)\n"; + } if ($last && (($ctime - $last) < $retry_timeout)) { # send cfs lock update request (utime) -- 2.1.4 _______________________________________________ pve-devel mailing list pve-devel@pve.proxmox.com http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel