Our cluster filesystem is offline for a short time while its package
(pve-cluster) gets updated.

If the LRM or CRM calls get_protected_ha_*_lock during such a time,
it runs into this check and dies.
As a result we assume that we lost our lock and change into the
'lost_agent_lock' state. Then the watchdog updates are stopped to
allow self-fencing.

This fencing is completely unnecessary, so retry instead, up to
5 times. After that we can assume that pmxcfs is dead and switch into
the 'lost_agent_lock' state.

Signed-off-by: Thomas Lamprecht <t.lampre...@proxmox.com>
---

changes since v1:
* set $retry instead of a senseless return, as we are in an eval...
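
For illustration only, here is a minimal standalone sketch of the
pattern this patch relies on: a 'return' inside an eval block only
leaves the eval, so the failure is flagged through $retry before
dying. The names try_lock_once and lock_with_retries, the $dir_exists
callback and the example path are hypothetical and not part of
PVE2.pm. The limit of 5 tries comes from the commit message; where
the real code counts its retries is not shown in this patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a single lock attempt; not the real get_pve_lock.
# $dir_exists is a code ref that plays the role of the '-d $lockdir' test.
sub try_lock_once {
    my ($lockdir, $dir_exists) = @_;

    my $retry = 0;

    eval {
        if (!$dir_exists->($lockdir)) {
            # A plain 'return' here would only leave the eval block,
            # so mark the failure as retryable before dying.
            $retry = 1;
            die "can't create '$lockdir' (pmxcfs not mounted?)\n";
        }
    };
    if (my $err = $@) {
        return $retry ? 'retry' : 'lost';
    }
    return 'locked';
}

# Hypothetical caller: give up after $max_tries retryable failures.
sub lock_with_retries {
    my ($lockdir, $dir_exists, $max_tries) = @_;

    for my $try (1 .. $max_tries) {
        my $res = try_lock_once($lockdir, $dir_exists);
        return $res if $res ne 'retry';
    }
    return 'lost';    # assume the cluster filesystem is dead
}

# Simulate the filesystem being offline for the first two checks.
my $calls = 0;
my $flaky = sub { return ++$calls > 2 };

my $result = lock_with_retries('/tmp/example-lockdir', $flaky, 5);
print "$result after $calls checks\n";    # prints "locked after 3 checks"
```

With a callback that never succeeds, lock_with_retries returns 'lost'
after the fifth try, which corresponds to switching into the
'lost_agent_lock' state.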


 src/PVE/HA/Env/PVE2.pm | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index 6b8802e..3141e1e 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -227,7 +227,10 @@ sub get_pve_lock {
        mkdir $lockdir;
 
        # pve cluster filesystem not online
-       die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;
+       if (! -d $lockdir) {
+           $retry = 1;
+           die "can't create '$lockdir' (pmxcfs not mounted?)\n";
+       }
 
        if ($last && (($ctime - $last) < $retry_timeout)) {
             # send cfs lock update request (utime)
-- 
2.1.4


_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
