Francesco Romani has submitted this change and it was merged.

Change subject: safelease: Increase spmprotect timeouts
......................................................................

safelease: Increase spmprotect timeouts

When spmprotect.sh fails to renew the lease, it starts a fencing
process:

1. Send SIGUSR1 signal to vdsm
2. Send SIGTERM signal to vdsm after 7 seconds
3. Send SIGKILL signal to vdsm after 9 seconds
4. Reboot the machine after 20 seconds

When vdsm receives the SIGUSR1 signal, the signal handler invokes
stopSpm, which releases the cluster lock. Releasing the cluster lock
terminates the waiting spmprotect.sh, preventing termination of vdsm
and reboot.

If vdsm fails to release the cluster lock within 7 seconds,
spmprotect.sh will terminate it, and if it does not terminate,
spmprotect.sh will kill it.

When systemd starts vdsm again, vdsm looks for spmprotect.sh processes
and tries to release the lease. If the lease cannot be released after
10 seconds, it kills the pending spmprotect.sh processes, preventing
reboot.

Testing with both block and file storage shows that this flow is broken
when access to the master domain is blocked:

1. In block storage, vdsm gets stuck trying to unmount the master
   mount, and spmprotect.sh kills it before it tries to release the
   cluster lock.
2. In file storage, vdsm gets stuck trying to write spm status to the
   master domain, and spmprotect.sh kills it before it tries to release
   the cluster lock.
3. When vdsm starts up, sometimes it manages to kill the waiting
   spmprotect.sh process, and sometimes spmprotect.sh reboots the
   machine before vdsm kills it.

We cannot fix 1 and 2 easily. 3 can be fixed by giving vdsm more time
for the stopSpm flow, and more time to start up and kill the pending
spmprotect.sh process.

This patch increases spmprotect timeouts to increase the chance of a
clean shutdown and decrease the chance of an unneeded reboot.

New spmprotect.sh flow is:

1. Send SIGUSR1 signal to vdsm
2. Send SIGTERM signal to vdsm after 10 seconds
3. Send SIGKILL signal to vdsm after 20 seconds
4. Reboot the machine after 60 seconds

Change-Id: Ib71fa06c21602fd9d43516c5b4c997c481708697
Bug-Url: https://bugzilla.redhat.com/1222564
Signed-off-by: Nir Soffer <[email protected]>
Reviewed-on: https://gerrit.ovirt.org/46057
Continuous-Integration: Jenkins CI
Reviewed-by: Adam Litke <[email protected]>
Reviewed-on: https://gerrit.ovirt.org/46332
Reviewed-by: Allon Mureinik <[email protected]>
Reviewed-by: Francesco Romani <[email protected]>
---
M vdsm/storage/protect/spmprotect.sh.in
1 file changed, 3 insertions(+), 3 deletions(-)

Approvals:
  Nir Soffer: Verified
  Jenkins CI: Passed CI tests
  Allon Mureinik: Looks good to me, but someone else must approve
  Francesco Romani: Looks good to me, approved

--
To view, visit https://gerrit.ovirt.org/46332
To unsubscribe, visit https://gerrit.ovirt.org/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ib71fa06c21602fd9d43516c5b4c997c481708697
Gerrit-PatchSet: 3
Gerrit-Project: vdsm
Gerrit-Branch: ovirt-3.6
Gerrit-Owner: Nir Soffer <[email protected]>
Gerrit-Reviewer: Adam Litke <[email protected]>
Gerrit-Reviewer: Allon Mureinik <[email protected]>
Gerrit-Reviewer: Dan Kenigsberg <[email protected]>
Gerrit-Reviewer: Francesco Romani <[email protected]>
Gerrit-Reviewer: Jenkins CI
Gerrit-Reviewer: Nir Soffer <[email protected]>
Gerrit-Reviewer: Yaniv Bronhaim <[email protected]>
Gerrit-Reviewer: [email protected]
_______________________________________________
vdsm-patches mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/vdsm-patches
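
The old and new fencing timelines from the commit message can be
compared with a small sketch. This is not code from spmprotect.sh.in:
the names OLD_SCHEDULE, NEW_SCHEDULE, actions_fired and window are
hypothetical, and the sketch assumes every delay is counted from the
moment lease renewal fails, which the message does not state
explicitly.

```python
# Fencing schedules as (seconds after failed lease renewal, action).
# The delays are the ones listed in the commit message. Treating them
# as offsets from the start of fencing is an assumption, and these
# names are illustrative, not taken from spmprotect.sh.in.
OLD_SCHEDULE = [
    (0, "SIGUSR1"),
    (7, "SIGTERM"),
    (9, "SIGKILL"),
    (20, "reboot"),
]
NEW_SCHEDULE = [
    (0, "SIGUSR1"),
    (10, "SIGTERM"),
    (20, "SIGKILL"),
    (60, "reboot"),
]


def actions_fired(schedule, elapsed):
    """Return the actions already taken `elapsed` seconds into fencing."""
    return [action for delay, action in schedule if delay <= elapsed]


def window(schedule, start, end):
    """Return the number of seconds between two actions of a schedule."""
    delays = {action: delay for delay, action in schedule}
    return delays[end] - delays[start]


if __name__ == "__main__":
    for name, schedule in (("old", OLD_SCHEDULE), ("new", NEW_SCHEDULE)):
        print(name, "stopSpm window:", window(schedule, "SIGUSR1", "SIGTERM"))
        print(name, "restart window:", window(schedule, "SIGKILL", "reboot"))
```

Under that assumption, the time between SIGKILL and reboot grows from
11 to 40 seconds. Since vdsm waits up to 10 seconds for the lease to be
released before killing the pending spmprotect.sh processes, the old
schedule would have left very little time for systemd to restart vdsm,
which is consistent with problem 3 above.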
