Nir Soffer has posted comments on this change.

Change subject: fencing: Introduce getHostLeaseStatus API
......................................................................

Patch Set 5:

From https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c#n937

/*
 * After the lockspace starts, there is a limited amount of
 * time that we've been watching the other hosts. This means
 * we can't make an accurate assessment of their state, because
 * the state is based on monitoring the hosts for host_fail_seconds
 * and host_dead_seconds, or seeing a renewal. When none of
 * those are true (not enough time monitoring and not seeing a
 * renewal), we return UNKNOWN.
 *
 * (Example number of seconds below are based on hosts using the
 * default 10 second io timeout.)
 *
 * * For hosts that are alive when we start, we return:
 *   UNKNOWN then LIVE
 *
 *   UNKNOWN would typically last for 10-20 seconds, but it's possible that
 *   UNKNOWN could persist for up to 80 seconds before LIVE is returned.
 *   LIVE is returned after we see the timestamp change once.
 *
 * * For hosts that are dead when we start, we'd return:
 *   UNKNOWN then FAIL then DEAD
 *
 *   UNKNOWN would last for 80 seconds before we return FAIL.
 *   FAIL would last for 60 more seconds before we return DEAD.
 *
 * * Hosts that are failing and don't recover would be the same as prev.
 *
 * * For hosts that are failing but recover, we'd return:
 *   UNKNOWN then FAIL then LIVE
 *
 * * For another host that is alive when we start,
 *   the sequence of values is:
 *
 * 0: we have not yet called check_other_leases()
 * first_check = 0, last_check = 0, last_live = 0
 *
 * other host renews its lease
 *
 * 10: we call check_other_leases() for the first time,
 * first_check = 10, last_check = 10, last_live = 10
 *
 * other host renews its lease
 *
 * 20: we call check_other_leases() for the second time,
 * first_check = 10, last_check = 20, last_live = 20
 *
 * At 10, we have not yet seen a renewal from the other host, i.e. we have
 * not seen its timestamp change (we only have one sample). The host could
 * be dead or alive, so we set the state to UNKNOWN. The way we know
 * that we have not yet observed the timestamp change is that
 * first_check == last_live, (10 == 10).
 *
 * At 20, we have seen a renewal, i.e. the timestamp changed between checks,
 * so we return LIVE.
 *
 * In the other case, if the host was actually dead, not alive, it would not
 * have renewed between 10 and 20. So at 20 we would continue to see
 * first_check == last_live, and would return UNKNOWN. If the host remains
 * dead, we'd continue to report UNKNOWN for the first 80 seconds.
 * After 80 seconds, we'd return FAIL. After 140 seconds we'd return DEAD.
 */

I think we need to take a couple of samples, starting when we suspect
that a host is not healthy, and finishing at least 80 seconds later.

Then, when we fence a host, we can look at the samples and apply various
policies. The simplest policy would be to avoid fencing if a host is
live on at least one domain after waiting 80 seconds or so.

Maybe we should move this discussion to the mailing list?

--
To view, visit http://gerrit.ovirt.org/28873
To unsubscribe, visit http://gerrit.ovirt.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iccd62e58a194aa0ceb0f5e2503b8ec7e4349971b
Gerrit-PatchSet: 5
Gerrit-Project: vdsm
Gerrit-Branch: master
Gerrit-Owner: Nir Soffer <[email protected]>
Gerrit-Reviewer: Allon Mureinik <[email protected]>
Gerrit-Reviewer: Barak Azulay <[email protected]>
Gerrit-Reviewer: Dan Kenigsberg <[email protected]>
Gerrit-Reviewer: Federico Simoncelli <[email protected]>
Gerrit-Reviewer: Itamar Heim <[email protected]>
Gerrit-Reviewer: Nir Soffer <[email protected]>
Gerrit-Reviewer: Piotr Kliczewski <[email protected]>
Gerrit-Reviewer: Saggi Mizrahi <[email protected]>
Gerrit-Reviewer: Xavi Francisco <[email protected]>
Gerrit-Reviewer: Yoav Kleinberger <[email protected]>
Gerrit-Reviewer: [email protected]
Gerrit-Reviewer: oVirt Jenkins CI Server
Gerrit-HasComments: No
_______________________________________________
vdsm-patches mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/vdsm-patches
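
To make the state sequence in the quoted sanlock comment easier to follow, here is a rough Python sketch of how the state could be derived from first_check and last_live. It is not sanlock's code: the function name host_state, the constant names, and the choice to measure the 80 and 140 second limits from last_live are assumptions for illustration, using the example numbers for the default 10 second io timeout. last_check is left out because the sketch takes the current time as an argument.

```python
# Illustrative sketch only; not taken from sanlock's lockspace.c.
UNKNOWN, LIVE, FAIL, DEAD = "UNKNOWN", "LIVE", "FAIL", "DEAD"

# Example values from the quoted comment (default 10 second io timeout).
HOST_FAIL_SECONDS = 80
HOST_DEAD_SECONDS = 140


def host_state(now, first_check, last_live):
    """Classify another host from the timestamps described in the comment.

    now         -- time of the current check
    first_check -- time we first looked at the host's lease (0 = never)
    last_live   -- last time we saw the host's timestamp change
    """
    if first_check == 0:
        # check_other_leases() has not run yet, nothing to judge.
        return UNKNOWN

    quiet = now - last_live  # seconds without an observed renewal
    if quiet >= HOST_DEAD_SECONDS:
        return DEAD
    if quiet >= HOST_FAIL_SECONDS:
        return FAIL
    if first_check == last_live:
        # Only one sample so far: the timestamp has never changed.
        return UNKNOWN
    return LIVE
```

With the timeline from the comment, host_state(10, 10, 10) gives UNKNOWN and host_state(20, 10, 20) gives LIVE; for a host that never renews, host_state(90, 10, 10) gives FAIL and host_state(150, 10, 10) gives DEAD.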

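
The "simplest policy" proposed in the comment could look roughly like the sketch below. Everything here is hypothetical: may_fence, the shape of the samples dictionary (domain name mapped to a list of (timestamp, state) pairs), and the decision to refuse fencing when a domain has no sample from after the waiting period are assumptions, not part of the proposed getHostLeaseStatus API.

```python
# Hypothetical policy sketch; names and data layout are assumptions.
SAMPLING_WINDOW = 80  # seconds, the "80 seconds or so" from the comment


def may_fence(samples, suspected_at, wait=SAMPLING_WINDOW):
    """Return True only if fencing looks safe under the simplest policy.

    samples      -- dict: domain name -> list of (timestamp, state) pairs
    suspected_at -- time we first suspected the host is not healthy
    wait         -- how long to keep sampling before deciding
    """
    final_states = []
    for domain_samples in samples.values():
        # Only samples taken after the waiting period count.
        late = [s for s in domain_samples if s[0] >= suspected_at + wait]
        if not late:
            # Assumption: without a late sample we cannot decide, so be
            # conservative and do not fence.
            return False
        latest = max(late, key=lambda s: s[0])
        final_states.append(latest[1])

    # Avoid fencing if the host is live on at least one domain.
    return bool(final_states) and "LIVE" not in final_states
```

For example, with samples = {"dom-a": [(100, "UNKNOWN"), (180, "FAIL")], "dom-b": [(100, "UNKNOWN"), (180, "LIVE")]} and suspected_at = 100, the host is still live on dom-b, so may_fence returns False. Other policies (for example requiring DEAD on every domain) would only change the last line.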