On 11/03/20 09:04 -0500, Ken Gaillot wrote:
> On Wed, 2020-03-11 at 08:20 +0100, Ulrich Windl wrote:
>> You only have to take care not to compare CLOCK_MONOTONIC
>> timestamps between nodes or node restarts.
>
> Definitely :)
>
> They are used only to calculate action queue and run durations

Both these ... from an isolated perspective of a single node only. E.g., run durations relate to the one node currently responsible for acting upon the resource in some way (the "atomic" operation is always bound to a single host context, and when it is retried or logically followed by another operation, it is measured anew on the pertaining, perhaps different, node).

I feel that's a rather important detail, and just recently this surface received some slight scratching on the conceptual level...

The current inability to synchronize measurements of CLOCK_MONOTONIC-like notions of time amongst nodes in as lossless a way as possible is what I believe is the main show stopper for being able to accurately express the actual "availability score" for a given resource or resource group. (This especially concerns the transfer from an old, possibly failed DC to a new DC, likely involving some admitted loss of precision; mind you, a cluster is never fully synchronous, you'd need the help of specialized HW for that.) That score is that famous number, the holy grail of anyone taking HA seriously, while at the same time something the cluster stack currently cannot readily present to users (despite it having all or most of the relevant information, just piecewise).

IOW, this sort of non-localized measurement is what asks for emulation of a cluster-wide CLOCK_MONOTONIC-like measurement, which is not that trivial if you think about it. It is sort of a corollary of what Ulrich said, because emulating that pushes you exactly into these waters of relating CLOCK_MONOTONIC measurements from different nodes together.

Not to speak of evaluating whether any node is totally off in its own CLOCK_MONOTONIC measurements and hence should rather be fenced as "brain damaged", and perhaps even using the measurements of the nodes keeping up together to somehow calculate the average rate of measured time progress, so as to self-maintain time-bound cluster-wide integrity, which may just as well be important for sbd(!).
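
To illustrate the last point, here is a minimal sketch (not anything the cluster stack does today) of how per-node rates of time progress could be compared. It assumes each node reports two readings of its own CLOCK_MONOTONIC taken at the same two cluster-wide events, and it ignores message latency. The function names, the 1% tolerance, and the sample values are all hypothetical.

```python
from statistics import median


def clock_rates(samples):
    """Relative rate of time progress per node.

    samples maps a node name to (mono_start, mono_end): that node's own
    CLOCK_MONOTONIC readings at the two ends of a shared observation
    window. Only the differences are used, since absolute values depend
    on each node's boot time and are not comparable between nodes.
    """
    elapsed = {node: end - start for node, (start, end) in samples.items()}
    reference = median(elapsed.values())
    return {node: value / reference for node, value in elapsed.items()}


def suspicious_nodes(samples, tolerance=0.01):
    """Nodes whose elapsed time deviates from the median by more than
    the given fraction (hypothetical threshold)."""
    rates = clock_rates(samples)
    return sorted(node for node, rate in rates.items()
                  if abs(rate - 1.0) > tolerance)


# Made-up readings: node3 measured 660 s where its peers saw about 600 s.
samples = {
    "node1": (1000.0, 1600.0),
    "node2": (52.5, 652.4),
    "node3": (98765.0, 99425.0),
}

print(suspicious_nodes(samples))
```

Using the median rather than the mean keeps a single badly off node from dragging the reference along with it.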

(Nope, this doesn't get anywhere close to near-light-speed concerns, just imprecise HW and possibly implied and/or inter-VM differences.)

Perhaps the cheapest way out would be to use NTP-level algorithms to synchronize two CLOCK_MONOTONIC timers, between the worker node and the DC, at the point the worker node for the resource in question claims "resource stopped", so that the DC can synchronize again like that with a new worker node at the point in time when this new one claims "resource started". At that point, the DC would have rather accurate knowledge of how long this fail-/move-over, hence down-time, lasted, and would hence be able to reflect it in the "availability score" equations.

Hmm, no wonder that businesses with deep pockets and serious synchronicity requirements across the globe resort to using atomic clocks, incredibly precise CLOCK_MONOTONIC by default :-)

> For most resource types those are optional (for reporting only), but
> systemd resources require them (multiple status checks are usually
> necessary to verify a start or stop worked, and we need to check the
> remaining timeout each time).

Coincidentally, IIRC systemd alone strictly requires CLOCK_MONOTONIC (and we shall get a lot more strict as well to provide reasonable expectations to the users, as mentioned recently[*]), so said requirement is just a logical extension without corner cases.

[*] https://lists.clusterlabs.org/pipermail/users/2019-November/026647.html

-- 
Jan (Poki)
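
A minimal sketch of the NTP-style idea above, to show the arithmetic only. It assumes the DC does one request/reply exchange with each worker and uses the standard NTP offset and delay formulas; nothing here is existing Pacemaker code, and the function names and timestamps are made up for illustration.

```python
def ntp_offset(t0, t1, t2, t3):
    """Estimate the offset between two monotonic clocks, NTP style.

    t0, t3: request sent / reply received, read on the DC's clock
    t1, t2: request received / reply sent, read on the worker's clock
    Returns (offset, delay), where offset is the worker's clock minus
    the DC's clock and delay is the round-trip network time.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def to_dc_time(worker_timestamp, offset):
    """Translate a worker's monotonic timestamp onto the DC's clock."""
    return worker_timestamp - offset


# Old worker: its clock runs 5000 s ahead of the DC's (hypothetical values).
off_old, delay_old = ntp_offset(100.000, 5100.002, 5100.003, 100.005)
stopped_dc = to_dc_time(5100.000, off_old)   # "resource stopped" claim

# New worker: its clock is 80 s behind the DC's (hypothetical values).
off_new, delay_new = ntp_offset(130.000, 50.001, 50.002, 130.003)
started_dc = to_dc_time(49.500, off_new)     # "resource started" claim

# Both events are now on the DC's clock, so the difference is meaningful.
downtime = started_dc - stopped_dc
print(f"estimated downtime: {downtime:.3f} s")
```

The estimate is only as good as the assumption of symmetric network latency; the error in each offset is bounded by half the measured delay, which is one reason to do the exchange as close to the "stopped"/"started" events as possible.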