Thanks Rob! Two main questions: 1) Do you have a description of the failure modes and how TM mitigates them?
2) Can you say more about using a GSLB to balance TR load between TMs? Will
a GSLB be a requirement in the immediate future for the updated TM?

Thanks,
Eric

On Thu, Jul 15, 2021 at 2:35 PM Robert O Butts <r...@apache.org> wrote:

> The current Traffic Monitor polls the entire CDN, which doesn't scale with
> large numbers of caches.
>
> We've long intended to make it scalable, and several of us have a design
> proposal. I'm putting it here just so people don't have to click an
> external link, but if anyone wants, I can also put this on the Apache
> wiki; just ask.
>
> We typically do designs with Blueprints in git, but this project is large
> enough that we felt the need to come up with a higher-level design first.
> Once we have consensus on the design, the intent is to make a Blueprint
> with more technical details, and then seek consensus on that. Design
> follows.
>
>
> # Implementation
> - Should not mandate or preclude a deployment environment
>   - Such as VMs or Containers
> - This design intentionally minimizes Traffic Ops/Traffic Portal changes,
>   to reduce development cost for the MVP. TO changes may be desirable, and
>   may be implemented in a future iteration.
>
> ## Stats
> The existing Traffic Monitor monitors both stats and health. The existing
> Monitor will have health polling disabled and will continue to be used for
> stat polling for the immediate future. Beyond that, this document does not
> address stat monitoring.
>
> Currently, it’s possible to use stats in the health calculation, but it’s
> unlikely anyone commonly does so.
>
> - Should not preclude adding an interface (JSON/HTTP) to send health
>   information (such as from stats) to be used in the health calculation
>   - For either cache or delivery service health
>   - This won’t be in the MVP, but is likely to be added in a future
>     version
>
> ## Health
> - Must get and use Snapshotted (CRConfig, monitoring.json) data
>   exclusively.
>   - Must not request un-snapshotted Traffic Ops endpoints.
>   - This is because health monitoring must match Router configuration, and
>     to prevent potentially multi-state operational changes from being
>     deployed before an operator intends them.
> - Monitors must get their own config and data from Traffic Ops without
>   intervention.
> - Deployment and running must be stateless (in terms of persistent state;
>   notwithstanding transient health state).
>   - Local configuration should be limited to Traffic Ops authentication.
> - Must use Traffic Ops as the Source of Truth for all data about what to
>   poll, but should be agnostic (i.e. handle generic JSON endpoints without
>   special Traffic Ops–specific logic).
> - Monitors will be divided into “Monitor Groups,” which will be
>   implemented as Traffic Control CacheGroups.
>   - CacheGroups will be used to minimize Traffic Ops (TO) and Traffic
>     Portal (TP) development in the MVP.
>   - These Monitor CacheGroups have no correlation with Cache CacheGroups.
>     TO CacheGroups are simply the grouping mechanism.
>   - Should not preclude a different grouping mechanism.
> - Monitors will directly poll the health of caches in their designated
>   CacheGroup(s), and will request the health of other CacheGroups from a
>   Monitor in each other Monitor Group.
> - Monitors will serve the health of all caches in the CDN.
>   - Both cache health directly polled and cache health received from other
>     Monitors.
> - Monitors must distinguish in their health endpoint (/CrStates) between
>   caches/CacheGroups they polled themselves and health received from other
>   Monitors (see the sketch after this list).
>   - The endpoint should be backwards-compatible with the existing
>     /CrStates endpoint used by existing Traffic Routers.
> - Cache health should be boolean, but must not preclude a more granular
>   score in the future
>   - For example, 0.0–1.0 or 0–100
>   - The current Traffic Monitor cache health is boolean
>   - Note this is a “must,” not a “should,” unlike most extensibility
>     requirements.
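> For illustration, a minimal sketch (in Go) of what a backwards-compatible
> /CrStates payload carrying that distinction might look like. The existing
> endpoint serves "isAvailable" per cache; the added "directlyPolled" field
> and its name are assumptions of this sketch, not a final schema.
>
> ```go
> package crstates
>
> // CrStates sketches a backwards-compatible /CrStates payload. Existing
> // Traffic Routers read only "isAvailable", so the added field is purely
> // additive and old clients simply ignore it.
> type CrStates struct {
>     Caches map[string]CacheState `json:"caches"`
> }
>
> // CacheState is the per-cache health entry.
> type CacheState struct {
>     IsAvailable bool `json:"isAvailable"`
>     // DirectlyPolled is hypothetical: true if this Monitor polled the
>     // cache itself, false if its health was received from a Monitor in
>     // another Monitor Group.
>     DirectlyPolled bool `json:"directlyPolled"`
> }
> ```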
> ## Peering
> - Monitors must poll all peers in their own Group, and use the health of
>   caches reported by those Monitors, combined with their own direct
>   polling results, to achieve consensus on the health of each cache.
>   - This is currently implemented in Traffic Monitor, with each Monitor
>     polling all Caches in the CDN.
>   - The consensus will be Optimistic: if any Monitor directly polling a
>     cache considers it healthy, then it’s considered healthy. This is the
>     current behavior of Traffic Monitor.
>   - Should not preclude other consensus algorithms.
> - Monitors must select an arbitrary Monitor in each other Monitor Group
>   each polling interval, to query for the health of CacheGroups not polled
>   directly. This selection must be deterministic and load-balanced.
>   - For example, deterministic via alphabetic hostname.
>   - For example, load-balanced by round-robin or consistent hash on the
>     deterministic selection order.
>   - Should not preclude alternate determinism or load-balance algorithms.
>   - The selected Monitor must be logged, for operational debugging.
> - The number of CacheGroups each Monitor polls must be the number of
>   CacheGroups in the CDN which contain a cache, divided by the number of
>   Monitor Groups.
>   - Note this means all Monitors in a Group poll the same Caches and
>     CacheGroups.
>   - Note this means the number of Monitors in a Group is the level of
>     consensus. For example, if you want to make sure at least 3 Monitors
>     poll every Cache, then every Monitor Group must have 3 Monitors.
> - The CacheGroups polled by each Monitor Group will be the nearest
>   geographically.
>   - Should not preclude an alternate nearness algorithm (for example,
>     network distance).
> - Note the polling rules above imply each Monitor must be aware of which
>   Caches are being polled by every other Monitor Group.
>   - For example, if there are 3 Monitor Groups and 10 CacheGroups with
>     Caches, then MonitorGroup1 must poll 3 CacheGroups, MonitorGroup2 must
>     poll 3 CacheGroups, and MonitorGroup3 must poll 4 CacheGroups.
> - For safety, each Monitor must inspect the Peer results from every other
>   Monitor Group, and if any CacheGroup is unpolled, poll that CacheGroup
>   itself.
>   - This safety should not execute under normal operation, but it may
>     execute during a normal upgrade of the Monitors if the peering
>     algorithm was changed.
>   - If this occurs, an error must be logged.
>   - The algorithm to decide to poll an unpolled CacheGroup should be:
>     `time_unpolled * log(nth_distance) > poll_interval` (sketched after
>     this section).
>   - Scaling the unpolled time by the logarithm of the distance rank
>     prevents “flapping,” and keeps every Monitor in the entire CDN from
>     polling a CacheGroup when any slight network blip occurs, while still
>     ensuring that any Monitor observing an unpolled CacheGroup will poll
>     it after some hard limit (like 10 seconds).
> - For safety, a Profile Parameter which overrides the CacheGroups polled
>   by a Monitor must be used if it exists.
>   - This is a safety to fix outages, for example if there is a bug in
>     deciding which CacheGroups to poll. It should not be used under normal
>     operation.
> - For safety, a Profile Parameter which overrides the number of
>   CacheGroups to poll (normally calculated automatically as the number of
>   CacheGroups divided by the number of Monitor Groups) must be used if it
>   exists.
>   - This is a safety, to fix outages in an emergency. It should not be
>     used under normal operation.
> - All failed polling must retry the selected target, and, if the target is
>   an arbitrary group member, an alternative target.
>   - Retry intervals should be a multiple of the poll interval by default.
>     For example, if the poll interval is 1 second, retries should not
>     exceed around 3 seconds before marking the target as failed and
>     stopping retrying.
>   - Retry intervals should be configurable.
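> To make the arithmetic above concrete, a minimal sketch (in Go) of the two
> calculations: dividing CacheGroups among Monitor Groups (with the
> remainder spread across Groups, as in the 3/3/4 example), and the damped
> decision to adopt an unpolled CacheGroup. The function names, the +1 shift
> inside the logarithm (so the nearest observer, rank 1, can still trigger),
> and the explicit hard-limit parameter are assumptions of this sketch, not
> settled design.
>
> ```go
> package peering
>
> import (
>     "math"
>     "time"
> )
>
> // cacheGroupsPerMonitorGroup divides the CacheGroups containing caches
> // among the Monitor Groups, spreading the remainder. For example,
> // 10 CacheGroups and 3 Monitor Groups yields [3 3 4].
> func cacheGroupsPerMonitorGroup(cacheGroups, monitorGroups int) []int {
>     counts := make([]int, monitorGroups)
>     for i := range counts {
>         counts[i] = cacheGroups / monitorGroups
>     }
>     // Hand the remainder to the last Groups, one CacheGroup each.
>     for i := 0; i < cacheGroups%monitorGroups; i++ {
>         counts[monitorGroups-1-i]++
>     }
>     return counts
> }
>
> // shouldAdoptUnpolled sketches the safety check
> // time_unpolled * log(nth_distance) > poll_interval.
> // nthDistance is this Monitor's geographic rank to the CacheGroup
> // (1 = nearest). The log damping makes far Monitors wait longer, so a
> // brief network blip doesn't send every Monitor in the CDN after one
> // CacheGroup, while the hard limit guarantees someone adopts it
> // eventually (the design suggests ~10 seconds).
> func shouldAdoptUnpolled(
>     timeUnpolled, pollInterval, hardLimit time.Duration,
>     nthDistance int,
> ) bool {
>     if timeUnpolled >= hardLimit {
>         return true
>     }
>     // The +1 is an assumption: with log(1) == 0, the nearest Monitor
>     // would never trigger on the formula alone.
>     return timeUnpolled.Seconds()*math.Log(float64(nthDistance+1)) >
>         pollInterval.Seconds()
> }
> ```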
> ## Extensibility
> The implementation should not preclude:
> - Using a GSLB for Monitors and clients (Routers) to talk to an arbitrary
>   Monitor in a Monitor Group.
>   - A GSLB provides some distinct advantages, but we need to support
>     non-GSLB deployments for small ATC operators. Therefore, it will be an
>     additional feature, which may be implemented in a future version.
> - Adding more intelligent logic around Optimism, and how consensus is
>   chosen when Monitors disagree.
> - Adding “far” polling, to determine whether a cache is reachable from far
>   away, and to report that cache as healthy or unhealthy to Traffic
>   Routers or clients far away.
>   - The present design implements only “near” polling, and does not verify
>     that a cache is reachable from clients far away. This is mostly OK,
>     because clients should mostly be requesting nearby Traffic Routers,
>     which should mostly be sending them to nearby caches. However,
>     far-away requests do happen regularly in Production, so we should
>     retain the ability to implement “far health” if necessary in the
>     future.
>   - Ideally, “near health” would be sent to nearby Traffic Routers or
>     clients, and “far health” to far clients, reflecting their own
>     distance from the Cache. But that granularity is expensive in both
>     developer time and performance.
> - Adding an interface to send health information (such as from stats) to
>   be used in the health calculation.
>
>
> Thoughts? Feedback? Votes?
>
> Once we have consensus on the list, we'll start on a git Blueprint with
> more technical details, and then also seek consensus on that.
>
> Normal Apache voting procedure is 72 hours, but this is large enough that
> I'd like to give it at least a week, just to give everyone enough time to
> think it through.
>
> So if you're interested in this project, please reply within a week with
> feedback, or let us know if you need more time to think it through and we
> can extend that time.
>
> Thanks,