> 1) Do you have a description of the failure modes and how TM mitigates
> those?

Yes, every Monitor will check that some other Monitor polled every
Cache/CG, and if any didn't get polled, poll it itself (after waiting
poll_time*log(distance), to prevent a thundering herd). So if there's a bug
in the algorithm, or the algorithm is changed during an upgrade, or
anything else goes wrong, that check guarantees all CGs get polled.
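
To make that concrete, here's a minimal Go sketch of that check, assuming the
`time_unpolled * log(nth_distance) > poll_interval` formula from the doc
below (the function and parameter names are just illustrative, not part of
the design):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// shouldFailsafePoll sketches the `time_unpolled * log(nth_distance) >
// poll_interval` check. timeUnpolled is how long the CacheGroup has gone
// unpolled by anyone, nthDistance is this Monitor's distance rank to that
// CacheGroup (assumed here to be 2 or more for a Monitor that isn't already
// responsible for it), and pollInterval is the normal polling interval.
// Nearer Monitors trip the check sooner, so the whole CDN doesn't pile onto
// the CacheGroup at once when a single poll is missed.
func shouldFailsafePoll(timeUnpolled time.Duration, nthDistance int, pollInterval time.Duration) bool {
	return timeUnpolled.Seconds()*math.Log(float64(nthDistance)) > pollInterval.Seconds()
}

func main() {
	// Example: a CacheGroup unpolled for 8s, observed by the 3rd-nearest
	// Monitor Group, with a 5s poll interval.
	fmt.Println(shouldFailsafePoll(8*time.Second, 3, 5*time.Second)) // true: 8*log(3) ≈ 8.8 > 5
}
```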

The doc also requires override Parameters, so Operators can forcibly set
the CGs to be polled in an emergency outage scenario.
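
Purely as an illustration of what such an override might look like (the doc
doesn't define the Parameter names; everything below is hypothetical):

```go
package main

import "fmt"

// Parameter mirrors the general shape of a Traffic Ops Profile Parameter.
// The design requires override Parameters but doesn't name them, so the
// name, config file, and value below are hypothetical illustrations only.
type Parameter struct {
	Name       string
	ConfigFile string
	Value      string
}

func main() {
	override := Parameter{
		Name:       "health.polling.cachegroups.override", // hypothetical
		ConfigFile: "traffic_monitor.properties",          // hypothetical
		Value:      "cachegroup-east-1,cachegroup-west-2", // hypothetical CacheGroup names
	}
	fmt.Printf("Monitor would poll only: %s\n", override.Value)
}
```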

> 2) Can you say more about using a GSLB to balance TR load between TMs?
> Will a GSLB be a requirement in the immediate future for the updated TM?

We had some discussions about that. The plan this doc puts forward is _not_
to implement a GSLB, at least for the first "Minimum Viable Product"
iteration, and there are no plans to ever mandate it.

There are some pretty big benefits we could get from a GSLB. For example, TR
could always just hit 'health.cachegroup.cdn.example.net' and never need to
know individual Monitor hosts. Likewise, Monitors could use the same hostname
when polling peer Monitors in other Groups. The GSLB would then handle
load-balancing the requests (round-robin, consistent-hash, etc.).

But we realize many ATC operators have small CDNs and don't want or need to
run a GSLB. So that'll have to be an additional feature, and we'll always
need to be able to work without a GSLB. I think it's a good future feature
that could be its own project, but we don't intend to do anything around
GSLB in this project.


On Mon, Jul 19, 2021 at 1:09 PM Eric Friedrich <fri...@apache.org> wrote:

> Thanks Rob!
>   Two main questions:
> 1) Do you have a description of the failure modes and how TM mitigates
> those?
>
> 2) Can you say more about using a GSLB to balance TR load between
> TMs? Will a GSLB be a requirement in the immediate future for the updated
> TM?
>
> Thanks,
> Eric
>
> On Thu, Jul 15, 2021 at 2:35 PM Robert O Butts <r...@apache.org> wrote:
>
> > The current Traffic Monitor polls the entire CDN, which doesn't scale
> > with large numbers of caches.
> >
> > We've long intended to make it scalable. Several of us have a design
> > proposal. I'm putting it here, just so people don't have to click an
> > external link, but if anyone wants I can also put this on the Apache
> > wiki, just ask.
> >
> > We typically do designs with Blueprints in git, but this project is
> > large enough that we felt the need to come up with a higher-level design
> > first. Once we have consensus on the design, the intent is to make a
> > blueprint with more technical details, and then seek consensus on that.
> > Design follows.
> >
> >
> > # Implementation
> > - Should not mandate or preclude a deployment environment
> >   - Such as VMs or Containers
> > - This design intentionally minimizes Traffic Ops/Traffic Portal changes,
> > to reduce development cost for the MVP. TO changes may be desirable, and
> > may be implemented in a future iteration.
> >
> > ## Stats
> > The existing Traffic Monitor monitors both stats and health. The existing
> > Monitor will have health polling disabled and will continue to be used
> > for stat polling for the immediate future. Beyond that, this document
> > does not address stat monitoring.
> >
> > Currently, it’s possible to use stats for health consideration, but it’s
> > unlikely that anyone commonly does so.
> >
> > - Should not preclude adding an interface (JSON/HTTP) to send health
> > information (such as from stats) to be used in the health calculation
> >   - For either cache or delivery service health
> >   - This won’t be in the MVP, but is likely to be added in a future
> > version
> >
> > ## Health
> > - Must get and use Snapshotted (CRConfig, monitoring.json) data
> > exclusively.
> >   - Must Not request un-snapshotted Traffic Ops endpoints.
> >   - This is because health monitoring must match Router configuration,
> > and to prevent potentially multi-state operational changes from being
> > deployed until an operator intends them.
> > - Monitors must get their own config and data from Traffic Ops without
> > intervention.
> >   - Deployment and running must be Stateless (in terms of persistent
> > state; notwithstanding transient health state).
> >   - Local configuration should be limited to Traffic Ops authentication.
> > - Must use Traffic Ops as the Source-Of-Truth for all data about what to
> > poll, but should be agnostic (i.e. handle generic JSON endpoints without
> > special Traffic Ops–specific logic).
> > - Monitors will be divided into “monitor groups,” which will be
> > implemented as Traffic Control CacheGroups.
> >   - CacheGroups will be used to minimize Traffic Ops (TO) and Traffic
> > Portal (TP) development in the MVP.
> >   - These Monitor CacheGroups have no correlation with Cache CacheGroups.
> > TO CacheGroups are simply the grouping mechanism.
> >   - Should not preclude a different grouping mechanism.
> > - Monitors will directly poll the health of caches in their designated
> > CacheGroup(s), and will request the health of other CacheGroups from a
> > Monitor in each other Monitor Group.
> > - Monitors will serve the health of all caches in the CDN.
> >   - Both cache health directly polled and received from other monitors.
> > - Monitors must distinguish in their health endpoint (/CrStates) between
> > caches/cachegroups they polled themselves, and health received from other
> > monitors.
> >   - The endpoint should be backwards-compatible with the existing
> > /CrStates endpoint used by existing Traffic Routers (see the sketch
> > after this list for one possible backwards-compatible shape).
> > - Cache health should be boolean, but must not preclude a more granular
> > score in the future
> >   - For example, 0.0–1.0 or 0–100
> >   - The current Traffic Monitor cache health is boolean
> >   - Note this is a “must,” not a “should,” unlike most extensibility
> > requirements.
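> >
> > To illustrate that backwards-compatibility requirement, here is one
> > purely hypothetical shape for the extended health response, sketched as
> > Go types. The `isAvailable` field is meant to match what existing
> > Traffic Routers already read from /CrStates; the `directlyPolled` field
> > is an illustrative addition, not a decided name:
> >
> > ```go
> > package main
> >
> > import (
> > 	"encoding/json"
> > 	"fmt"
> > )
> >
> > // CacheAvailability sketches the per-cache object served by /CrStates,
> > // plus one additive field, so existing fields keep their current
> > // meaning and names.
> > type CacheAvailability struct {
> > 	IsAvailable bool `json:"isAvailable"`
> > 	// DirectlyPolled (hypothetical) would be true if this Monitor polled
> > 	// the cache itself, false if the state came from a peer Monitor Group.
> > 	DirectlyPolled bool `json:"directlyPolled"`
> > }
> >
> > // CRStates sketches the top-level /CrStates response.
> > type CRStates struct {
> > 	Caches map[string]CacheAvailability `json:"caches"`
> > }
> >
> > func main() {
> > 	b, _ := json.Marshal(CRStates{Caches: map[string]CacheAvailability{
> > 		"edge-cache-01": {IsAvailable: true, DirectlyPolled: true},
> > 	}})
> > 	fmt.Println(string(b))
> > }
> > ```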
> >
> > ## Peering
> > - Monitors must poll all peers in their own Group, and use the health of
> > caches reported by those monitors combined with their own direct polling
> > results to achieve consensus on the health of each cache.
> >   - This is currently implemented in Traffic Monitor, with each Monitor
> > polling all Caches in the CDN.
> >   - The consensus will be Optimistic. If any Monitor directly polling a
> > cache considers it healthy, then it’s considered healthy. This is the
> > current behavior of Traffic Monitor.
> >   - Should not preclude other consensus algorithms.
> > - Monitors must select an arbitrary Monitor in each other Monitor Group
> > each polling interval, to query for the health of CacheGroups not polled
> > directly. This selection must be deterministic and load-balanced.
> >   - For example, deterministic selection via alphabetically sorted
> > hostnames (see the sketch after this list).
> >   - For example, load-balanced by round-robin or consistent hash on the
> > deterministic selection order.
> >   - Should not preclude alternate determinism or load-balance algorithms.
> >   - The selected Monitor must be logged, for operational debugging.
> > - The number of CacheGroups each Monitor polls must be the number of
> > CacheGroups in the CDN which contain a cache divided by the number of
> > Monitor Groups.
> >   - Note this means all Monitors in a Group poll the same Caches and
> > CacheGroups
> >   - Note this means the number of Monitors in a Group is the level of
> > consensus.
> >     - For example, if you want to make sure at least 3 Monitors poll
> > every Cache, then every Monitor Group must have 3 Monitors.
> > - The CacheGroups polled by each Monitor Group will be the nearest
> > geographically
> >   - Should not preclude an alternate nearest algorithm (for example,
> > network distance)
> > - Note that the number of Caches polled, and which ones, implies that
> > each Monitor must be aware of which Caches are being polled by every
> > other Monitor Group.
> >   - For example, if there are 3 Monitor Groups and 10 CacheGroups with
> > Caches, then MonitorGroup1 must poll 3 CacheGroups, MonitorGroup2 must
> > poll 3 CacheGroups, and MonitorGroup3 must poll 4 CacheGroups.
> > - For safety, each Monitor must inspect the Peer results from every other
> > Monitor Group, and if any CacheGroup is unpolled, poll that CacheGroup
> > itself.
> >   - This safety should not execute under normal operation. But it may
> > execute during a normal upgrade of the Monitors if the peering algorithm
> > was changed.
> >   - If this occurs, an error must be logged.
> >   - The algorithm to decide to poll an unpolled CacheGroup should be:
> > `time_unpolled * log(nth_distance) > poll_interval`
> >     - Scaling the unpolled time by the logarithm of the distance rank
> > prevents “flapping” and prevents every Monitor in the entire CDN from
> > polling a CacheGroup when any slight network blip occurs, while still
> > ensuring that any Monitor observing an unpolled CacheGroup will poll it
> > after some hard limit (like 10 seconds).
> > - For safety, a Profile Parameter which overrides the CacheGroups polled
> > by a Monitor must be used if it exists.
> >   - This is a safety to fix outages, for example if there is a bug in
> > deciding which CacheGroups to poll.
> >   - This should not be used under normal operation.
> > - For safety, a Profile Parameter which overrides the number of
> > CacheGroups to poll (normally calculated automatically as the number of
> > CacheGroups divided by the number of Monitor Groups) must be used if it
> > exists.
> >   - This is a safety, to fix outages in an emergency. It should not be
> > used under normal operation.
> > - All failed polling must retry the selected target and, if the target
> > is an arbitrarily selected group member, then an alternative target.
> >   - Retry intervals should be a factor of the poll interval by default.
> >     - For example, if the poll interval is 1 second, retries should not
> > exceed around 3 seconds before marking as failed and stopping retrying.
> >     - Retry intervals should be configurable
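> >
> > As a non-normative illustration of the selection, consensus, and
> > CacheGroup-split rules above, something like the following would satisfy
> > the determinism, load-balancing, and even-split requirements. All names
> > are hypothetical, not part of the design:
> >
> > ```go
> > package main
> >
> > import (
> > 	"fmt"
> > 	"sort"
> > )
> >
> > // cacheGroupsPerMonitorGroup splits the CacheGroups containing a cache
> > // across Monitor Groups as evenly as possible, e.g. 10 CacheGroups and
> > // 3 Monitor Groups => [3 3 4].
> > func cacheGroupsPerMonitorGroup(numCacheGroups, numMonitorGroups int) []int {
> > 	counts := make([]int, numMonitorGroups)
> > 	base, extra := numCacheGroups/numMonitorGroups, numCacheGroups%numMonitorGroups
> > 	for i := range counts {
> > 		counts[i] = base
> > 		if i >= numMonitorGroups-extra { // hand out the remainder deterministically
> > 			counts[i]++
> > 		}
> > 	}
> > 	return counts
> > }
> >
> > // selectPeer picks one Monitor from another Group for a given poll
> > // interval: deterministic (alphabetically sorted hostnames) and
> > // load-balanced (round-robin by interval number).
> > func selectPeer(groupMonitors []string, pollIntervalNum int) string {
> > 	sorted := append([]string(nil), groupMonitors...)
> > 	sort.Strings(sorted)
> > 	return sorted[pollIntervalNum%len(sorted)]
> > }
> >
> > // optimisticConsensus is the current optimistic rule: a cache is healthy
> > // if any Monitor that directly polled it reports it healthy.
> > func optimisticConsensus(directPollResults []bool) bool {
> > 	for _, healthy := range directPollResults {
> > 		if healthy {
> > 			return true
> > 		}
> > 	}
> > 	return false
> > }
> >
> > func main() {
> > 	fmt.Println(cacheGroupsPerMonitorGroup(10, 3))                  // [3 3 4]
> > 	fmt.Println(selectPeer([]string{"mon-b", "mon-a", "mon-c"}, 7)) // mon-b
> > 	fmt.Println(optimisticConsensus([]bool{false, true}))           // true
> > }
> > ```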
> >
> > ## Extensibility
> > The implementation should not preclude:
> > - Using a GSLB for Monitors and clients (Routers) to talk to an arbitrary
> > monitor in a Monitor Group.
> >   - A GSLB provides some distinct advantages, but we need to support
> > non-GSLB for small ATC operators. Therefore, it will be an additional
> > feature, which may be implemented in a future version
> > - Adding more intelligent logic around Optimism, and how consensus is
> > chosen when monitors disagree.
> > - Adding “far” polling, to determine if a cache is reachable from far
> > away, and to report that cache as healthy or unhealthy to Traffic
> > Routers or clients far away.
> >   - The present design only implements “near” polling; it does not
> > implement polling from “far” away to check that the cache is reachable
> > by clients far away. This is mostly OK, because clients should mostly be
> > requesting nearby Traffic Routers, which should mostly be sending them to
> > nearby caches.
> > However, far-away requests do happen regularly in Production. Thus, we
> > should retain the ability to implement “far health” if necessary in the
> > future.
> >   - Ideally, “near health” would be sent to nearby Traffic Routers or
> > clients, and “far health” would be sent to far clients, reflecting their
> > own distance from the Cache.
> > But that granularity is expensive in both developer-time and performance.
> > - Adding an interface to send health information (such as from stats)
> > to be used in the health calculation.
> >
> >
> > Thoughts? Feedback? Votes?
> >
> > Once we have consensus on the list, we'll start on a git blueprint with
> > more technical details, and then also seek consensus on that.
> >
> > Normal Apache voting procedure is 72 hours, but this is large enough
> > that I'd like to give it at least a week, just to give everyone enough
> > time to think it through.
> >
> > So if you're interested in this project, please reply within a week with
> > feedback, or let us know if you need more time to think it through and we
> > can extend that time.
> >
> > Thanks,
> >
>
