The current Traffic Monitor polls the entire CDN, which doesn't scale with large numbers of caches.
We've long intended to make it scalable. Several of us have a design proposal. I'm putting it here so people don't have to click an external link, but if anyone wants, I can also put it on the Apache wiki; just ask.

We typically do designs with Blueprints in git, but this project is large enough that we felt the need to come up with a higher-level design first. Once we have consensus on the design, the intent is to write a blueprint with more technical details, and then seek consensus on that.

Design follows.

# Implementation

- Should not mandate or preclude a deployment environment
    - Such as VMs or Containers
- This design intentionally minimizes Traffic Ops/Traffic Portal changes, to reduce development cost for the MVP. TO changes may be desirable, and may be implemented in a future iteration.

## Stats

The existing Traffic Monitor monitors both stats and health. The existing Monitor will have health polling disabled and will continue to be used for stat polling for the immediate future. Beyond that, this document does not address stat monitoring. Currently it's possible to use stats in the health calculation, but it's unlikely that anyone commonly does so.

- Should not preclude adding an interface (JSON/HTTP) to send health information (such as from stats) to be used in the health calculation
    - For either cache or delivery service health
    - This won't be in the MVP, but is likely to be added in a future version

## Health

- Must get and use Snapshotted (CRConfig, monitoring.json) data exclusively.
    - Must not request un-snapshotted Traffic Ops endpoints.
    - This is because health monitoring must match Traffic Router configuration, and to prevent potentially multi-step operational changes from being deployed before an operator intends.
- Monitors must get their own config and data from Traffic Ops without intervention.
- Deployment and running must be stateless (in terms of persistent state; transient health state notwithstanding).
    - Local configuration should be limited to Traffic Ops authentication.
- Must use Traffic Ops as the Source of Truth for all data about what to poll, but should be agnostic (i.e. handle generic JSON endpoints without special Traffic Ops–specific logic).
- Monitors will be divided into “monitor groups,” which will be implemented as Traffic Control CacheGroups.
    - CacheGroups will be used to minimize Traffic Ops (TO) and Traffic Portal (TP) development in the MVP.
    - These Monitor CacheGroups have no correlation with Cache CacheGroups. TO CacheGroups are simply the grouping mechanism.
    - Should not preclude a different grouping mechanism.
- Monitors will directly poll the health of caches in their designated CacheGroup(s), and will request the health of other CacheGroups from a Monitor in each other Monitor Group.
- Monitors will serve the health of all caches in the CDN.
    - Both cache health polled directly and cache health received from other monitors.
- Monitors must distinguish in their health endpoint (/CrStates) between caches/cachegroups they polled themselves and health received from other monitors (a sketch of a possible payload follows this section).
    - The endpoint should be backwards-compatible with the existing /CrStates endpoint used by existing Traffic Routers.
- Cache health should be boolean, but must not preclude a more granular score in the future
    - For example, 0.0–1.0 or 0–100
    - The current Traffic Monitor cache health is boolean
    - Note this is a “must,” not a “should,” unlike most extensibility requirements.
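To make the /CrStates requirements above concrete, here's a minimal sketch (in Go, since that's what Traffic Monitor is written in) of what a backward-compatible response could look like. The `caches`/`isAvailable` shape is what existing Traffic Routers parse today; the `monitorPolledCacheGroups` and `score` fields are purely illustrative assumptions of how the "distinguish direct vs. peer health" and "more granular score" requirements might be met without breaking existing clients. The real field names and shapes would be pinned down in the blueprint.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CacheState mirrors the boolean health that /CrStates serves today, while
// leaving room for a more granular score later (e.g. 0.0-1.0).
type CacheState struct {
	IsAvailable bool `json:"isAvailable"`
	// Score is a hypothetical future field; omitted when unset so the payload
	// stays identical to today's boolean-only responses.
	Score *float64 `json:"score,omitempty"`
}

// CRStates is a sketch of a backward-compatible health response. "caches"
// keeps the shape existing Traffic Routers expect; MonitorPolledCacheGroups
// is an assumed addition listing the CacheGroups this Monitor polled itself,
// so readers can tell direct results from health learned from peer Groups.
type CRStates struct {
	Caches                   map[string]CacheState `json:"caches"`
	MonitorPolledCacheGroups []string              `json:"monitorPolledCacheGroups,omitempty"`
}

func main() {
	resp := CRStates{
		Caches: map[string]CacheState{
			"edge-east-01": {IsAvailable: true},
			"edge-west-01": {IsAvailable: false},
		},
		MonitorPolledCacheGroups: []string{"edge-east"},
	}
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}
```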
## Peering

- Monitors must poll all peers in their own Group, and use the health of caches reported by those monitors, combined with their own direct polling results, to reach consensus on the health of each cache.
    - This is currently implemented in Traffic Monitor, with each Monitor polling all Caches in the CDN.
    - The consensus will be Optimistic: if any Monitor directly polling a cache considers it healthy, then it's considered healthy. This is the current behavior of Traffic Monitor.
    - Should not preclude other consensus algorithms.
- Monitors must select an arbitrary Monitor in each other Monitor Group each polling interval, to query for the health of CacheGroups not polled directly. This selection must be deterministic and load-balanced (see the sketch after this section).
    - For example, deterministic via alphabetic hostname order.
    - For example, load-balanced by round-robin or a consistent hash over the deterministic selection order.
    - Should not preclude alternate determinism or load-balancing algorithms.
    - The selected Monitor must be logged, for operational debugging.
- The number of CacheGroups each Monitor polls must be the number of CacheGroups in the CDN which contain a cache, divided by the number of Monitor Groups.
    - Note this means all Monitors in a Group poll the same Caches and CacheGroups.
    - Note this means the number of Monitors in a Group is the level of consensus.
        - For example, if you want to make sure at least 3 Monitors poll every Cache, then every Monitor Group must have 3 Monitors.
    - The CacheGroups polled by each Monitor Group will be the nearest geographically.
        - Should not preclude an alternate nearness algorithm (for example, network distance).
    - Note that the number of CacheGroups polled and which CacheGroups are polled, above, imply that each Monitor must be aware of which Caches are being polled by every other Monitor Group.
    - For example, if there are 3 Monitor Groups and 10 CacheGroups with Caches, then MonitorGroup1 must poll 3 CacheGroups, MonitorGroup2 must poll 3 CacheGroups, and MonitorGroup3 must poll 4 CacheGroups.
- For safety, each Monitor must inspect the Peer results from every other Monitor Group, and if any CacheGroup is unpolled, poll that CacheGroup itself.
    - This safety should not execute under normal operation, but it may execute during a normal upgrade of the Monitors if the peering algorithm was changed.
    - If this occurs, an error must be logged.
    - The algorithm to decide to poll an unpolled CacheGroup should be: `time_unpolled * log(nth_distance) > poll_interval`
        - Weighting the unpolled time by the logarithm of the distance prevents “flapping” and prevents every Monitor in the CDN polling a CacheGroup when any slight network blip occurs, while still ensuring that any Monitor observing an unpolled CacheGroup will poll it after some hard limit (like 10 seconds).
- For safety, a Profile Parameter which overrides the CacheGroups polled by a Monitor must be used if it exists.
    - This is a safety to fix outages, for example if there is a bug in deciding which CacheGroups to poll.
    - This should not be used under normal operation.
- For safety, a Profile Parameter which overrides the number of CacheGroups to poll (normally calculated automatically as the number of CacheGroups divided by the number of Monitor Groups) must be used if it exists.
    - This is a safety to fix outages in an emergency. It should not be used under normal operation.
- All failed polls must retry the selected target, and, if the target is an arbitrary group member, an alternative target.
    - Retry intervals should be a factor of the poll interval by default.
        - For example, if the poll interval is 1 second, retries should not exceed around 3 seconds before marking the target as failed and stopping retrying.
    - Retry intervals should be configurable.
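To illustrate the peer-selection and safety rules above, here's a small Go sketch, not a proposed implementation. It shows one way the "deterministic and load-balanced" selection could work (alphabetic hostname order plus round-robin over the polling-interval count), and takes the `time_unpolled * log(nth_distance) > poll_interval` check literally; all names are illustrative, and the blueprint would still need to define units and the distance ranking.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// selectPeer picks one Monitor from another Monitor Group for a given polling
// interval. Sorting hostnames alphabetically makes the order deterministic;
// indexing by the interval count round-robins across the group, so queries
// are load-balanced as the design's examples suggest.
func selectPeer(groupMonitors []string, intervalCount uint64) string {
	if len(groupMonitors) == 0 {
		return ""
	}
	sorted := append([]string(nil), groupMonitors...)
	sort.Strings(sorted)
	return sorted[intervalCount%uint64(len(sorted))]
}

// shouldPollUnpolled is the safety check from the design, taken literally:
// a Monitor that sees a CacheGroup going unpolled polls it itself once
// time_unpolled * log(nth_distance) exceeds the poll interval. nthDistance
// is this Monitor's rank by distance to the CacheGroup, so different
// Monitors cross the threshold at different times and the whole CDN doesn't
// pile onto one CacheGroup after a brief network blip.
func shouldPollUnpolled(timeUnpolled, pollInterval time.Duration, nthDistance int) bool {
	return timeUnpolled.Seconds()*math.Log(float64(nthDistance)) > pollInterval.Seconds()
}

func main() {
	peers := []string{"monitor-b.example.net", "monitor-a.example.net", "monitor-c.example.net"}
	for i := uint64(0); i < 4; i++ {
		fmt.Printf("interval %d -> query %s\n", i, selectPeer(peers, i))
	}

	// A CacheGroup has gone unpolled for 4s with a 6s poll interval, and this
	// Monitor is 3rd-nearest: 4 * ln(3) ≈ 4.4 < 6, so it doesn't poll yet.
	fmt.Println(shouldPollUnpolled(4*time.Second, 6*time.Second, 3))
}
```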
## Extensibility

The implementation should not preclude:

- Using a GSLB for Monitors and clients (Routers) to talk to an arbitrary Monitor in a Monitor Group.
    - A GSLB provides some distinct advantages, but we need to support non-GSLB deployments for small ATC operators. Therefore, it will be an additional feature, which may be implemented in a future version.
- Adding more intelligent logic around Optimism, and around how consensus is reached when Monitors disagree.
- Adding “far” polling, to determine whether a cache is reachable from far away, and to report that cache as healthy or unhealthy to Traffic Routers or clients that are far away.
    - The present design only implements “near” polling; it does not poll from “far” away to ensure a cache is reachable to distant clients. This is mostly OK, because clients should mostly be requesting nearby Traffic Routers, which should mostly be sending them to nearby caches. However, far-away requests do happen regularly in Production, so we should retain the ability to implement “far health” if necessary in the future.
    - Ideally, “near health” would be sent to nearby Traffic Routers or clients, and “far health” would be sent to far clients, reflecting their own distance from the Cache. But that granularity is expensive in both developer time and performance.
- Adding an interface to send health information (such as from stats) to be used in the health calculation.

Thoughts? Feedback? Votes?

Once we have consensus on the list, we'll start on a git blueprint with more technical details, and then also seek consensus on that.

Normal Apache voting procedure is 72 hours, but this is large enough that I'd like to give it at least a week, just to give everyone enough time to think it through. So if you're interested in this project, please reply within a week with feedback, or let us know if you need more time and we can extend that.

Thanks,