Thanks Rob! Two main questions: 1) Do you have a description of the failure modes and how TM mitigates them?
2) Can you say more about using a GSLB to balance TR load between TMs? Will
a GSLB be a requirement in the immediate future for the updated TM?

Thanks,
Eric

On Thu, Jul 15, 2021 at 2:35 PM Robert O Butts <r...@apache.org> wrote:

> The current Traffic Monitor polls the entire CDN, which doesn't scale with
> large numbers of caches.
>
> We've long intended to make it scalable, and several of us have a design
> proposal. I'm putting it here just so people don't have to click an
> external link, but if anyone wants, I can also put this on the Apache
> wiki; just ask.
>
> We typically do designs with Blueprints in git, but this project is large
> enough that we felt the need to come up with a higher-level design first.
> Once we have consensus on the design, the intent is to make a Blueprint
> with more technical details, and then seek consensus on that. Design
> follows.
>
>
> # Implementation
> - Should not mandate or preclude a deployment environment
>   - Such as VMs or Containers
> - This design intentionally minimizes Traffic Ops/Traffic Portal changes,
>   to reduce development cost for the MVP. TO changes may be desirable, and
>   may be implemented in a future iteration.
>
> ## Stats
> The existing Traffic Monitor monitors both stats and health. The existing
> Monitor will have health polling disabled and will continue to be used for
> stat polling for the immediate future. Beyond that, this document does not
> address stat monitoring.
>
> Currently, it’s possible to use stats in the health calculation, but it’s
> unlikely anyone commonly does so.
>
> - Should not preclude adding an interface (JSON/HTTP) to send health
>   information (such as from stats) to be used in the health calculation
>   - For either cache or delivery service health
>   - This won’t be in the MVP, but is likely to be added in a future
>     version
>
> ## Health
> - Must get and use Snapshotted (CRConfig, monitoring.json) data
>   exclusively.
>   - Must not request un-snapshotted Traffic Ops endpoints.
>   - This is because health monitoring must match Router configuration, and
>     to prevent potentially multi-state operational changes from being
>     deployed before an operator intends them.
> - Monitors must get their own config and data from Traffic Ops without
>   intervention.
> - Deployment and running must be stateless (in terms of persistent state;
>   notwithstanding transient health state).
>   - Local configuration should be limited to Traffic Ops authentication.
> - Must use Traffic Ops as the Source of Truth for all data about what to
>   poll, but should be agnostic (i.e. handle generic JSON endpoints without
>   special Traffic Ops–specific logic).
> - Monitors will be divided into “Monitor Groups,” which will be
>   implemented as Traffic Control CacheGroups.
>   - CacheGroups will be used to minimize Traffic Ops (TO) and Traffic
>     Portal (TP) development in the MVP.
>   - These Monitor CacheGroups have no correlation with Cache CacheGroups.
>     TO CacheGroups are simply the grouping mechanism.
>   - Should not preclude a different grouping mechanism.
> - Monitors will directly poll the health of caches in their designated
>   CacheGroup(s), and will request the health of other CacheGroups from a
>   Monitor in each other Monitor Group.
> - Monitors will serve the health of all caches in the CDN.
>   - Both cache health directly polled and cache health received from other
>     Monitors.
> - Monitors must distinguish in their health endpoint (/CrStates) between
>   caches/CacheGroups they polled themselves and health received from other
>   Monitors (see the sketch after this list).
>   - The endpoint should be backwards-compatible with the existing
>     /CrStates endpoint used by existing Traffic Routers.
> - Cache health should be boolean, but must not preclude a more granular
>   score in the future
>   - For example, 0.0–1.0 or 0–100
>   - The current Traffic Monitor cache health is boolean
>   - Note this is a “must,” not a “should,” unlike most extensibility
>     requirements.
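> For illustration, a minimal sketch (in Go) of what a backwards-compatible
> /CrStates payload carrying that distinction might look like. The existing
> endpoint serves "isAvailable" per cache; the added "directlyPolled" field
> and its name are assumptions of this sketch, not a final schema.
>
> ```go
> package crstates
>
> // CrStates sketches a backwards-compatible /CrStates payload. Existing
> // Traffic Routers read only "isAvailable", so the added field is purely
> // additive and old clients simply ignore it.
> type CrStates struct {
>     Caches map[string]CacheState `json:"caches"`
> }
>
> // CacheState is the per-cache health entry.
> type CacheState struct {
>     IsAvailable bool `json:"isAvailable"`
>     // DirectlyPolled is hypothetical: true if this Monitor polled the
>     // cache itself, false if its health was received from a Monitor in
>     // another Monitor Group.
>     DirectlyPolled bool `json:"directlyPolled"`
> }
> ```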
> ## Peering
> - Monitors must poll all peers in their own Group, and use the health of
>   caches reported by those Monitors, combined with their own direct
>   polling results, to achieve consensus on the health of each cache.
>   - This is currently implemented in Traffic Monitor, with each Monitor
>     polling all Caches in the CDN.
>   - The consensus will be Optimistic: if any Monitor directly polling a
>     cache considers it healthy, then it’s considered healthy. This is the
>     current behavior of Traffic Monitor.
>   - Should not preclude other consensus algorithms.
> - Monitors must select an arbitrary Monitor in each other Monitor Group
>   each polling interval, to query for the health of CacheGroups not polled
>   directly. This selection must be deterministic and load-balanced.
>   - For example, deterministic via alphabetic hostname.
>   - For example, load-balanced by round-robin or consistent hash on the
>     deterministic selection order.
>   - Should not preclude alternate determinism or load-balance algorithms.
>   - The selected Monitor must be logged, for operational debugging.
> - The number of CacheGroups each Monitor polls must be the number of
>   CacheGroups in the CDN which contain a cache, divided by the number of
>   Monitor Groups.
>   - Note this means all Monitors in a Group poll the same Caches and
>     CacheGroups.
>   - Note this means the number of Monitors in a Group is the level of
>     consensus. For example, if you want to make sure at least 3 Monitors
>     poll every Cache, then every Monitor Group must have 3 Monitors.
> - The CacheGroups polled by each Monitor Group will be the nearest
>   geographically.
>   - Should not preclude an alternate nearness algorithm (for example,
>     network distance).
> - Note the polling rules above imply each Monitor must be aware of which
>   Caches are being polled by every other Monitor Group.
>   - For example, if there are 3 Monitor Groups and 10 CacheGroups with
>     Caches, then MonitorGroup1 must poll 3 CacheGroups, MonitorGroup2 must
>     poll 3 CacheGroups, and MonitorGroup3 must poll 4 CacheGroups.
> - For safety, each Monitor must inspect the Peer results from every other
>   Monitor Group, and if any CacheGroup is unpolled, poll that CacheGroup
>   itself.
>   - This safety should not execute under normal operation, but it may
>     execute during a normal upgrade of the Monitors if the peering
>     algorithm was changed.
>   - If this occurs, an error must be logged.
>   - The algorithm to decide to poll an unpolled CacheGroup should be:
>     `time_unpolled * log(nth_distance) > poll_interval` (sketched after
>     this section).
>   - Scaling the unpolled time by the logarithm of the distance rank
>     prevents “flapping,” and keeps every Monitor in the entire CDN from
>     polling a CacheGroup when any slight network blip occurs, while still
>     ensuring that any Monitor observing an unpolled CacheGroup will poll
>     it after some hard limit (like 10 seconds).
> - For safety, a Profile Parameter which overrides the CacheGroups polled
>   by a Monitor must be used if it exists.
>   - This is a safety to fix outages, for example if there is a bug in
>     deciding which CacheGroups to poll. It should not be used under normal
>     operation.
> - For safety, a Profile Parameter which overrides the number of
>   CacheGroups to poll (normally calculated automatically as the number of
>   CacheGroups divided by the number of Monitor Groups) must be used if it
>   exists.
>   - This is a safety, to fix outages in an emergency. It should not be
>     used under normal operation.
> - All failed polling must retry the selected target, and, if the target is
>   an arbitrary group member, an alternative target.
>   - Retry intervals should be a multiple of the poll interval by default.
>     For example, if the poll interval is 1 second, retries should not
>     exceed around 3 seconds before marking the target as failed and
>     stopping retrying.
>   - Retry intervals should be configurable.
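> To make the arithmetic above concrete, a minimal sketch (in Go) of the two
> calculations: dividing CacheGroups among Monitor Groups (with the
> remainder spread across Groups, as in the 3/3/4 example), and the damped
> decision to adopt an unpolled CacheGroup. The function names, the +1 shift
> inside the logarithm (so the nearest observer, rank 1, can still trigger),
> and the explicit hard-limit parameter are assumptions of this sketch, not
> settled design.
>
> ```go
> package peering
>
> import (
>     "math"
>     "time"
> )
>
> // cacheGroupsPerMonitorGroup divides the CacheGroups containing caches
> // among the Monitor Groups, spreading the remainder. For example,
> // 10 CacheGroups and 3 Monitor Groups yields [3 3 4].
> func cacheGroupsPerMonitorGroup(cacheGroups, monitorGroups int) []int {
>     counts := make([]int, monitorGroups)
>     for i := range counts {
>         counts[i] = cacheGroups / monitorGroups
>     }
>     // Hand the remainder to the last Groups, one CacheGroup each.
>     for i := 0; i < cacheGroups%monitorGroups; i++ {
>         counts[monitorGroups-1-i]++
>     }
>     return counts
> }
>
> // shouldAdoptUnpolled sketches the safety check
> // time_unpolled * log(nth_distance) > poll_interval.
> // nthDistance is this Monitor's geographic rank to the CacheGroup
> // (1 = nearest). The log damping makes far Monitors wait longer, so a
> // brief network blip doesn't send every Monitor in the CDN after one
> // CacheGroup, while the hard limit guarantees someone adopts it
> // eventually (the design suggests ~10 seconds).
> func shouldAdoptUnpolled(
>     timeUnpolled, pollInterval, hardLimit time.Duration,
>     nthDistance int,
> ) bool {
>     if timeUnpolled >= hardLimit {
>         return true
>     }
>     // The +1 is an assumption: with log(1) == 0, the nearest Monitor
>     // would never trigger on the formula alone.
>     return timeUnpolled.Seconds()*math.Log(float64(nthDistance+1)) >
>         pollInterval.Seconds()
> }
> ```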
> ## Extensibility
> The implementation should not preclude:
> - Using a GSLB for Monitors and clients (Routers) to talk to an arbitrary
>   Monitor in a Monitor Group.
>   - A GSLB provides some distinct advantages, but we need to support
>     non-GSLB deployments for small ATC operators. Therefore, it will be an
>     additional feature, which may be implemented in a future version.
> - Adding more intelligent logic around Optimism, and how consensus is
>   chosen when Monitors disagree.
> - Adding “far” polling, to determine whether a cache is reachable from far
>   away, and to report that cache as healthy or unhealthy to Traffic
>   Routers or clients far away.
>   - The present design implements only “near” polling, and does not verify
>     that a cache is reachable from clients far away. This is mostly OK,
>     because clients should mostly be requesting nearby Traffic Routers,
>     which should mostly be sending them to nearby caches. However,
>     far-away requests do happen regularly in Production, so we should
>     retain the ability to implement “far health” if necessary in the
>     future.
>   - Ideally, “near health” would be sent to nearby Traffic Routers or
>     clients, and “far health” to far clients, reflecting their own
>     distance from the Cache. But that granularity is expensive in both
>     developer time and performance.
> - Adding an interface to send health information (such as from stats) to
>   be used in the health calculation.
>
>
> Thoughts? Feedback? Votes?
>
> Once we have consensus on the list, we'll start on a git Blueprint with
> more technical details, and then also seek consensus on that.
>
> Normal Apache voting procedure is 72 hours, but this is large enough that
> I'd like to give it at least a week, just to give everyone enough time to
> think it through.
>
> So if you're interested in this project, please reply within a week with
> feedback, or let us know if you need more time to think it through and we
> can extend that time.
>
> Thanks,