The current Traffic Monitor polls the entire CDN, which doesn't scale with large numbers of caches.
We've long intended to make it scalable. Several of us have a design proposal. I'm putting it here so people don't have to click an external link, but if anyone wants, I can also put it on the Apache wiki; just ask.

We typically do designs with Blueprints in git, but this project is large enough that we felt the need to come up with a higher-level design first. Once we have consensus on the design, the intent is to write a blueprint with more technical details, and then seek consensus on that.

Design follows.

# Implementation

- Should not mandate or preclude a deployment environment
    - Such as VMs or Containers
- This design intentionally minimizes Traffic Ops/Traffic Portal changes, to reduce development cost for the MVP. TO changes may be desirable, and may be implemented in a future iteration.

## Stats

The existing Traffic Monitor monitors both stats and health. The existing Monitor will have health polling disabled and will continue to be used for stat polling for the immediate future. Beyond that, this document does not address stat monitoring. Currently it's possible to use stats in the health calculation, but it's unlikely that anyone commonly does so.

- Should not preclude adding an interface (JSON/HTTP) to send health information (such as from stats) to be used in the health calculation
    - For either cache or delivery service health
    - This won't be in the MVP, but is likely to be added in a future version

## Health

- Must get and use Snapshotted (CRConfig, monitoring.json) data exclusively.
    - Must not request un-snapshotted Traffic Ops endpoints.
    - This is because health monitoring must match Traffic Router configuration, and to prevent potentially multi-step operational changes from being deployed before an operator intends.
- Monitors must get their own config and data from Traffic Ops without intervention.
- Deployment and running must be stateless (in terms of persistent state; transient health state notwithstanding).
    - Local configuration should be limited to Traffic Ops authentication.
- Must use Traffic Ops as the Source of Truth for all data about what to poll, but should be agnostic (i.e. handle generic JSON endpoints without special Traffic Ops–specific logic).
- Monitors will be divided into “monitor groups,” which will be implemented as Traffic Control CacheGroups.
    - CacheGroups will be used to minimize Traffic Ops (TO) and Traffic Portal (TP) development in the MVP.
    - These Monitor CacheGroups have no correlation with Cache CacheGroups. TO CacheGroups are simply the grouping mechanism.
    - Should not preclude a different grouping mechanism.
- Monitors will directly poll the health of caches in their designated CacheGroup(s), and will request the health of other CacheGroups from a Monitor in each other Monitor Group.
- Monitors will serve the health of all caches in the CDN.
    - Both cache health polled directly and cache health received from other monitors.
- Monitors must distinguish in their health endpoint (/CrStates) between caches/cachegroups they polled themselves and health received from other monitors (a sketch of a possible payload follows this section).
    - The endpoint should be backwards-compatible with the existing /CrStates endpoint used by existing Traffic Routers.
- Cache health should be boolean, but must not preclude a more granular score in the future
    - For example, 0.0–1.0 or 0–100
    - The current Traffic Monitor cache health is boolean
    - Note this is a “must,” not a “should,” unlike most extensibility requirements.
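To make the /CrStates requirements above concrete, here's a minimal sketch (in Go, since that's what Traffic Monitor is written in) of what a backward-compatible response could look like. The `caches`/`isAvailable` shape is what existing Traffic Routers parse today; the `monitorPolledCacheGroups` and `score` fields are purely illustrative assumptions of how the "distinguish direct vs. peer health" and "more granular score" requirements might be met without breaking existing clients. The real field names and shapes would be pinned down in the blueprint.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CacheState mirrors the boolean health that /CrStates serves today, while
// leaving room for a more granular score later (e.g. 0.0-1.0).
type CacheState struct {
	IsAvailable bool `json:"isAvailable"`
	// Score is a hypothetical future field; omitted when unset so the payload
	// stays identical to today's boolean-only responses.
	Score *float64 `json:"score,omitempty"`
}

// CRStates is a sketch of a backward-compatible health response. "caches"
// keeps the shape existing Traffic Routers expect; MonitorPolledCacheGroups
// is an assumed addition listing the CacheGroups this Monitor polled itself,
// so readers can tell direct results from health learned from peer Groups.
type CRStates struct {
	Caches                   map[string]CacheState `json:"caches"`
	MonitorPolledCacheGroups []string              `json:"monitorPolledCacheGroups,omitempty"`
}

func main() {
	resp := CRStates{
		Caches: map[string]CacheState{
			"edge-east-01": {IsAvailable: true},
			"edge-west-01": {IsAvailable: false},
		},
		MonitorPolledCacheGroups: []string{"edge-east"},
	}
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}
```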
## Peering

- Monitors must poll all peers in their own Group, and use the health of caches reported by those monitors, combined with their own direct polling results, to reach consensus on the health of each cache.
    - This is currently implemented in Traffic Monitor, with each Monitor polling all Caches in the CDN.
    - The consensus will be Optimistic: if any Monitor directly polling a cache considers it healthy, then it's considered healthy. This is the current behavior of Traffic Monitor.
    - Should not preclude other consensus algorithms.
- Monitors must select an arbitrary Monitor in each other Monitor Group each polling interval, to query for the health of CacheGroups not polled directly. This selection must be deterministic and load-balanced (see the sketch after this section).
    - For example, deterministic via alphabetic hostname order.
    - For example, load-balanced by round-robin or a consistent hash over the deterministic selection order.
    - Should not preclude alternate determinism or load-balancing algorithms.
    - The selected Monitor must be logged, for operational debugging.
- The number of CacheGroups each Monitor polls must be the number of CacheGroups in the CDN which contain a cache, divided by the number of Monitor Groups.
    - Note this means all Monitors in a Group poll the same Caches and CacheGroups.
    - Note this means the number of Monitors in a Group is the level of consensus.
        - For example, if you want to make sure at least 3 Monitors poll every Cache, then every Monitor Group must have 3 Monitors.
    - The CacheGroups polled by each Monitor Group will be the nearest geographically.
        - Should not preclude an alternate nearness algorithm (for example, network distance).
    - Note that the number of CacheGroups polled and which CacheGroups are polled, above, imply that each Monitor must be aware of which Caches are being polled by every other Monitor Group.
    - For example, if there are 3 Monitor Groups and 10 CacheGroups with Caches, then MonitorGroup1 must poll 3 CacheGroups, MonitorGroup2 must poll 3 CacheGroups, and MonitorGroup3 must poll 4 CacheGroups.
- For safety, each Monitor must inspect the Peer results from every other Monitor Group, and if any CacheGroup is unpolled, poll that CacheGroup itself.
    - This safety should not execute under normal operation, but it may execute during a normal upgrade of the Monitors if the peering algorithm was changed.
    - If this occurs, an error must be logged.
    - The algorithm to decide to poll an unpolled CacheGroup should be: `time_unpolled * log(nth_distance) > poll_interval`
        - Weighting the unpolled time by the logarithm of the distance prevents “flapping” and prevents every Monitor in the CDN polling a CacheGroup when any slight network blip occurs, while still ensuring that any Monitor observing an unpolled CacheGroup will poll it after some hard limit (like 10 seconds).
- For safety, a Profile Parameter which overrides the CacheGroups polled by a Monitor must be used if it exists.
    - This is a safety to fix outages, for example if there is a bug in deciding which CacheGroups to poll.
    - This should not be used under normal operation.
- For safety, a Profile Parameter which overrides the number of CacheGroups to poll (normally calculated automatically as the number of CacheGroups divided by the number of Monitor Groups) must be used if it exists.
    - This is a safety to fix outages in an emergency. It should not be used under normal operation.
- All failed polls must retry the selected target, and, if the target is an arbitrary group member, an alternative target.
    - Retry intervals should be a factor of the poll interval by default.
        - For example, if the poll interval is 1 second, retries should not exceed around 3 seconds before marking the target as failed and stopping retrying.
    - Retry intervals should be configurable.
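To illustrate the peer-selection and safety rules above, here's a small Go sketch, not a proposed implementation. It shows one way the "deterministic and load-balanced" selection could work (alphabetic hostname order plus round-robin over the polling-interval count), and takes the `time_unpolled * log(nth_distance) > poll_interval` check literally; all names are illustrative, and the blueprint would still need to define units and the distance ranking.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// selectPeer picks one Monitor from another Monitor Group for a given polling
// interval. Sorting hostnames alphabetically makes the order deterministic;
// indexing by the interval count round-robins across the group, so queries
// are load-balanced as the design's examples suggest.
func selectPeer(groupMonitors []string, intervalCount uint64) string {
	if len(groupMonitors) == 0 {
		return ""
	}
	sorted := append([]string(nil), groupMonitors...)
	sort.Strings(sorted)
	return sorted[intervalCount%uint64(len(sorted))]
}

// shouldPollUnpolled is the safety check from the design, taken literally:
// a Monitor that sees a CacheGroup going unpolled polls it itself once
// time_unpolled * log(nth_distance) exceeds the poll interval. nthDistance
// is this Monitor's rank by distance to the CacheGroup, so different
// Monitors cross the threshold at different times and the whole CDN doesn't
// pile onto one CacheGroup after a brief network blip.
func shouldPollUnpolled(timeUnpolled, pollInterval time.Duration, nthDistance int) bool {
	return timeUnpolled.Seconds()*math.Log(float64(nthDistance)) > pollInterval.Seconds()
}

func main() {
	peers := []string{"monitor-b.example.net", "monitor-a.example.net", "monitor-c.example.net"}
	for i := uint64(0); i < 4; i++ {
		fmt.Printf("interval %d -> query %s\n", i, selectPeer(peers, i))
	}

	// A CacheGroup has gone unpolled for 4s with a 6s poll interval, and this
	// Monitor is 3rd-nearest: 4 * ln(3) ≈ 4.4 < 6, so it doesn't poll yet.
	fmt.Println(shouldPollUnpolled(4*time.Second, 6*time.Second, 3))
}
```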
## Extensibility

The implementation should not preclude:

- Using a GSLB for Monitors and clients (Routers) to talk to an arbitrary Monitor in a Monitor Group.
    - A GSLB provides some distinct advantages, but we need to support non-GSLB deployments for small ATC operators. Therefore, it will be an additional feature, which may be implemented in a future version.
- Adding more intelligent logic around Optimism, and around how consensus is reached when Monitors disagree.
- Adding “far” polling, to determine whether a cache is reachable from far away, and to report that cache as healthy or unhealthy to Traffic Routers or clients that are far away.
    - The present design only implements “near” polling; it does not poll from “far” away to ensure a cache is reachable to distant clients. This is mostly OK, because clients should mostly be requesting nearby Traffic Routers, which should mostly be sending them to nearby caches. However, far-away requests do happen regularly in Production, so we should retain the ability to implement “far health” if necessary in the future.
    - Ideally, “near health” would be sent to nearby Traffic Routers or clients, and “far health” would be sent to far clients, reflecting their own distance from the Cache. But that granularity is expensive in both developer time and performance.
- Adding an interface to send health information (such as from stats) to be used in the health calculation.

Thoughts? Feedback? Votes?

Once we have consensus on the list, we'll start on a git blueprint with more technical details, and then also seek consensus on that.

Normal Apache voting procedure is 72 hours, but this is large enough that I'd like to give it at least a week, just to give everyone enough time to think it through. So if you're interested in this project, please reply within a week with feedback, or let us know if you need more time and we can extend that.

Thanks,