Some comments and questions, jointly compiled:

- How is TM configured to monitor a subset of a CDN? Is it a static allocation of caches to TMs?
- Can you describe how the primary + backup work? Do they both poll the cache simultaneously?
- If a TM fails, how do the TMs heal / reallocate polling responsibilities? Does another TM pick up the slack?
- What prevents a misconfiguration where some caches are not polled by any TM?
- Are there any minimums/maximums to how many TMs will poll a cache?
- What is the meaning of the non-boolean 0-100 health? How is it computed and how is it used?
- What can we do to further harden TM<->TM communications and reduce blast radius?

Big thumbs up on decoupling TM from Traffic Ops. What does this practically mean: no more monitoring.json? Can we document specifically which APIs TM will use?

(Aside: we might want to think about this as an opportunity to move TM into its own repository, assuming the community decides to go ahead with separate repos per component.)

On Thu, Jun 17, 2021 at 6:09 PM Dave Neuman <neu...@apache.org> wrote:

> Hey All,
> One of the things we have been talking about doing for a long time is
> making Traffic Monitor capable of monitoring a subset of the CDN so that it
> can be deployed in a distributed fashion. The time has come for us to get
> moving on this. We have had some discussions internally to understand what
> requirements we have for doing this, but I wanted to solicit feedback from
> the community to see if there are potentially other requirements that we
> may have missed. Please take a look at the requirements we have identified
> below and let me know what feedback you have. At this point in time I am
> trying to keep this conversation separate from the design conversation and
> just focus on the requirements. Once we all agree on the requirements we
> can start discussing the design. You will notice that this proposal also
> includes adding the ability to integrate with external monitoring systems.
> I figured now would be a good time to add that functionality in as well.
>
>
> *Abstract*
>
> Update Traffic Monitor so that it is capable of monitoring only part of the
> CDN while still providing a single API for clients to get cache stats,
> delivery stats, and cache availability for a whole CDN. Add the ability to
> integrate with other systems that perform additional health monitoring and
> consider the status of these systems when making health decisions for a
> cache. Ensure that the Traffic Monitor API is capable of serving thousands
> of simultaneous clients, such as all of the caches in a CDN.
>
>
> *Problem Statement*
>
> Currently Traffic Monitor can only monitor an entire CDN. This means that
> Traffic Monitor has to poll every single cache in a CDN before making cache
> health decisions and being able to provide statistics. This also means that
> Traffic Monitors need to be located in a centralized place where they can
> reach everything, which isn't exactly representative of what a client might
> see. While this has worked really well for us to date, we know that at some
> point we will run into scaling issues which prohibit us from polling caches
> faster. In order to solve our impending scaling issues as well as improve
> our ability to make better and faster health decisions, Traffic Monitor
> needs to run in a distributed fashion instead of an all-or-nothing
> fashion.
>
> Furthermore, there is a growing need to provide support for external
> monitoring systems in Traffic Monitor. Traffic Monitor needs to be able to
> use other monitoring systems to aid in the health decision process. While
> this could be solved in today's Traffic Monitor, it is best to solve this
> problem in conjunction with making the polling distributed.
>
> *Business Justification*
>
> In order to provide the best customer experience possible, we need to have
> a robust and timely health monitoring system.
> While Traffic Monitor has been sufficient to date, we need to make sure
> that we are adapting to meet the needs of the near future and that we are
> evolving to continue to meet customers' needs. These changes to Traffic
> Monitor are imperative to providing cache health data that is as near real
> time as possible on our ever-growing CDN.
>
> *Business Requirements*
>
>    - Traffic Monitor MUST be capable of being configured to monitor a
>    portion of a CDN
>    - Traffic Monitor MUST be capable of being configured to monitor all
>    caches in a CDN
>    - Traffic Monitor MUST provide an API to get the health status of ALL
>    caches in the CDN
>    - Traffic Monitor MUST provide an API to get statistics (from e.g.
>    astats data) generated by ALL caches in the CDN. This does not include
>    any statistics generated by external monitoring systems.
>    - Traffic Monitor MUST log all requests to its API including AT LEAST
>    the following information: timestamp, client IP, resource requested,
>    response code, response reason, time to serve.
>    - Traffic Monitor MUST provide an API to get the status of caches it
>    monitors
>    - Traffic Monitor MUST log all health state changes for a cache whether
>    the decision is made internally or from an external system.
>    - Traffic Monitor MUST provide the ability to have more than 1 Traffic
>    Monitor monitor the same cache and come to consensus on the health of
>    the cache.
>    - Traffic Monitor SHOULD provide a way to configure more than one
>    subset of caches to monitor, e.g. as a primary and backup.
>    - Traffic Monitor SHOULD provide a way to integrate with external
>    services to provide additional cache health monitoring
>    - Traffic Monitor SHOULD have the capability to provide a non-boolean
>    health score for a cache, e.g. a number between 0 and 100
>    - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
>    generation
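
To make the coverage questions at the top of this reply concrete (caches polled by no TM, and minimums/maximums on pollers per cache), here is a rough sketch in Go of the kind of validation we have in mind. Everything in it is an assumption for illustration only: the proposal does not define a configuration format, so the `Assignment` type, the `CoverageProblems` function, and the TM/cache names are all hypothetical, and it assumes a static allocation of caches to TMs, which is itself one of the open questions.

```go
package main

import (
	"fmt"
	"sort"
)

// Assignment maps a Traffic Monitor name to the caches it polls.
// Hypothetical structure: the proposal does not define a config format.
type Assignment map[string][]string

// CoverageProblems returns, sorted by name, every cache whose number of
// polling monitors falls outside the range [minPollers, maxPollers].
func CoverageProblems(caches []string, a Assignment, minPollers, maxPollers int) []string {
	counts := make(map[string]int, len(caches))
	for _, polled := range a {
		seen := make(map[string]bool)
		for _, c := range polled {
			if !seen[c] { // count each monitor at most once per cache
				seen[c] = true
				counts[c]++
			}
		}
	}
	var bad []string
	for _, c := range caches {
		if n := counts[c]; n < minPollers || n > maxPollers {
			bad = append(bad, c)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	caches := []string{"edge-01", "edge-02", "edge-03"}
	a := Assignment{
		"tm-east": {"edge-01", "edge-02"},
		"tm-west": {"edge-02"},
	}
	// edge-03 is not polled by any monitor, so it is reported.
	fmt.Println(CoverageProblems(caches, a, 1, 2)) // prints [edge-03]
}
```

A check like this could run wherever the allocation is generated or loaded, so that an unpolled cache is rejected at configuration time rather than discovered in production. It says nothing about how TMs would reallocate after a failure, which remains a separate question.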