Some comments and questions, jointly compiled:

  - How is TM configured to monitor a subset of a CDN? Is it a static
allocation of caches to TMs?

  - Can you describe how the primary + backup work? Do they both poll the
cache simultaneously?

  - If a TM fails, how do the TMs heal / reallocate polling
responsibilities? Does another TM pick up the slack?

  - What prevents a misconfiguration where some caches are not polled by
any TM?

  - Are there any minimums/maximums to how many TMs will poll a cache?

  - What is the meaning of the non-boolean 0-100 health? How is this computed
and how is it used?

  - What can we do to further harden TM<->TM communications and reduce
blast radius?
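
To make the coverage and min/max questions above concrete, here is a rough
sketch of the kind of check we have in mind. It assumes a static map of TM
name to polled caches; the Assignment type, the CoverageProblems function, and
the host names are all hypothetical and do not reflect any existing TM
configuration format.

```go
package main

import (
	"fmt"
	"sort"
)

// Assignment maps a Traffic Monitor name to the caches it polls.
// Hypothetical structure for discussion, not an actual TM config format.
type Assignment map[string][]string

// CoverageProblems reports caches polled by fewer than minTMs monitors
// (including caches polled by none) and caches polled by more than maxTMs.
func CoverageProblems(allCaches []string, a Assignment, minTMs, maxTMs int) (under, over []string) {
	counts := make(map[string]int, len(allCaches))
	for _, caches := range a {
		// Count each cache at most once per TM, even if listed twice.
		seen := make(map[string]bool, len(caches))
		for _, c := range caches {
			if !seen[c] {
				seen[c] = true
				counts[c]++
			}
		}
	}
	for _, c := range allCaches {
		n := counts[c]
		if n < minTMs {
			under = append(under, c)
		} else if n > maxTMs {
			over = append(over, c)
		}
	}
	sort.Strings(under)
	sort.Strings(over)
	return under, over
}

func main() {
	all := []string{"edge-01", "edge-02", "edge-03"}
	a := Assignment{
		"tm-east": {"edge-01", "edge-02"},
		"tm-west": {"edge-02"},
	}
	under, over := CoverageProblems(all, a, 1, 1)
	fmt.Println("under-polled:", under)
	fmt.Println("over-polled:", over)
}
```

A check along these lines could run wherever the assignments are generated, so
that a cache with no TM is caught before the configuration is deployed.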

Big thumbs up on decoupling TM from Traffic Ops. What does this mean in
practice: no more monitoring.json? Can we document specifically which APIs TM
will use?
(As an aside, we might want to think about this as an opportunity to move TM
into its own repository, assuming the community decides to go ahead with
separate repos per component.)



On Thu, Jun 17, 2021 at 6:09 PM Dave Neuman <neu...@apache.org> wrote:

> Hey All,
> One of the things we have been talking about doing for a long time is
> making Traffic Monitor capable of monitoring a subset of the CDN so that it
> can be deployed in a distributed fashion.  The time has come for us to get
> moving on this.  We have had some discussions internally to understand what
> requirements we have for doing this, but I wanted to solicit feedback from
> the community to see if there are potentially other requirements that we
> may have missed.  Please take a look at the requirements we have identified
> below and let me know what feedback you have.  At this point in time I am
> trying to keep this conversation separate from the design conversation and
> just focus on the requirements.  Once we all agree on the requirements we
> can start discussing the design.  You will notice that this proposal also
> includes adding the ability to integrate with external monitoring systems.
> I figured now would be a good time to add that functionality in as well.
>
>
> *Abstract*
>
> Update Traffic Monitor so that it is capable of monitoring only part of the
> CDN while still providing a single API for clients to get cache stats,
> delivery stats, and cache availability for a whole CDN.  Add the ability to
> integrate with other systems that perform additional health monitoring and
> consider the status of these systems when making health decisions for a
> cache.  Ensure that the Traffic Monitor API is capable of serving thousands
> of simultaneous clients, such as all of the caches in a CDN.
>
>
> *Problem Statement*
>
> Currently Traffic Monitor can only monitor an entire CDN. This means that
> Traffic Monitor has to poll every single cache in a CDN before making cache
> health decisions and being able to provide statistics. This also means that
> Traffic Monitors need to be located in a centralized place where they can get
> to everything, which isn't exactly representative of what a client might
> see. While this has worked really well for us to date, we know that at some
> point we will run into scaling issues which prohibit us from polling caches
> faster.  In order to solve our impending scaling issues as well as improve
> our ability to make better and faster health decisions, Traffic Monitor
> needs to run in a distributed fashion instead of an all-or-nothing
> fashion.
>
> Furthermore, there is a growing need to provide support for external
> monitoring systems in Traffic Monitor.  Traffic Monitor needs to be able to
> use other monitoring systems to aid in the health decision process. While
> this could be solved in today's Traffic Monitor, it is best to solve this
> problem in conjunction with making the polling distributed.
>
>
> *Business Justification*
>
> In order to provide the best customer experience possible, we need to have
> a robust and timely health monitoring system.  While Traffic Monitor has
> been sufficient to date, we need to make sure that we are adapting to meet
> the needs of the near future and we need to make sure that we are evolving
> to continue to meet customers' needs.  These changes to Traffic Monitor are
> imperative to providing cache health data as close to real time as possible
> at the ever-increasing scale of our CDN.
>
>
> *Business Requirements*
>
>    - Traffic Monitor MUST be capable of being configured to monitor a
>    portion of a CDN
>    - Traffic Monitor MUST be capable of being configured to monitor all
>    caches in a CDN
>    - Traffic Monitor MUST provide an API to get the health status of ALL
>    caches in the CDN
>    - Traffic Monitor MUST provide an API to get statistics (from e.g.
>    astats data) generated by ALL caches in the CDN. This does not include
>    any statistics generated by external monitoring systems.
>    - Traffic Monitor MUST log all requests to its API including AT LEAST
>    the following information: timestamp, client IP, resource requested,
>    response code, response reason, time to serve.
>    - Traffic Monitor MUST provide an API to get the status of caches it
>    monitors
>    - Traffic Monitor MUST log all health state changes for a cache whether
>    the decision is made internally or from an external system.
>    - Traffic Monitor MUST provide the ability to have more than 1 Traffic
>    Monitor monitor the same cache and come to consensus on the health of
>    the cache.
>    - Traffic Monitor SHOULD provide a way to configure more than one
>    subset of caches to monitor – e.g. as a primary and backup.
>    - Traffic Monitor SHOULD provide a way to integrate with external
>    services to provide additional cache health monitoring
>    - Traffic Monitor SHOULD have the capability to provide a non-boolean
>    health score for a cache - e.g. a number between 0 - 100
>    - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
>    generation
>
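
As a strawman for the consensus and 0-100 health score requirements above,
here is one possible reading, sketched in Go. It assumes each TM reports an
integer score per cache, takes the median across TMs as the consensus, and
compares it to a threshold. ConsensusScore, Available, and the threshold value
are hypothetical and are only meant to anchor the discussion, not to describe
how TM computes health today.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ConsensusScore returns the median of the 0-100 health scores that
// different Traffic Monitors reported for the same cache.
// Using the median is an assumption; other rules (min, mean, quorum)
// would be equally valid readings of "consensus".
func ConsensusScore(scores []int) (int, error) {
	if len(scores) == 0 {
		return 0, errors.New("no scores reported")
	}
	sorted := append([]int(nil), scores...) // copy so the input is not reordered
	for _, s := range sorted {
		if s < 0 || s > 100 {
			return 0, fmt.Errorf("score %d outside 0-100", s)
		}
	}
	sort.Ints(sorted)
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid], nil
	}
	// Even count: average the two middle values (integer division).
	return (sorted[mid-1] + sorted[mid]) / 2, nil
}

// Available turns a consensus score into a boolean for consumers that
// still need a yes/no answer. The threshold is a hypothetical setting.
func Available(score, threshold int) bool {
	return score >= threshold
}

func main() {
	score, err := ConsensusScore([]int{90, 20, 80})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("consensus score:", score)
	fmt.Println("available at threshold 50:", Available(score, 50))
}
```

With a median, a single TM that sees a cache as unhealthy cannot take it out of
rotation on its own, which may or may not be the behavior we want.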
