Hey All,
One of the things we have been talking about doing for a long time is
making Traffic Monitor capable of monitoring a subset of the CDN so that it
can be deployed in a distributed fashion.  The time has come for us to get
moving on this.  We have had some discussions internally to understand what
requirements we have for doing this, but I wanted to solicit feedback from
the community to see if there are potentially other requirements that we
may have missed.  Please take a look at the requirements we have identified
below and let me know what feedback you have.  At this point in time I am
trying to keep this conversation separate from the design conversation and
just focus on the requirements.  Once we all agree on the requirements we
can start discussing the design.  You will notice that this proposal also
includes adding the ability to integrate with external monitoring systems.
I figured now would be a good time to add that functionality in as well.


*Abstract*

Update Traffic Monitor so that it is capable of monitoring only part of the
CDN while still providing a single API for clients to get cache stats,
delivery stats, and cache availability for a whole CDN.  Add the ability to
integrate with other systems that perform additional health monitoring and
consider the status of these systems when making health decisions for a
cache.  Ensure that the Traffic Monitor API is capable of serving thousands
of simultaneous clients, such as all of the caches in a CDN.


*Problem Statement*

Currently Traffic Monitor can only monitor an entire CDN. This means that
Traffic Monitor has to poll every single cache in a CDN before it can make
cache health decisions or provide statistics. This also means that Traffic
Monitor needs to be located in a centralized place where it can reach every
cache, which isn't exactly representative of what a client might see. While
this has worked really well for us to date, we know that at some point we
will run into scaling issues that prevent us from polling caches faster.
In order to solve our impending scaling issues as well as improve our
ability to make better and faster health decisions, Traffic Monitor needs
to run in a distributed fashion instead of an all-or-nothing fashion.

Furthermore, there is a growing need to provide support for external
monitoring systems in Traffic Monitor.  Traffic Monitor needs to be able to
use other monitoring systems to aid in the health decision process. While
this could be solved in today's Traffic Monitor, it is best to solve this
problem in conjunction with making the polling distributed.


*Business Justification*

In order to provide the best customer experience possible, we need to have
a robust and timely health monitoring system.  While Traffic Monitor has
been sufficient to date, we need to make sure that it keeps evolving to
meet our customers' needs in the near future.  These changes to Traffic
Monitor are imperative to providing cache health data that is as close to
real time as possible as the CDN continues to grow in scale.


*Business Requirements*

   - Traffic Monitor MUST be capable of being configured to monitor a
   portion of a CDN
   - Traffic Monitor MUST be capable of being configured to monitor all
   caches in a CDN
   - Traffic Monitor MUST provide an API to get the health status of ALL
   caches in the CDN
   - Traffic Monitor MUST provide an API to get statistics (e.g. from
   astats data) generated by ALL caches in the CDN. This does not include any
   statistics generated by external monitoring systems.
   - Traffic Monitor MUST log all requests to its API including AT LEAST
   the following information: timestamp, client IP, resource requested,
   response code, response reason, time to serve.
   - Traffic Monitor MUST provide an API to get the status of caches it
   monitors
   - Traffic Monitor MUST log all health state changes for a cache whether
   the decision is made internally or from an external system.
   - Traffic Monitor MUST provide the ability to have more than one Traffic
   Monitor monitor the same cache and come to a consensus on the health of
   the cache.
   - Traffic Monitor SHOULD provide a way to configure more than one
   subset of caches to monitor – e.g. as a primary and backup.
   - Traffic Monitor SHOULD provide a way to integrate with external
   services to provide additional cache health monitoring
   - Traffic Monitor SHOULD have the capability to provide a non-boolean
   health score for a cache - e.g. a number between 0 and 100
   - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
   generation
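
To make the consensus and health score requirements above a bit more
concrete, here is a minimal Python sketch.  It is only meant to illustrate
what the requirements ask for, not to propose a design: the reporter names,
the choice of a median, and the threshold of 50 are all assumptions made up
for this example and are not part of Traffic Monitor today.

```python
from statistics import median

# Hypothetical cutoff: a consensus score below this marks the cache unavailable.
AVAILABLE_THRESHOLD = 50


def consensus_score(reports):
    """Combine 0-100 health scores for one cache into a single score.

    `reports` maps a reporter name (a Traffic Monitor instance or an
    external monitoring system) to the score that reporter assigned.
    The median is used here so that one outlier cannot decide the
    result on its own; this is an assumption, not a requirement.
    """
    if not reports:
        raise ValueError("no health reports for cache")
    for name, score in reports.items():
        if not 0 <= score <= 100:
            raise ValueError(f"score from {name} out of range: {score}")
    return median(reports.values())


def is_available(reports, threshold=AVAILABLE_THRESHOLD):
    """Reduce the consensus score to a boolean availability decision."""
    return consensus_score(reports) >= threshold


# Two monitors see a healthy cache, one external probe disagrees.
reports = {"tm-east": 90, "tm-west": 85, "external-probe": 10}
print(consensus_score(reports), is_available(reports))
```

In this example the two Traffic Monitors outvote the single external
probe, so the cache stays available.  Whether external systems should be
weighted differently from Traffic Monitors is exactly the kind of question
I would like to leave for the design discussion.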
