Re: Distributed Traffic Monitor Feedback/Requirements

Dave Neuman Fri, 25 Jun 2021 09:41:51 -0700

Sounds great, thanks Eric!
I am looking forward to the design discussions.
--Dave


On Fri, Jun 25, 2021 at 9:17 AM Eric Friedrich <fri...@apache.org> wrote:

> I'll do my best to rephrase as a potential requirement :-)
>
> 1) Traffic Monitor MUST ensure all caches are monitored upon failure of any
> TM server(s) or physical location. (i.e. no SPoF of TMs for
> polling/aggregation).
>
> The number of TM failures to tolerate before we stop polling some caches,
> how we accomplish the above, and the maximum number of caches supervised by
> a single TM are all TBD in the design phase.
>
> --Eric
>
> On Fri, Jun 25, 2021 at 10:36 AM Dave Neuman <neu...@apache.org> wrote:
>
> > Hey Eric,
> > Thanks for the questions/feedback.  My responses are inline below.  Most of
> > your questions will need to be addressed when we do design, as right now I
> > just want to make sure we are not missing any requirements.  I hope to
> > start design discussions in the next week or two.
> >
> > Thanks,
> > Dave
> >
> > On Fri, Jun 25, 2021 at 7:26 AM Eric Friedrich <fri...@apache.org>
> wrote:
> >
> > > Some comments and questions jointly compiled
> > >
> > >   - How is TM configured to monitor a subset of a CDN? Is it a static
> > > allocation of caches to TMs?
> > >
> >
> > DN:  I think that is to be determined when we start to think about design,
> > which is after we agree on the requirements.  I think for our use case the
> > simplest way to do this would be by cache group.  A Traffic Monitor
> > could be configured to monitor one or more cache groups.  However, if there
> > is a better way we could do this, I am all ears.
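
A minimal sketch of the cache-group idea above, assuming a static mapping from each Traffic Monitor to one or more cache groups. The `Assignment` type, the `CachesFor` function, and all host and group names are hypothetical and are not taken from the TM codebase; the real configuration mechanism is still to be designed.

```go
package main

import (
	"fmt"
	"sort"
)

// Assignment maps a Traffic Monitor hostname to the cache groups it polls.
// Both the type and the names used below are hypothetical.
type Assignment map[string][]string

// CachesFor returns the sorted, de-duplicated caches that the given TM
// should poll, where groups maps a cache group name to its member caches.
func CachesFor(tm string, assign Assignment, groups map[string][]string) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, group := range assign[tm] {
		for _, cache := range groups[group] {
			if !seen[cache] {
				seen[cache] = true
				out = append(out, cache)
			}
		}
	}
	sort.Strings(out)
	return out
}

// exampleGroups returns made-up cache group membership.
func exampleGroups() map[string][]string {
	return map[string][]string{
		"cg-east": {"edge-east-01", "edge-east-02"},
		"cg-west": {"edge-west-01"},
	}
}

// exampleAssignment returns a made-up TM-to-cache-group assignment.
func exampleAssignment() Assignment {
	return Assignment{
		"tm-a": {"cg-east"},
		"tm-b": {"cg-east", "cg-west"},
	}
}

func main() {
	for _, tm := range []string{"tm-a", "tm-b"} {
		fmt.Println(tm, CachesFor(tm, exampleAssignment(), exampleGroups()))
	}
}
```

In this sample data both TMs poll cg-east, so every cache in that group is seen by more than one monitor.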
> >
> > >
> > >   - Can you describe how the primary + backup work? Do they both poll the
> > > cache simultaneously?
> > >
> >
> > DN: Again, I think we can sort out the details when we talk about design.
> > It actually might make more sense to just have multiple TMs monitor a cache
> > group and treat them all as "live".  This has the benefit of providing more
> > than one view of a cache.
> >
> >
> > >   - If a TM fails, how do the TMs heal / reallocate polling
> > > responsibilities? Does another TM pick up the slack?
> > >
> >
> > DN:  You want to dive straight into design :). I think the easiest answer
> > here is to ensure multiple TMs are polling each cache and that they are all
> > treated as live; then we can just use the optimistic consensus that is
> > already built into TM.
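
One way to picture the "all live, optimistic consensus" approach is sketched below. It assumes that "optimistic" means a cache is treated as available if at least one TM polling it reports it available, and unavailable only when every reporting TM agrees. That reading, the `CombineOptimistic` function, and the sample names are assumptions for illustration; the logic actually built into TM may differ.

```go
package main

import "fmt"

// CombineOptimistic merges per-TM availability reports (TM name -> cache
// name -> available) into a single view. Under the assumed optimistic rule,
// a cache is available if at least one TM reports it available.
func CombineOptimistic(reports map[string]map[string]bool) map[string]bool {
	combined := make(map[string]bool)
	for _, caches := range reports {
		for cache, available := range caches {
			combined[cache] = combined[cache] || available
		}
	}
	return combined
}

// exampleReports returns made-up reports from two TMs with overlapping scope.
func exampleReports() map[string]map[string]bool {
	return map[string]map[string]bool{
		"tm-a": {"edge-01": true, "edge-02": false},
		"tm-b": {"edge-01": false, "edge-02": false, "edge-03": true},
	}
}

func main() {
	fmt.Println(CombineOptimistic(exampleReports()))
}
```

Here edge-01 stays available because tm-a can still reach it, while edge-02 is marked unavailable because both TMs agree.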
> >
> >
> > >
> > >   - What prevents a misconfiguration where some caches are not polled by
> > > any TM?
> > >
> >
> > DN:  Great question.  I don't think that is one I have considered, but I
> > suppose we could add a requirement saying that TM must have a way to
> > identify unpolled caches...what do you think?
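
If such a requirement were added, the check could be as simple as comparing the configured assignments against the full set of cache groups. The sketch below assumes assignment by cache group and that each cache belongs to exactly one group; `UnpolledCaches` and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// UnpolledCaches returns, sorted, every cache whose cache group is not
// assigned to any TM. assign maps TM name -> cache groups; groups maps
// cache group -> member caches.
func UnpolledCaches(assign map[string][]string, groups map[string][]string) []string {
	polled := make(map[string]bool)
	for _, assigned := range assign {
		for _, group := range assigned {
			polled[group] = true
		}
	}
	unpolled := []string{}
	for group, caches := range groups {
		if !polled[group] {
			unpolled = append(unpolled, caches...)
		}
	}
	sort.Strings(unpolled)
	return unpolled
}

// sampleGroups returns made-up cache group membership.
func sampleGroups() map[string][]string {
	return map[string][]string{
		"cg-east":    {"edge-east-01", "edge-east-02"},
		"cg-west":    {"edge-west-01"},
		"cg-central": {"edge-central-01"},
	}
}

// sampleAssign returns a made-up assignment that forgets two groups.
func sampleAssign() map[string][]string {
	return map[string][]string{
		"tm-a": {"cg-east"},
		"tm-b": {"cg-east"},
	}
}

func main() {
	fmt.Println("unpolled:", UnpolledCaches(sampleAssign(), sampleGroups()))
}
```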
> >
> >
> > >
> > >   - Are there any minimums/maximums to how many TMs will poll a cache?
> > >
> > DN: The minimum is one; the maximum is up to the operator.  I don't know of a
> > limit in TM.
> >
> >
> > >
> > >   - What is the meaning of the non-boolean 0-100 health? How is this computed
> > > and how is it used?
> > >
> >
> > DN:  The health score stuff is going to be an entirely different topic
> > because I don't think it needs to be conflated with distributed polling.  I
> > put that requirement in because I wanted to document that this is something
> > we are thinking about so that we don't make it difficult on ourselves when
> > we do this refactor.
> > Right now a cache's health is boolean: it either gets traffic or it
> > doesn't.  The idea behind the health score is that we could assign
> > different health scores for caches in a cache group and then TR can use
> > that when determining which cache to choose.  Maybe you have multiple
> > caches that are getting close to the bandwidth limit; instead of pulling
> > all traffic from them, we could simply weight them lower so that TR prefers
> > other caches, but can still use them if needed. We have a bunch of other
> > use cases that are probably best saved for when we are ready to formally
> > present the idea.
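
To illustrate "weight them lower" rather than removing caches entirely, here is a rough sketch of score-proportional selection. It is not how TR chooses caches today; `ScoredCache`, `PickWeighted`, and the sample scores are hypothetical, and the value `r` is assumed to come from a random source in the range [0, 1).

```go
package main

import "fmt"

// ScoredCache pairs a cache name with a hypothetical health score from
// 0 (take no traffic) to 100 (fully healthy).
type ScoredCache struct {
	Name  string
	Score int
}

// PickWeighted selects a cache with probability proportional to its score.
// r must be in [0, 1). ok is false when no cache has a positive score.
func PickWeighted(caches []ScoredCache, r float64) (name string, ok bool) {
	total := 0
	for _, c := range caches {
		total += c.Score
	}
	if total <= 0 {
		return "", false
	}
	target := r * float64(total)
	cumulative := 0.0
	for _, c := range caches {
		cumulative += float64(c.Score)
		if c.Score > 0 && target < cumulative {
			return c.Name, true
		}
	}
	return "", false
}

// exampleCaches returns made-up scores: one healthy cache, one near its
// bandwidth limit, and one that should receive no traffic.
func exampleCaches() []ScoredCache {
	return []ScoredCache{
		{Name: "edge-01", Score: 100},
		{Name: "edge-02", Score: 25},
		{Name: "edge-03", Score: 0},
	}
}

func main() {
	for _, r := range []float64{0.1, 0.5, 0.9} {
		name, ok := PickWeighted(exampleCaches(), r)
		fmt.Println(r, name, ok)
	}
}
```

With these scores edge-02 still receives roughly a fifth of the selections instead of none, which is the behavior a boolean health flag cannot express.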
> >
> >
> > >
> > >   - What can we do to further harden TM<->TM communications and reduce
> > > blast radius?
> > >
> >
> > DN:  Another topic for the design discussions.  I think the basic idea is to
> > not have a SPoF, which means multiple TMs polling each cache and multiple
> > TMs available to provide status to TRs, Caches, and TSs.
> >
> >
> >
> > > Big thumbs up on decoupling TM from Traffic Ops. What does this practically
> > > mean - no more monitoring.json? Can we document specifically which APIs TM
> > > will use?
> > > (Aside, we might want to think about this as an opportunity to move TM into
> > > its own repository - assuming the community decides to go ahead with
> > > separate repos per component).
> > >
> >
> > DN:  I think that is a stretch goal for now.  TM will still have to get
> > its configuration from somewhere, but ideally it does not have to come
> > from TO.  Ultimately I would like TO to just serve the basic data from the
> > database and build services that can be used to generate configs using
> > business logic.  We sort of did this with t3c, where it gets all of the
> > information it needs from TO without relying on config file APIs
> > that used to be in TO (maybe still are?).  However, t3c is purely client
> > side and I prefer a more centralized approach with something like a TM
> > configuration service that can read from TO and use the data to populate
> > APIs for TM to get its config.  That way we could define just the data we
> > need in TM and a user could choose to run the TM configuration service
> > which talks to TO or provide the required data using a different backend
> > system.  I think this is probably a larger conversation we need to have
> > when we start talking about how we are going to design the distributed TM.
> >
> > As for its own repo, that is a larger conversation.  I am not sure what
> > that means for all of the ancillary pieces like cdn-in-a-box, the pkg
> > script, etc. If it is worth the trouble then I am all for it, but I don't
> > think we should let this thread get bogged down with that conversation.
> >
> > >
> > >
> > >
> > > On Thu, Jun 17, 2021 at 6:09 PM Dave Neuman <neu...@apache.org> wrote:
> > >
> > > > Hey All,
> > > > One of the things we have been talking about doing for a long time is
> > > > making Traffic Monitor capable of monitoring a subset of the CDN so that it
> > > > can be deployed in a distributed fashion.  The time has come for us to get
> > > > moving on this.  We have had some discussions internally to understand what
> > > > requirements we have for doing this, but I wanted to solicit feedback from
> > > > the community to see if there are potentially other requirements that we
> > > > may have missed.  Please take a look at the requirements we have identified
> > > > below and let me know what feedback you have.  At this point in time I am
> > > > trying to keep this conversation separate from the design conversation and
> > > > just focus on the requirements.  Once we all agree on the requirements we
> > > > can start discussing the design.  You will notice that this proposal also
> > > > includes adding the ability to integrate with external monitoring systems.
> > > > I figured now would be a good time to add that functionality in as well.
> > > >
> > > >
> > > > *Abstract*
> > > >
> > > > Update Traffic Monitor so that it is capable of monitoring only part of the
> > > > CDN while still providing a single API for clients to get cache stats,
> > > > delivery stats, and cache availability for a whole CDN.  Add the ability to
> > > > integrate with other systems that perform additional health monitoring and
> > > > consider the status of these systems when making health decisions for a
> > > > cache.  Ensure that the Traffic Monitor API is capable of serving thousands
> > > > of simultaneous clients, such as all of the caches in a CDN.
> > > >
> > > >
> > > > *Problem Statement*
> > > >
> > > > Currently Traffic Monitor can only monitor an entire CDN. This means that
> > > > Traffic Monitor has to poll every single cache in a CDN before making cache
> > > > health decisions and being able to provide statistics. This also means that
> > > > Traffic Monitors need to be located in a centralized place where they can get
> > > > to everything, which isn't exactly representative of what a client might
> > > > see. While this has worked really well for us to date, we know that at some
> > > > point we will run into scaling issues which prohibit us from polling caches
> > > > faster.  In order to solve our impending scaling issues as well as improve
> > > > our ability to make better and faster health decisions, Traffic Monitor
> > > > needs to run in a distributed fashion instead of an all-or-nothing
> > > > fashion.
> > > >
> > > > Furthermore, there is a growing need to provide support for external
> > > > monitoring systems in Traffic Monitor.  Traffic Monitor needs to be able to
> > > > use other monitoring systems to aid in the health decision process.  While
> > > > this could be solved in today's Traffic Monitor, it is best to solve this
> > > > problem in conjunction with making the polling distributed.
> > > >
> > > >
> > > > *Business Justification*
> > > >
> > > > In order to provide the best customer experience possible, we need to have
> > > > a robust and timely health monitoring system.  While Traffic Monitor has
> > > > been sufficient to date, we need to make sure that we are adapting to meet
> > > > the needs of the near future and evolving to continue to meet customers'
> > > > needs.  These changes to Traffic Monitor are imperative to providing cache
> > > > health data that is as near real time as possible as our CDN continues to
> > > > grow in scale.
> > > >
> > > >
> > > > *Business Requirements*
> > > >
> > > >    - Traffic Monitor MUST be capable of being configured to monitor a
> > > >    portion of a CDN
> > > >    - Traffic Monitor MUST be capable of being configured to monitor all
> > > >    caches in a CDN
> > > >    - Traffic Monitor MUST provide an API to get the health status of ALL
> > > >    caches in the CDN
> > > >    - Traffic Monitor MUST provide an API to get statistics (from e.g.
> > > >    astats data) generated by ALL caches in the CDN. This does not include any
> > > >    statistics generated by external monitoring systems.
> > > >    - Traffic Monitor MUST log all requests to its API including AT LEAST
> > > >    the following information: timestamp, client IP, resource requested,
> > > >    response code, response reason, time to serve.
> > > >    - Traffic Monitor MUST provide an API to get the status of caches it
> > > >    monitors
> > > >    - Traffic Monitor MUST log all health state changes for a cache whether
> > > >    the decision is made internally or from an external system.
> > > >    - Traffic Monitor MUST provide the ability to have more than 1 Traffic
> > > >    Monitor monitor the same cache and come to consensus on the health of the
> > > >    cache.
> > > >    - Traffic Monitor SHOULD provide a way to configure more than one
> > > >    subset of caches to monitor – e.g. as a primary and backup.
> > > >    - Traffic Monitor SHOULD provide a way to integrate with external
> > > >    services to provide additional cache health monitoring
> > > >    - Traffic Monitor SHOULD have the capability to provide a non-boolean
> > > >    health score for a cache - e.g. a number between 0 - 100
> > > >    - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
> > > >    generation
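
As a concrete reading of the API request logging requirement in the list above, the sketch below models one log record carrying the six listed fields. The `AccessLogEntry` type, its field names, the line format, and the sample values are all assumptions for illustration, not TM's actual log format.

```go
package main

import (
	"fmt"
	"time"
)

// AccessLogEntry holds the minimum fields named in the requirement:
// timestamp, client IP, resource requested, response code, response
// reason, and time to serve. The type itself is hypothetical.
type AccessLogEntry struct {
	Timestamp   time.Time
	ClientIP    string
	Resource    string
	Code        int
	Reason      string
	TimeToServe time.Duration
}

// String renders the entry as one space-separated line, with the timestamp
// in UTC RFC 3339 form and the time to serve in whole milliseconds.
func (e AccessLogEntry) String() string {
	return fmt.Sprintf("%s %s %s %d %q %dms",
		e.Timestamp.UTC().Format(time.RFC3339),
		e.ClientIP, e.Resource, e.Code, e.Reason,
		e.TimeToServe.Milliseconds())
}

// sampleEntry returns a made-up request record.
func sampleEntry() AccessLogEntry {
	return AccessLogEntry{
		Timestamp:   time.Date(2021, time.June, 25, 16, 41, 51, 0, time.UTC),
		ClientIP:    "192.0.2.10",
		Resource:    "/publish/CrStates",
		Code:        200,
		Reason:      "OK",
		TimeToServe: 12 * time.Millisecond,
	}
}

func main() {
	fmt.Println(sampleEntry())
}
```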
> > > >
> > >
> >
>
