Sounds great, thanks Eric! I am looking forward to the design discussions. --Dave
On Fri, Jun 25, 2021 at 9:17 AM Eric Friedrich <fri...@apache.org> wrote:
> I'll do my best to rephrase as a potential requirement :-)
>
> 1) Traffic Monitor MUST ensure all caches are monitored upon failure of any
> TM server(s) or physical location. (i.e. no SPoF of TMs for
> polling/aggregation).
>
> Number of TM failures to be tolerated before we stop polling some caches /
> how we accomplish the above/ maximum number of caches under supervision by
> a TM are all TBD in design phase
>
> --Eric
>
> On Fri, Jun 25, 2021 at 10:36 AM Dave Neuman <neu...@apache.org> wrote:
>
> > Hey Eric,
> > Thanks for the questions/feedback. My responses are inline below. Most of
> > your questions will need to be addressed when we do design as right now I
> > just want to make sure we are not missing any requirements. I hope to
> > start design discussions in the next week or two.
> >
> > Thanks,
> > Dave
> >
> > On Fri, Jun 25, 2021 at 7:26 AM Eric Friedrich <fri...@apache.org> wrote:
> >
> > > Some comments and questions jointly compiled
> > >
> > > - How is TM configured to monitor a subset of a CDN, is it a static
> > > allocation of caches to TMs?
> > >
> >
> > DN: I think that is to be determined when we start to think about design,
> > which is after we agree on the requirements. I think for our use case the
> > most simple way to do this would be by cache group. A Traffic Monitor
> > could be configured to monitor 1 to many cache groups. However, if there
> > is a better way we could do this, I am all ears.
> >
> > >
> > > - Can you describe how the primary + backup work. Do they both poll the
> > > cache simultaneously
> > >
> >
> > DN: Again, I think we can sort out the details when we talk about design.
> > It actually might make more sense to just have multiple TMs monitor a cache
> > group and treat them all as "live", this has the benefit of providing more
> > than one view of a cache.
> >
> > >
> > > - If a TM fails, how do the TMs heal / reallocate polling
> > > responsibilities. Does another TM pick up the slack?
> > >
> >
> > DN: You want to dive straight into design :). I think the easiest answer
> > here is to ensure multiple TMs are polling each cache and that they are all
> > treated as live, then we can just use the optimistic consensus that is
> > already built into TM.
> >
> > >
> > > - What prevents a misconfiguration where some caches are not polled by
> > > any TM?
> > >
> >
> > DN: Great question. I don't think that is one I have considered, but I
> > suppose we could add a requirement saying that TM must have a way to
> > identify unpolled caches...what do you think?
> >
> > >
> > > - Are there any minimums/maximums to how many TMs will poll a cache?
> > >
> >
> > DN: Minimum is one, maximum is up to the operator, I don't know of a limit
> > in TM.
> >
> > >
> > > - What is the meaning of non-boolean 0-100 health? How is this computed and
> > > how is it used?
> > >
> >
> > DN: The health score stuff is going to be an entirely different topic
> > because I don't think it needs to be conflated with distributed polling. I
> > put that requirement in because I wanted to document that this is something
> > we are thinking about so that we don't make it difficult on ourselves when
> > we do this refactor.
> > Right now a cache's health is boolean, it either gets traffic or it
> > doesn't. The idea behind the health score is that we could assign
> > different health scores for caches in a cache group and then TR can use
> > that when determining which cache to choose. Maybe you have multiple
> > caches that are getting close to the bandwidth limit, instead of pulling
> > all traffic from them, we could simply weight them lower so the TR prefers
> > other caches, but can still use them if needed. We have a bunch of other
> > use cases that are probably best saved for when we are ready to formally
> > present the idea.
> >
> > >
> > > - What can we do to further harden TM<->TM communications and reduce
> > > blast radius?
> > >
> >
> > DN: Another topic for the design discussions, I think the basic idea is to
> > not have a SPoF which means multiple TMs polling each cache and multiple
> > TMs available to provide status to TRs, Caches, and TSs.
> >
> > >
> > > Big thumbs up on decoupling TM from Traffic Ops. What does this practically
> > > mean - no more monitoring.json? Can we document specifically which APIs TM
> > > will use?
> > > (Aside, we might want to think about this as an opportunity to move TM into
> > > its own repository - assuming the community decides to go ahead with
> > > separate repos per component).
> > >
> >
> > DN: I think that is a stretch goal for now. TM will still have to get
> > its configuration from somewhere, but ideally it does not have to come
> > from TO. Ultimately I would like TO to just serve the basic data from the
> > database and build services that can be used to generate configs using
> > business logic. We sort of did this with t3c where it gets all of the
> > information it needs from TO without relying on config file APIs
> > that used to be in TO (maybe still are?). However, t3c is purely client
> > side and I prefer a more centralized approach with something like a TM
> > configuration service that can read from TO and use the data to populate
> > APIs for TM to get its config. That way we could define just the data we
> > need in TM and a user could choose to run the TM configuration service
> > which talks to TO or provide the required data using a different backend
> > system. I think this is probably a larger conversation we need to have
> > when we start talking about how we are going to design the distributed TM.
> >
> > As for its own repo, that is a larger conversation. I am not sure what
> > that means for all of the ancillary pieces like cdn-in-a-box, the pkg
> > script, etc.
> > If it is worth the trouble then I am all for it, but I don't
> > think we should let this thread get bogged down with that conversation.
> >
> > >
> > > On Thu, Jun 17, 2021 at 6:09 PM Dave Neuman <neu...@apache.org> wrote:
> > >
> > > > Hey All,
> > > > One of the things we have been talking about doing for a long time is
> > > > making Traffic Monitor capable of monitoring a subset of the CDN so that
> > > > it can be deployed in a distributed fashion. The time has come for us to
> > > > get moving on this. We have had some discussions internally to understand
> > > > what requirements we have for doing this, but I wanted to solicit feedback
> > > > from the community to see if there are potentially other requirements that
> > > > we may have missed. Please take a look at the requirements we have
> > > > identified below and let me know what feedback you have. At this point in
> > > > time I am trying to keep this conversation separate from the design
> > > > conversation and just focus on the requirements. Once we all agree on the
> > > > requirements we can start discussing the design. You will notice that this
> > > > proposal also includes adding the ability to integrate with external
> > > > monitoring systems. I figured now would be a good time to add that
> > > > functionality in as well.
> > > >
> > > >
> > > > *Abstract*
> > > >
> > > > Update Traffic Monitor so that it is capable of monitoring only part of
> > > > the CDN while still providing a single API for clients to get cache stats,
> > > > delivery stats, and cache availability for a whole CDN. Add the ability to
> > > > integrate with other systems that perform additional health monitoring and
> > > > consider the status of these systems when making health decisions for a
> > > > cache.
> > > > Ensure that the Traffic Monitor API is capable of serving thousands
> > > > of simultaneous clients, such as all of the caches in a CDN.
> > > >
> > > >
> > > > *Problem Statement*
> > > >
> > > > Currently Traffic Monitor can only monitor an entire CDN. This means that
> > > > Traffic Monitor has to poll every single cache in a CDN before making cache
> > > > health decisions and being able to provide statistics. This also means that
> > > > Traffic Monitors need to be located in a centralized place where they can
> > > > get to everything, which isn't exactly representative of what a client might
> > > > see. While this has worked really well for us to date, we know that at some
> > > > point we will run into scaling issues which prohibit us from polling caches
> > > > faster. In order to solve our impending scaling issues as well as improve
> > > > our ability to make better and faster health decisions, Traffic Monitor
> > > > needs to run in a distributed fashion instead of an all-or-nothing
> > > > fashion.
> > > >
> > > > Furthermore, there is a growing need to provide support for external
> > > > monitoring systems in Traffic Monitor. Traffic Monitor needs to be able to
> > > > use other monitoring systems to aid in the health decision process. While
> > > > this could be solved in today's Traffic Monitor, it is best to solve this
> > > > problem in conjunction with making the polling distributed.
> > > >
> > > > *Business Justification*
> > > >
> > > > In order to provide the best customer experience possible, we need to have
> > > > a robust and timely health monitoring system. While Traffic Monitor has
> > > > been sufficient to date, we need to make sure that we are adapting to meet
> > > > the needs of the near future and we need to make sure that we are evolving
> > > > to continue to meet customers' needs.
> > > > These changes to Traffic Monitor are imperative to providing cache
> > > > health data that is as near real time as possible as the CDN continues
> > > > to grow in scale.
> > > >
> > > > *Business Requirements*
> > > >
> > > > - Traffic Monitor MUST be capable of being configured to monitor a
> > > >   portion of a CDN
> > > > - Traffic Monitor MUST be capable of being configured to monitor all
> > > >   caches in a CDN
> > > > - Traffic Monitor MUST provide an API to get the health status of ALL
> > > >   caches in the CDN
> > > > - Traffic Monitor MUST provide an API to get statistics (from e.g.
> > > >   astats data) generated by ALL caches in the CDN. This does not include
> > > >   any statistics generated by external monitoring systems.
> > > > - Traffic Monitor MUST log all requests to its API including AT LEAST
> > > >   the following information: timestamp, client IP, resource requested,
> > > >   response code, response reason, time to serve.
> > > > - Traffic Monitor MUST provide an API to get the status of caches it
> > > >   monitors
> > > > - Traffic Monitor MUST log all health state changes for a cache whether
> > > >   the decision is made internally or from an external system.
> > > > - Traffic Monitor MUST provide the ability to have more than 1 Traffic
> > > >   Monitor monitor the same cache and come to consensus on the health of
> > > >   the cache.
> > > > - Traffic Monitor SHOULD provide a way to configure more than one
> > > >   subset of caches to monitor – e.g. as a primary and backup.
> > > > - Traffic Monitor SHOULD provide a way to integrate with external
> > > >   services to provide additional cache health monitoring
> > > > - Traffic Monitor SHOULD have the capability to provide a non-boolean
> > > >   health score for a cache - e.g. a number between 0 - 100
> > > > - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
> > > >   generation
> > > >
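
Two of the points raised in this thread (identifying unpolled caches, and tolerating the failure of any single TM) can be illustrated with a small sketch. It assumes the cache-group-based allocation floated above, where each TM is configured with a set of cache groups. Everything here is hypothetical: the function names, the dictionary layout, and the reading of "optimistic consensus" as "available if any polling TM reports it available" are assumptions for illustration, not Traffic Monitor's actual code or configuration format.

```python
from typing import Dict, Iterable, Set


def unpolled_caches(
    caches_by_group: Dict[str, Set[str]],
    groups_by_tm: Dict[str, Set[str]],
    failed_tms: Iterable[str] = (),
) -> Set[str]:
    """Return the caches that no surviving TM is configured to poll.

    caches_by_group: hypothetical mapping of cache group -> cache names.
    groups_by_tm:    hypothetical mapping of TM name -> cache groups it polls.
    failed_tms:      TMs to treat as down when computing coverage.
    """
    failed = set(failed_tms)
    polled_groups: Set[str] = set()
    for tm, groups in groups_by_tm.items():
        if tm not in failed:
            polled_groups |= groups
    return {
        cache
        for group, caches in caches_by_group.items()
        if group not in polled_groups
        for cache in caches
    }


def tolerates_any_single_failure(
    caches_by_group: Dict[str, Set[str]],
    groups_by_tm: Dict[str, Set[str]],
) -> bool:
    """True if every cache is still polled after losing any one TM."""
    return all(
        not unpolled_caches(caches_by_group, groups_by_tm, [tm])
        for tm in groups_by_tm
    )


def optimistic_consensus(views: Iterable[bool]) -> bool:
    """Assumed rule: a cache is available if any polling TM says it is."""
    return any(views)
```

With `tm1` polling only `cg-east` and `tm2` polling both `cg-east` and `cg-west`, the check reports full coverage while both TMs are up, but losing `tm2` leaves every cache in `cg-west` unpolled, so that layout would not meet the "no SPoF" requirement. A check of this kind could run at configuration time, which is one possible way to address the misconfiguration question above.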