JvD, you're spoiling the surprise ending! :D But I do think one of the goals is to transmit subsets of the data, not single enormous data objects. And we have to get from here to there in a sane, supportable, and testable way.
It's been a couple of days and I haven't heard any opposition to cache-side processing. Sounds like the community has consensus on this point. There are some unhandled objections outstanding on the other issues, particularly that of inverting LoC for updates. I think we can engineer consensus out of that issue eventually, though. But we should do first things first and start with cache-side processing. Maybe we'll learn something new in the process. On Thu, Aug 1, 2019 at 7:46 PM Jan van Doorn <[email protected]> wrote: > > > > > On Aug 1, 2019, at 18:10, Evan Zelkowitz <[email protected]> wrote: > > > > Also a +1 on defining a standard common language that this library > > will use instead of direct dependencies on TO data. I know it has been > > bandied about for years that instead of having a direct TOdata->ATS > > config we'd have TOdata->cache-side library->generic CDN > > language->$LocalCacheSoftwareConfig > > > > Would be nice to keep that dream alive. So now you would have a > > library to handle TO data conversion to generic data, then separate > > libraries per cache software to determine config from that data > > Maybe use > https://github.com/apache/trafficserver/blob/voluspa/tools/voluspa/schema_v1.json > as a start for this generic CDN language? > > Rgds, > JvD > > > > > On Thu, Aug 1, 2019 at 4:17 PM Rawlin Peters <[email protected]> > > wrote: > >> > >> It sounds like: > >> (A) everyone is +1 on cache-side config generation > >> (B) most people are -1 on caches connecting directly to the TO DB > >> (C) most people are +1 on TO pushing data to ORT instead of the other way > >> around > >> (D) most people are -1 on using Kafka for cache configs > >> > >> For (A) I'm +1 on the approach (ORT sidecar), but I think if we can > >> design the ATS config gen library in a way that it just takes in a > >> well-defined set of input data and returns strings (the generated > >> config files) as output, it shouldn't really matter whether the input data > >> comes from direct DB queries, API calls from ORT to TO, or pushed data > >> from TO to ORT. Whatever that library ends up looking like, it should > >> just be a matter of getting that data from some source and converting > >> it into the input format expected by the library. The library should > >> not have any dependency on external data -- only what has been passed > >> into the library's function. Then we will get a lot of nice benefits > >> in terms of testability and reusability. > >> > >> Testability: > >> Given a set of input data, we can expect certain output in terms of > >> the ATS config files, so it would be easy to write unit tests for. > >> Right now that's a hard thing to do because we have to mock out every > >> single DB call for those unit tests. > >> > >> Reusability: > >> The library could be shared between the TO API and this new > >> ORT-sidecar thing, and the only difference should be that the TO API > >> runs a set of DB queries to populate the input data whereas the > >> ORT-sidecar thing would run a set of TO API calls to populate the > >> input data. > >> > >> I know it might be more difficult to come up with that well-defined > >> input interface than to just make DB or API calls whenever you need > >> data in the library, but I think it would be well worth the effort for > >> those reasons above.
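To make that concrete, here's a minimal sketch (in Go, the language the new tooling is written in) of the kind of pure-function generator library Rawlin describes. Every name here -- the configgen package, ConfigData, GenerateConfigs -- is a hypothetical illustration, not the actual atstccfg API:

    // Package configgen sketches the "well-defined input in, config strings out"
    // idea. The library does no I/O; the caller (TO via DB queries, or the ORT
    // sidecar via TO API calls) is responsible for assembling ConfigData.
    package configgen

    // Server and Parameter are stand-ins for whatever the real input types would be.
    type Server struct {
        HostName string
        Profile  string
    }

    type Parameter struct {
        ConfigFile string
        Name       string
        Value      string
    }

    // ConfigData is everything the generator needs, passed in by the caller.
    type ConfigData struct {
        Server     Server
        Parameters []Parameter
        // ... delivery services, cache groups, etc.
    }

    // GenerateConfigs is a pure function: same input, same output, no external
    // dependencies. It returns a map of config file name -> file contents.
    func GenerateConfigs(data ConfigData) (map[string]string, error) {
        configs := map[string]string{}
        for _, p := range data.Parameters {
            configs[p.ConfigFile] += p.Name + " " + p.Value + "\n"
        }
        return configs, nil
    }

Because GenerateConfigs does no I/O, a unit test is just a literal ConfigData and an expected output string -- no DB mocking -- and the same function can sit behind the TO API, the ORT sidecar, or anything else that can assemble a ConfigData.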
> >> > >> - Rawlin > >> > >> On Thu, Aug 1, 2019 at 7:28 AM ocket 8888 <[email protected]> wrote: > >>> > >>> Well, in that spirit: > >>> > >>> - Cache-side processing: +1. I suppose given that we wouldn't > >>> want > >>> to rewrite the entire configuration generation logic at once, there's no > >>> reason to prefer this being part of ORT immediately versus separate. > >>> Either > >>> way, there's no real "extra step". Though I must admit, I am sad to see > >>> this written in Go and not built into my baby: ORT.py > >>> > >>> - Invert LoC for config update: +1 because this essentially lets you do > >>> server configuration snapshots for free, in addition to the other > >>> benefits, > >>> and that's a pretty heavily requested feature, I think. For servers that are > >>> unreachable from TO, there's a problem, and after backing off for some set > >>> number of retries they should probably just be marked offline with reason > >>> e.g. "unable to contact for updates". This is a bit off-topic, though, > >>> imo, > >>> because it's sort of independent from what Rob's suggesting, and that > >>> utility isn't even really designed with that in mind - at least just yet. > >>> A > >>> conversation for after it's finished. > >>> > >>> - Invert LoC for data selection: +1 (I think). Letting the cache server > >>> decide what it needs to know allows you to decouple Traffic Ops from any > >>> particular cache server, which would let us support things like Grove or NGINX more > >>> easily. Or at least allow people to write their own config generators > >>> (plugins?) for those servers. Though honestly, it's probably always going > >>> to want the entire profile/parameter set and information for assigned > >>> delivery services anyway; it'll just decide for itself what's important and > >>> what's meaningless. This is somewhat more related, imo, since I _think_ > >>> (without looking at any code) that what Rob's thing does now is just > >>> request the information it thinks it needs, and builds the configs with > >>> that. I'd be interested to hear more about the "fragility when out-of-sync" > >>> thing. Or maybe I'm misunderstanding the concept? If what you mean is > >>> something more like "the cache server selects what specific parameters it > >>> needs" then I'm -1, but you should be able to get all of the parameters > >>> and > >>> their modified dates with one call to `/profiles?name={{name}}` and then > >>> decide from there. So the server still tells you everything that just > >>> changed. Stuff like CG assignment/params and DS assignment/params would > >>> likewise still need to be checked normally. So +1 for caches deciding what > >>> API endpoints to call, -1 for big globs of "this is what was updated" > >>> being > >>> pushed to the cache server, and -1 for cache servers trying to guess what > >>> might have changed instead of checking everything. > >>> > >>> - Direct Database Connection: -1 > >>> - Kafka: -1 > >>> > >>> > >>>> "This is true, but you can also run the cache-config generator to > >>> visually inspect them as well" > >>> > >>> yeah, but then you either need to expose that configuration generator's > >>> output to the internet or `ssh` into a cache server or > >>> something to inspect it individually. I think his concern is not being > >>> able > >>> to see it in Traffic Portal, which is arguably safer and much easier than > >>> the other two options, respectively.
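As a rough sketch of that polling idea: one call to the profiles endpoint, then diff the modified dates locally. Note that the endpoint path, response shape, and field names below are assumptions for illustration only, and a real TO client needs an authenticated session, which is omitted here:

    // Poll a profile's parameters and report which ones changed since the last
    // poll, keyed on their lastUpdated timestamps. Hypothetical response shape;
    // not authoritative for any particular TO API version.
    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    type param struct {
        Name        string `json:"name"`
        ConfigFile  string `json:"configFile"`
        Value       string `json:"value"`
        LastUpdated string `json:"lastUpdated"`
    }

    type profilesResponse struct {
        Response []struct {
            Name   string  `json:"name"`
            Params []param `json:"params"`
        } `json:"response"`
    }

    // changedParams returns parameters whose lastUpdated differs from the value
    // recorded on the previous poll. seen maps parameter name -> lastUpdated.
    func changedParams(toURL, profile string, seen map[string]string) ([]param, error) {
        resp, err := http.Get(fmt.Sprintf("%s/api/1.4/profiles?name=%s", toURL, profile))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var pr profilesResponse
        if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
            return nil, err
        }
        var changed []param
        for _, prof := range pr.Response {
            for _, p := range prof.Params {
                if seen[p.Name] != p.LastUpdated {
                    changed = append(changed, p)
                }
            }
        }
        return changed, nil
    }

    func main() {
        seen := map[string]string{} // lastUpdated values from the previous poll
        changed, err := changedParams("https://to.example.test", "EDGE_ATS", seen)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%d parameters changed\n", len(changed))
    }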
> >>> My response would be "but you can still see arbitrary configuration, so > >>> maybe that just needs to be made easier to view and understand to > >>> compensate. Like, instead of > >>> > >>> { > >>> "configFile": "records.config", > >>> "name": "CONFIG proxy.config.http.insert_response_via_str", > >>> "secure": false, > >>> "value": "INT 3" > >>> }, > >>> > >>> you see something like 'encoded Via header verbosity: none/low/med/high'" > >>> > >>> > >>> On Wed, Jul 31, 2019 at 11:01 AM Chris Lemmons <[email protected]> wrote: > >>> > >>>> This is true, but you can also run the cache-config generator to > >>>> visually inspect them as well. That makes it easy to inspect > >>>> them by eye as well as to pipe them to diff and check them mechanically. So > >>>> we don't lose the ability entirely, we just move it from one place to > >>>> another. > >>>> > >>>> On Wed, Jul 31, 2019 at 10:47 AM Genz, Geoffrey > >>>> <[email protected]> wrote: > >>>>> > >>>>> A small point, but TO currently allows one to visually inspect/validate > >>>> the generated configuration files. I don't know how critical that > >>>> functionality is (I personally found it invaluable when testing logging > >>>> configuration changes), but it seems like we either have the generation > >>>> logic in two places (ORT and TO), or we lose that ability in TO by moving > >>>> all the logic to the cache. > >>>>> > >>>>> - Geoff > >>>>> > >>>>> On 7/31/19, 10:33 AM, "Jeremy Mitchell" <[email protected]> wrote: > >>>>> > >>>>> my feedback: > >>>>> > >>>>> 1. i like the idea of slimming down TO. It's gotten way too fat. > >>>> Basically > >>>>> deprecating these api endpoints at some point and letting "something > >>>> else" > >>>>> do the job of config generation: > >>>>> > >>>>> GET /api/$version/servers/#id/configfiles/ats > >>>>> GET /api/$version/profiles/#id/configfiles/ats/#filename > >>>>> GET /api/$version/servers/#id/configfiles/ats/#filename > >>>>> GET /api/$version/cdns/#id/configfiles/ats/#filename > >>>>> > >>>>> 2. i don't really care if that "something else" is a sidecar to ORT > >>>> or > >>>>> actually ORT. will let you guys hash that out. > >>>>> > >>>>> 3. i like the idea of that "something else" eventually being able to > >>>> handle > >>>>> a push vs. a pull as rawlin suggested. > >>>>> > >>>>> 4. a bit curious how "cache snapshots" would work as rob suggested in > >>>>> > >>>> https://cwiki.apache.org/confluence/display/TC/Cache-Side+Config+Generation > >>>> - > >>>>> would you look at a cache snapshot diff and then snapshot (which > >>>> would > >>>>> queue updates in the background)? > >>>>> > >>>>> otherwise, thanks for taking the initiative, rob. and looking > >>>> forward to > >>>>> seeing what comes of this that will make TC safer/more efficient. > >>>>> > >>>>> jeremy > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Wed, Jul 31, 2019 at 9:20 AM Gray, Jonathan < > >>>> [email protected]> > >>>>> wrote: > >>>>> > >>>>>> Smaller, simpler pieces closer to the cache that do one job are far > >>>>>> simpler to maintain, triage, and build. I'm not a fan of trying > >>>> to inject > >>>>>> a message bus in the middle of everything. > >>>>>> > >>>>>> Jonathan G > >>>>>> > >>>>>> > >>>>>> On 7/31/19, 8:48 AM, "Genz, Geoffrey" <[email protected]> > >>>> wrote: > >>>>>> > >>>>>> To throw a completely different idea out there . . . some time > >>>> ago > >>>>>> Matt Mills was talking about using Kafka as the configuration > >>>> transport > >>>>>> mechanism for Traffic Control.
The idea is to use a Kafka > >>>> compacted topic > >>>>>> as the configuration source. TO would write database updates to > >>>> Kafka, and > >>>>>> the ORT equivalent would pull its configuration from Kafka. > >>>>>> > >>>>>> To explain compacted topics a bit, a standard Kafka message is > >>>> a key > >>>>>> and a payload; in a compacted topic, only the most recent > >>>> message/payload > >>>>>> with a particular key is kept. As a result, reading all the > >>>> messages from > >>>>>> a topic will give you the current state of what's basically a key > >>>> value > >>>>>> store, with the benefit of not doing actual mutations of data. So > >>>> a cache > >>>>>> could get the full expected configuration by reading all the > >>>> existing > >>>>>> messages on the appropriate topic, as well as get new updates to > >>>>>> configuration by listening for new Kafka messages. > >>>>>> > >>>>>> This leaves the load on the Kafka brokers, which, I can assure > >>>> you > >>>>>> given recent experience, is minimal. TO would only have the > >>>> responsibility > >>>>>> of writing database updates to Kafka, and ORT would only need to read > >>>>>> individual updates (and be smart enough to know how and when to > >>>> apply them > >>>>>> -- perhaps hints could be provided in the payload?). The result > >>>> is that TO is > >>>>>> "pushing" updates to the caches (via Kafka) as Rawlin was > >>>> proposing, and > >>>>>> ORT could still pull the full configuration whenever necessary > >>>> with no hit > >>>>>> to Postgres or TO. > >>>>>> > >>>>>> Now this is obviously a radical shift (and there are no doubt > >>>> other > >>>>>> ways to implement the basic idea), but it seemed worth bringing up. > >>>>>> > >>>>>> - Geoff > >>>>>> > >>>>>> On 7/31/19, 8:30 AM, "Lavanya Bathina" <[email protected]> > >>>> wrote: > >>>>>> > >>>>>> +1 on this > >>>>>> > >>>>>> On Jul 30, 2019, at 6:01 PM, Rawlin Peters < > >>>>>> [email protected]> wrote: > >>>>>> > >>>>>> I've been thinking for a while now that ORT's current > >>>> pull-based > >>>>>> model > >>>>>> of checking for queued updates is not really ideal, and I > >>>> was > >>>>>> hoping > >>>>>> with "ORT 2.0" that we would switch that paradigm around > >>>> to where > >>>>>> TO > >>>>>> itself would push updates out to queued caches. That way > >>>> TO would > >>>>>> never get overloaded because we could tune the level of > >>>> concurrency > >>>>>> for pushing out updates (based on server capacity/specs), > >>>> and we > >>>>>> would > >>>>>> eliminate the "waiting period" between the time updates > >>>> are queued > >>>>>> and > >>>>>> the time ORT actually updates the config on the cache. > >>>>>> > >>>>>> I think cache-side config generation is a good idea in > >>>> terms of > >>>>>> enabling canary deployments, but as CDNs continue to scale > >>>> by > >>>>>> adding > >>>>>> more and more caches, we might want to get out ahead of > >>>> the ORT > >>>>>> load/waiting problem by flipping that paradigm from "pull" > >>>> to > >>>>>> "push" > >>>>>> somehow. Then instead of 1000 caches all asking TO the same > >>>>>> question > >>>>>> and causing 1000 duplicated reads from the DB, TO would > >>>> just read > >>>>>> the > >>>>>> one answer from the DB and send it to all the caches, > >>>> further > >>>>>> reducing > >>>>>> load on the DB as well. The data in the "push" request > >>>> from TO to > >>>>>> ORT > >>>>>> 2.0 would contain all the information ORT would request > >>>> from the > >>>>>> API > >>>>>> itself, not the actual config files.
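For reference, the compacted-topic replay Geoff describes above might look something like this sketch, using the github.com/segmentio/kafka-go client. The topic name and message layout are made up for illustration:

    // Rebuild current config state by replaying a compacted topic from the
    // beginning, then keep consuming to pick up new updates as they arrive.
    package main

    import (
        "context"
        "log"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers:   []string{"kafka.example.test:9092"},
            Topic:     "cache-config", // hypothetical topic name
            Partition: 0,
        })
        defer r.Close()
        if err := r.SetOffset(kafka.FirstOffset); err != nil {
            log.Fatal(err)
        }

        // Compaction keeps only the latest message per key, so replaying the
        // topic yields (approximately) the current key-value state.
        state := map[string]string{}
        for {
            msg, err := r.ReadMessage(context.Background())
            if err != nil {
                log.Fatal(err)
            }
            state[string(msg.Key)] = string(msg.Value)
            // A real ORT-equivalent would detect "caught up" by comparing
            // msg.Offset against the partition high-water mark, then apply
            // each subsequent message as an incremental update.
            log.Printf("%d keys in state after offset %d", len(state), msg.Offset)
        }
    }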
> >>>>>> > >>>>>> With the API transition from Perl to Go, I think we're > >>>> eliminating > >>>>>> the > >>>>>> Perl CPU bottleneck from TO, but the next bottleneck seems > >>>> like it > >>>>>> would be reading from the DB due to the constantly growing > >>>> number > >>>>>> of > >>>>>> concurrent ORT requests as a CDN scales up. We should keep > >>>> that in > >>>>>> mind for whatever "ORT 2.0"-type changes we're making so > >>>> that it > >>>>>> won't > >>>>>> make flipping that paradigm around even harder. > >>>>>> > >>>>>> - Rawlin > >>>>>> > >>>>>>> On Tue, Jul 30, 2019 at 4:23 PM Robert Butts < > >>>> [email protected]> > >>>>>> wrote: > >>>>>>> > >>>>>>>> I'm confused why this is separate from ORT. > >>>>>>> > >>>>>>> Because ORT does a lot more than just fetching config > >>>> files. > >>>>>> Rewriting all > >>>>>>> of ORT in Go would be considerably more work. > >>>> Contrariwise, if we > >>>>>> were to put > >>>>>>> the config generation in the ORT script itself, we would > >>>> have to > >>>>>> write it > >>>>>>> all from scratch in Perl (the old config gen used the > >>>> database > >>>>>> directly, so > >>>>>>> it'd still have to be rewritten) or Python. This was > >>>> just the > >>>>>> easiest path > >>>>>>> forward. > >>>>>>> > >>>>>>>> I feel like this logic should just be replacing the > >>>> config > >>>>>> fetching logic > >>>>>>> of ORT > >>>>>>> > >>>>>>> That's exactly what it does: the PR changes ORT to call > >>>> this app > >>>>>> instead of > >>>>>>> calling Traffic Ops over HTTP: > >>>>>>> > >>>>>> > >>>> https://github.com/apache/trafficcontrol/pull/3762/files#diff-fe8a3eac71ee592a7170f2bdc7e65624R1485 > >>>>>>> > >>>>>>>> Is that the eventual plan? Or does our vision of the > >>>> future > >>>>>> include this > >>>>>>> *and* ORT? > >>>>>>> > >>>>>>> I reserve the right to develop a strong opinion about > >>>> that in > >>>>>> the future. > >>>>>>> > >>>>>>> > >>>>>>> On Tue, Jul 30, 2019 at 3:17 PM ocket8888 < > >>>> [email protected]> > >>>>>> wrote: > >>>>>>> > >>>>>>>>> "I'm just looking for consensus that this is the right > >>>>>> approach." > >>>>>>>> > >>>>>>>> Umm... sort of. I think moving cache configuration to > >>>> the cache > >>>>>> itself > >>>>>>>> is a great idea, > >>>>>>>> > >>>>>>>> but I'm confused why this is separate from ORT. Like if > >>>> this is > >>>>>> going to > >>>>>>>> be generating the > >>>>>>>> > >>>>>>>> configs and it's already right there on the server, I > >>>> feel like > >>>>>> this > >>>>>>>> logic should just be > >>>>>>>> > >>>>>>>> replacing the config fetching logic of ORT (and > >>>> personally I > >>>>>> think a > >>>>>>>> neat place to try it > >>>>>>>> > >>>>>>>> out would be in ORT.py). > >>>>>>>> > >>>>>>>> > >>>>>>>> Is that the eventual plan? Or does our vision of the > >>>> future > >>>>>> include this > >>>>>>>> *and* ORT? > >>>>>>>> > >>>>>>>> > >>>>>>>>> On 7/30/19 2:15 PM, Robert Butts wrote: > >>>>>>>>> Hi all! I've been working on moving the ATS config > >>>> generation > >>>>>> from > >>>>>>>> Traffic > >>>>>>>>> Ops to a standalone app alongside ORT that queries the > >>>>>> standard TO API > >>>>>>>> to > >>>>>>>>> generate its data. I just wanted to put it here, and > >>>> get some > >>>>>> feedback, > >>>>>>>> to > >>>>>>>>> make sure the community agrees this is the right > >>>> direction.
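As a back-of-the-napkin sketch of the tunable-concurrency push Rawlin describes above: TO reads the one answer from the DB, then fans it out to every queued cache through a bounded worker pool. All names and the transport here are hypothetical:

    // Push one payload to many caches with a tunable concurrency limit,
    // so TO controls its own fan-out rate instead of being polled.
    package main

    import (
        "fmt"
        "sync"
    )

    func pushToCaches(caches []string, payload []byte, maxConcurrent int) {
        sem := make(chan struct{}, maxConcurrent) // the tunable knob
        var wg sync.WaitGroup
        for _, cache := range caches {
            wg.Add(1)
            sem <- struct{}{} // acquire a slot
            go func(cache string) {
                defer wg.Done()
                defer func() { <-sem }() // release the slot
                // In a real system this would be an authenticated request to
                // the cache's ORT-sidecar endpoint, with retry/backoff and
                // "mark offline after N failures" handling.
                fmt.Printf("pushing %d bytes to %s\n", len(payload), cache)
            }(cache)
        }
        wg.Wait()
    }

    func main() {
        caches := []string{"edge-1", "edge-2", "edge-3"}
        pushToCaches(caches, []byte(`{"configData":"..."}`), 2)
    }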
> >>>>>>>>> > >>>>>>>>> There's a (very) brief spec here: (I might put more > >>>> detail > >>>>>> into it later, > >>>>>>>>> let me know if that's important to anyone) > >>>>>>>>> > >>>>>>>> > >>>>>> > >>>> https://cwiki.apache.org/confluence/display/TC/Cache-Side+Config+Generation > >>>>>>>>> > >>>>>>>>> And the Draft PR is here: > >>>>>>>>> https://github.com/apache/trafficcontrol/pull/3762 > >>>>>>>>> > >>>>>>>>> This has a number of advantages: > >>>>>>>>> 1. TO is a monolith; this moves a significant amount > >>>> of logic > >>>>>> out of it, > >>>>>>>>> into a smaller per-cache app/library that's easier to > >>>> test, > >>>>>> validate, > >>>>>>>>> rewrite, deploy, canary, rollback, etc. > >>>>>>>>> 2. Deploying cache config changes becomes much smaller and > >>>> safer. > >>>>>> Instead of > >>>>>>>>> having to deploy (and potentially roll back) TO, you > >>>> can > >>>>>> canary deploy on > >>>>>>>>> one cache at a time. > >>>>>>>>> 3. This makes TC more cache-agnostic. It moves cache > >>>> config > >>>>>> generation > >>>>>>>>> logic out of TO, and into an independent app/library. > >>>> The app > >>>>>> (atstccfg) > >>>>>>>> is > >>>>>>>>> actually very similar to Grove's config generator > >>>>>> (grovetccfg). This > >>>>>>>> makes > >>>>>>>>> it easier and more obvious how to write config > >>>> generators for > >>>>>> other > >>>>>>>> proxies. > >>>>>>>>> 4. Using the API and putting the generator > >>>> functions in a > >>>>>> library > >>>>>>>> > >>>>>>>>> gives a lot more flexibility to put the config > >>>> gen > >>>>>> anywhere you > >>>>>>>> want > >>>>>>>>> without too much work. You could easily put it in an > >>>> HTTP > >>>>>> service, or > >>>>>>>> even > >>>>>>>>> put it back in TO via a Plugin. That's not something > >>>> that's > >>>>>> really > >>>>>>>> possible > >>>>>>>>> with the existing system, which generates directly from the > >>>>>> database. > >>>>>>>>> > >>>>>>>>> Right now, I'm just looking for consensus that this is > >>>> the > >>>>>> right > >>>>>>>> approach. > >>>>>>>>> Does the community agree this is the right direction? > >>>> Are > >>>>>> there concerns? > >>>>>>>>> Would anyone like more details about anything in > >>>> particular? > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> > >>>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> > >>>> >
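Following on advantage 4 above: once generation is a pure library, exposing it as an HTTP service (or a TO plugin) is thin glue. This sketch reuses the hypothetical ConfigData/GenerateConfigs names from earlier in the thread; none of it is the actual atstccfg code:

    // A thin HTTP wrapper around a pure config-generation function:
    // POST a ConfigData, get back a JSON map of file name -> contents.
    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    // ConfigData and GenerateConfigs stand in for the hypothetical pure
    // generator library sketched earlier in the thread.
    type ConfigData struct {
        Parameters []struct {
            ConfigFile string `json:"configFile"`
            Name       string `json:"name"`
            Value      string `json:"value"`
        } `json:"parameters"`
    }

    func GenerateConfigs(d ConfigData) (map[string]string, error) {
        out := map[string]string{}
        for _, p := range d.Parameters {
            out[p.ConfigFile] += p.Name + " " + p.Value + "\n"
        }
        return out, nil
    }

    func main() {
        http.HandleFunc("/configs", func(w http.ResponseWriter, r *http.Request) {
            var data ConfigData
            if err := json.NewDecoder(r.Body).Decode(&data); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            configs, err := GenerateConfigs(data)
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            json.NewEncoder(w).Encode(configs)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

The same GenerateConfigs call could just as easily sit behind a TO plugin or be linked straight into the ORT sidecar; only the glue changes.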
