It looks like we've got two main ideas going on here, plus a few
related ones.

The first is cache-side processing of the data. For this, TO's role
would shift from providing config files to providing the data used to
build config files. Rob's suggestion is to implement this a bit at a
time, not unlike how the vampire proxy works. So rather than ORT
calling TO's config endpoints directly, ORT will call a new tool
(installed and managed as part of the ORT RPM) which, for supported
configs, uses the TO API to acquire the data necessary for producing
the config and, for unsupported configs, falls back to the existing
TO config endpoints. In this way, new configs can be introduced one
at a time and tested as part of a rollout.

The second idea is to invert the locus of control for config updates.
As it stands, each cache is responsible for determining precisely when
it wants to update. Many folks use a cron job to run the update task
periodically, but that's just one possible strategy. This idea is to
move that decision from the cache to TO: TO would push information to
clients and tell them to update. That would give TO the ability to
rate-limit updates and perform selective updates to slow-roll a change
or canary test it for a bit. There are some potential downsides here,
though, regarding update confirmation, handling offline or unreachable
caches, and dealing with initialization during the window between a
config change and full rollout. It's also worth noting the flexibility
of the current model: since caches are "in charge" of the decision to
update, they can delegate that decision to other systems however they
wish. It's not hard to imagine someone who wants centralized control
using Ansible or a similar tool to drive updates instead of having
them happen automatically via a cron job.
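
To make the rate-limiting upside concrete, here's a rough sketch of
what a TO-side push loop could look like. Nothing like this exists in
TO today; all the names are made up, and the hard parts (update
confirmation, handling unreachable caches) are only marked with a
comment.

    package main

    import (
        "fmt"
        "sync"
    )

    // pushUpdates tells queued caches to update, at most maxConcurrent at
    // a time, so a change can be slow-rolled or canaried. notify would be
    // an HTTP call (or similar) telling one cache to run its update now.
    func pushUpdates(caches []string, maxConcurrent int, notify func(cache string) error) {
        sem := make(chan struct{}, maxConcurrent) // concurrency limiter
        var wg sync.WaitGroup
        for _, c := range caches {
            wg.Add(1)
            sem <- struct{}{} // blocks while maxConcurrent pushes are in flight
            go func(cache string) {
                defer wg.Done()
                defer func() { <-sem }()
                if err := notify(cache); err != nil {
                    // This is the open question: retry? queue for later? alert?
                    fmt.Println("push failed:", cache, err)
                }
            }(cache)
        }
        wg.Wait()
    }

    func main() {
        caches := []string{"edge-1", "edge-2", "edge-3"}
        pushUpdates(caches, 2, func(c string) error {
            fmt.Println("telling", c, "to update")
            return nil
        })
    }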

These are the two main points I think we've got on the table. Somewhat
related to these points are a few other independent ideas:

Inverting the locus of control for transmitted data selection. This is
mostly relevant if we invert the locus of control for config update.
Config updates can be pushed either by affirmatively telling the
client to pull updates or by pushing the update data directly to the
client. Only the server knows what was just changed, so it may be in
the best position to select data that the client will certainly need.
But only the client knows what its disk looks like, so only the client
can ensure that it asks for updates for every component that requires
it.

Switching the data transmission mechanism from the TO API to a direct
database connection. This has serious security, backward and forward
compatibility, and data integrity concerns that would need to be
carefully weighed before pursuing this option.

Changing the transport from HTTP over TCP to something over Kafka.
Kafka can provide buffering and reliability that TCP can struggle with
at times. It comes with a significant complexity cost, though, and
there's not a great deal of evidence so far that TCP is inadequate for
the purpose.
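
For what it's worth, the consumer side of the compacted-topic idea
Geoff describes below is simple in itself; here's a rough sketch using
the segmentio/kafka-go client, with made-up broker and topic names.
The complexity cost I'm referring to is operational (running and
monitoring brokers), not code like this:

    package main

    import (
        "context"
        "fmt"

        kafka "github.com/segmentio/kafka-go"
    )

    func main() {
        // Read one partition of a (hypothetical) compacted config topic
        // from the beginning: replaying every retained message rebuilds
        // the current key/value state, since compaction keeps only the
        // latest message per key.
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers:   []string{"kafka.example.com:9092"},
            Topic:     "cdn-config",
            Partition: 0,
        })
        defer r.Close()
        r.SetOffset(kafka.FirstOffset)

        state := map[string][]byte{}
        for {
            m, err := r.ReadMessage(context.Background())
            if err != nil {
                break
            }
            state[string(m.Key)] = m.Value // newest value per key wins
            fmt.Printf("applied update for %q (offset %d)\n", m.Key, m.Offset)
            // A real consumer never exits this loop: after catching up on
            // the backlog it keeps receiving new config updates live.
        }
    }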

That summarizes the points; here's where I fall on each:

  - Cache-side processing: The benefits outweigh the costs. The
ability to roll out changes to the generation logic selectively, one
cache at a time, is very valuable.
  - Invert the LoC for Config Update: The costs outweigh the benefits.
We can get 90% of the value here by simply supporting
If-Modified-Since and If-None-Match in the API (see the sketch after
this list). That's pretty easy, and we should definitely do it.
  - Invert the LoC for Data Selection: The server might be able to
reduce the data transmitted, but that leads to fragility if the cache
ever winds up out of sync for any reason. I think the benefit of
config resiliency greatly outweighs the relatively minor benefit of
reduced data transmission and processing.
  - Direct Database Connection: I think this creates far more problems
than it solves.
  - Kafka Transport: I don't see evidence that TCP is inadequate, so
the complexity feels unjustified to me. Kafka adds another point of
failure, which reduces the stability of the system as a whole. That's
OK when the benefit justifies it, but I don't think it does here.
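
On the If-Modified-Since/If-None-Match point: the client side of that
is just a conditional GET. Here's a minimal sketch in Go (the URL is a
placeholder, and the API would of course also need to start returning
ETag/Last-Modified headers for this to work):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // fetchIfChanged does a conditional GET: it sends the ETag saved from
    // the last successful fetch, and treats 304 Not Modified as "nothing
    // to do". lastETag is empty on the first run.
    func fetchIfChanged(url, lastETag string) (body []byte, etag string, changed bool, err error) {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, lastETag, false, err
        }
        if lastETag != "" {
            req.Header.Set("If-None-Match", lastETag)
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, lastETag, false, err
        }
        defer resp.Body.Close()
        if resp.StatusCode == http.StatusNotModified {
            return nil, lastETag, false, nil // unchanged: skip the update entirely
        }
        b, err := io.ReadAll(resp.Body)
        if err != nil {
            return nil, lastETag, false, err
        }
        return b, resp.Header.Get("ETag"), true, nil
    }

    func main() {
        // Placeholder URL; a real caller would persist the returned ETag.
        body, etag, changed, err := fetchIfChanged("https://to.example.com/api/...", "")
        fmt.Println(len(body), etag, changed, err)
    }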

On Wed, Jul 31, 2019 at 9:20 AM Gray, Jonathan
<[email protected]> wrote:
>
> Smaller, simpler pieces closer to the cache that do one job are far simpler 
> to maintain, triage, and build.  I'm not a fan of trying to inject a message 
> bus in the middle of everything.
>
> Jonathan G
>
>
> On 7/31/19, 8:48 AM, "Genz, Geoffrey" <[email protected]> wrote:
>
>     To throw a completely different idea out there . . . some time ago Matt 
> Mills was talking about using Kafka as the configuration transport mechanism 
> for Traffic Control.  The idea is to use a Kafka compacted topic as the 
> configuration source.  TO would write database updates to Kafka, and the ORT 
> equivalent would pull its configuration from Kafka.
>
>     To explain compacted topics a bit, a standard Kafka message is a key and 
> a payload; in a compacted topic, only the most recent message/payload with a 
> particular key is kept.  As a result, reading all the messages from a topic 
> will give you the current state of what's basically a key value store, with 
> the benefit of not doing actual mutations of data.  So a cache could get the 
> full expected configuration by reading all the existing messages on the 
> appropriate topic, as well as get new updates to configuration by listening 
> for new Kafka messages.
>
>     This leaves the load on the Kafka brokers, which, I can assure you given 
> recent experience, is minimal.  TO would only have the responsibility of 
> writing database updates to Kafka, ORT only would need to read individual 
> updates (and be smart enough to know how and when to apply them -- perhaps 
> hints could be provided in the payload?).  The result is TO is "pushing" 
> updates to the caches (via Kafka) as Rawlin was proposing, and ORT could 
> still pull the full configuration whenever necessary with no hit to Postgres 
> or TO.
>
>     Now this is obviously a radical shift (and there are no doubt other ways 
> to implement the basic idea), but it seemed worth bringing up.
>
>     - Geoff
>
>     On 7/31/19, 8:30 AM, "Lavanya Bathina" <[email protected]> wrote:
>
>         +1 on this
>
>         On Jul 30, 2019, at 6:01 PM, Rawlin Peters <[email protected]> 
> wrote:
>
>         I've been thinking for a while now that ORT's current pull-based model
>         of checking for queued updates is not really ideal, and I was hoping
>         with "ORT 2.0" that we would switch that paradigm around to where TO
>         itself would push updates out to queued caches. That way TO would
>         never get overloaded because we could tune the level of concurrency
>         for pushing out updates (based on server capacity/specs), and we would
>         eliminate the "waiting period" between the time updates are queued and
>         the time ORT actually updates the config on the cache.
>
>         I think cache-side config generation is a good idea in terms of
>         enabling canary deployments, but as CDNs continue to scale by adding
>         more and more caches, we might want to get out ahead of the ORT
>         load/waiting problem by flipping that paradigm from "pull" to "push"
>         somehow. Then instead of 1000 caches all asking TO the same question
>         and causing 1000 duplicated reads from the DB, TO would just read the
>         one answer from the DB and send it to all the caches, further reducing
>         load on the DB as well. The data in the "push" request from TO to ORT
>         2.0 would contain all the information ORT would request from the API
>         itself, not the actual config files.
>
>         With the API transition from Perl to Go, I think we're eliminating the
>         Perl CPU bottleneck from TO, but the next bottleneck seems like it
>         would be reading from the DB due to the constantly growing number of
>         concurrent ORT requests as a CDN scales up. We should keep that in
>         mind for whatever "ORT 2.0"-type changes we're making so that it won't
>         make flipping that paradigm around even harder.
>
>         - Rawlin
>
>         > On Tue, Jul 30, 2019 at 4:23 PM Robert Butts <[email protected]> 
> wrote:
>         >
>         >> I'm confused why this is separate from ORT.
>         >
>         > Because ORT does a lot more than just fetching config files. 
> Rewriting all
>         > of ORT in Go would be considerably more work. Contrariwise, if we 
> were to put
>         > the config generation in the ORT script itself, we would have to 
> write it
>         > all from scratch in Perl (the old config gen used the database 
> directly,
>         > it'd still have to be rewritten) or Python. This was just the 
> easiest path
>         > forward.
>         >
>         >> I feel like this logic should just be replacing the config 
> fetching logic
>         > of ORT
>         >
>         > That's exactly what it does: the PR changes ORT to call this app 
> instead of
>         > calling Traffic Ops over HTTP:
>         > 
> https://github.com/apache/trafficcontrol/pull/3762/files#diff-fe8a3eac71ee592a7170f2bdc7e65624R1485
>         >
>         >> Is that the eventual plan? Or does our vision of the future 
> include this
>         > *and* ORT?
>         >
>         > I reserve the right to develop a strong opinion about that in the 
> future.
>         >
>         >
>         > On Tue, Jul 30, 2019 at 3:17 PM ocket8888 <[email protected]> 
> wrote:
>         >
>         >>> "I'm just looking for consensus that this is the right approach."
>         >>
>         >> Umm... sort of. I think moving cache configuration to the cache 
> itself
>         >> is a great idea,
>         >>
>         >> but I'm confused why this is separate from ORT. Like if this is 
> going to
>         >> be generating the
>         >>
>         >> configs and it's already right there on the server, I feel like 
> this
>         >> logic should just be
>         >>
>         >> replacing the config fetching logic of ORT (and personally I think 
> a
>         >> neat place to try it
>         >>
>         >> out would be in ORT.py).
>         >>
>         >>
>         >> Is that the eventual plan? Or does our vision of the future 
> include this
>         >> *and* ORT?
>         >>
>         >>
>         >>> On 7/30/19 2:15 PM, Robert Butts wrote:
>         >>> Hi all! I've been working on moving the ATS config generation from
>         >> Traffic
>         >>> Ops to a standalone app alongside ORT, that queries the standard 
> TO API
>         >> to
>         >>> generate its data. I just wanted to put it here, and get some 
> feedback,
>         >> to
>         >>> make sure the community agrees this is the right direction.
>         >>>
>         >>> There's a (very) brief spec here: (I might put more detail into 
> it later,
>         >>> let me know if that's important to anyone)
>         >>>
>         >> 
> https://cwiki.apache.org/confluence/display/TC/Cache-Side+Config+Generation
>         >>>
>         >>> And the Draft PR is here:
>         >>> https://github.com/apache/trafficcontrol/pull/3762
>         >>>
>         >>> This has a number of advantages:
>         >>> 1. TO is a monolith, this moves a significant amount of logic out 
> of it,
>         >>> into a smaller per-cache app/library that's easier to test, 
> validate,
>         >>> rewrite, deploy, canary, rollback, etc.
>         >>> 2. Deploying cache config changes is much smaller and safer. 
> Instead of
>         >>> having to deploy (and potentially roll back) TO, you can canary 
> deploy on
>         >>> one cache at a time.
>         >>> 3. This makes TC more cache-agnostic. It moves cache config 
> generation
>         >>> logic out of TO, and into an independent app/library. The app 
> (atstccfg)
>         >> is
>         >>> actually very similar to Grove's config generator (grovetccfg). 
> This
>         >> makes
>         >>> it easier and more obvious how to write config generators for 
> other
>         >> proxies.
>         >>> 4. By using the API and putting the generator functions in a 
> library,
>         >> this
>         >>> really gives a lot more flexibility to put the config gen 
> anywhere you
>         >> want
>         >>> without too much work. You could easily put it in an HTTP 
> service, or
>         >> even
>         >>> put it back in TO via a Plugin. That's not something that's really
>         >> possible
>         >>> with the existing system, generating directly from the database.
>         >>>
>         >>> Right now, I'm just looking for consensus that this is the right
>         >> approach.
>         >>> Does the community agree this is the right direction? Are there 
> concerns?
>         >>> Would anyone like more details about anything in particular?
>         >>>
>         >>> Thanks,
>         >>>
>         >>
>
>
>
>
