> Communicating between 11 different processes via stdin/stdout and exit
> codes, even if the processes themselves are relatively simple, is fairly
> complex as a whole.

> I don't really see a problem with implementing it as a single
> well-designed binary

It's not that there's a "problem." The question is, if we're doing a rewrite, which design has more pros and fewer cons? Personally, I think the UNIX style has more pros: small, self-contained apps are easier to develop, easier to test, easier to compose, more powerful, and more flexible.

A self-contained executable for each logical "thing" makes each of those things much easier to integration-test on its own. Those tests would otherwise be unit tests, which are more artificial and exercise less of the complete system. At the same time, integration tests of the full pipeline aren't any harder: the test framework can call the Aggregator or the pipeline, just like it would call the monolith, and verify the output.

> stdin/stdout

The input and output can be tested just like functions would be in a monolith. In fact, functions can be very difficult to test if they have side effects, which are very easy and natural to write in most OO/imperative languages. UNIX-style apps make that impossible, or at least very difficult; the more natural way of writing them is to just return output and not change random things on the system, guiding developers toward more "pure" functions/apps. I can't prove that, of course, but I really do think small UNIX-style apps guide developers toward fewer side effects, which makes testing and verifying correctness much easier. I think that's another big advantage of the UNIX Philosophy.

> exit codes

Exit codes should always be 0 unless there's an error. The tool (the Aggregator) or the operator calling the pipeline should return the error if any component fails (a script can do this with "set -o pipefail"). That isn't any more difficult to use than a monolith that checks "if err != nil" after every step; in fact, it's arguably easier, because you don't have to write a manual error check for every call, the failure is automatically propagated up. If any stage of a pipeline fails, its error text is returned for the whole pipeline, and in a test the user is given the exact failure text to go track down.

UNIX-style executables are easier to develop, too. They can be written in different languages, if one language is better suited to a particular task. They can also be worked on separately by different developers, with fewer conflicts: it's clearer what each "one thing" is, whereas in a single large app it's very easy to blur lines and write confusing code around individual things or concepts. More developers can work on ORT concurrently without conflict, because each person only has to keep their app's input and output stable while everyone else works on the other apps in parallel.

They're also easier to compose. Operators can call and pipe whatever they need, whereas with a monolith only what's exposed is available. For example, ORT today _could_ expose a diff argument, but it doesn't, so right now operators have no way to run ORT's custom diff on their own input. Or suppose we didn't expose any kind of caching, but some TC operator needed to cache for a minute to reduce load. With the monolith that's simply impossible, unless devs add it or the operator dives into and modifies the large and complex ORT codebase. But with UNIX-style apps, you can write a quick shell script that calls the ort-get-to-data app, saves the output to a file, skips the call if the file is young enough, and then calls the rest of the pipeline.
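To make that concrete, here's a rough sketch of the kind of wrapper an operator could write. It's only illustrative: ort-get-to-data, ort-make-configs, and ort-diff are names from the proposal, ort-apply is a placeholder for whatever applies the configs, and the real flags and data formats would depend on the final design.

    #!/usr/bin/env bash
    # Any stage failing fails the whole script (the exit-code point above).
    set -o errexit -o pipefail

    CACHE=/var/tmp/ort-to-data.json   # placeholder path
    MAX_AGE_SECONDS=60

    # Only re-fetch from Traffic Ops if the cached copy is missing or too old.
    if [ ! -f "$CACHE" ] || [ "$(( $(date +%s) - $(stat -c %Y "$CACHE") ))" -ge "$MAX_AGE_SECONDS" ]; then
        ort-get-to-data > "$CACHE"
    fi

    # Feed the cached data into the rest of the pipeline.
    ort-make-configs < "$CACHE" | ort-diff | ort-apply

That's a dozen lines an operator can write and own themselves, without touching the ORT codebase at all.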
There are innumerable things like that which are quick and easy with small apps, and difficult or impossible with a monolith.

> I would also like to bring up the idea that we really need to change
> ORT's "pull" paradigm, or at least make the "pull" more efficient so
> that we don't have thousands of ORT instances all making the same
> requests to TO, with TO having to hit the DB for every request even
> though nothing has actually changed. Since we control ORT we have
> nearly 100% of control over all TO API requests made, yet we have a
> design that self-DDOSes itself by default right now. Do we want to
> tackle that problem as part of this redesign, or is that out of scope?

I would vote out-of-scope, and that we discuss it in a different thread as a different project.

Personally, IMO the advantages of "pull," the HTTP client-server model, far outweigh "push." The entire internet is built on client-server, for good reason. Push has a ton of issues: it's stateful, it needs to "register" clients, it needs "brokers," and it adds more points of failure. The ORT/TO scalability problem exists because we don't implement the well-known HTTP standards for checking for updates, namely If-Modified-Since et al. (I hope the irony isn't lost on anyone). IMO the solution is to implement IMS on Traffic Ops, and then make the "Traffic Ops Requestor" do proper IMS requests. Both the network and database costs of an IMS request are tiny, and it solves the issue without all the disadvantages and costs of "push." But I think that project is orthogonal to and independent of this one.
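Just to illustrate what I mean from the client side (not a design; the URL and paths below are made up, not the real TO API):

    #!/usr/bin/env bash
    # Placeholder endpoint and paths, purely to illustrate If-Modified-Since.
    TO_URL="https://to.example.net/config-data-for-this-cache"
    DATA=/var/tmp/ort-to-data.json

    # Send If-Modified-Since only if we already have a previous copy;
    # curl -z uses the file's mtime for the header.
    cond=()
    if [ -f "$DATA" ]; then
        cond=(-z "$DATA")
    fi

    tmp=$(mktemp)
    code=$(curl -s "${cond[@]}" -o "$tmp" -w '%{http_code}' "$TO_URL")

    if [ "$code" = "200" ]; then
        mv "$tmp" "$DATA"   # something changed: run the rest of the pipeline
    else
        rm -f "$tmp"        # 304 Not Modified: nothing to do, and TO did almost no work
    fi

The client-server model stays, TO just answers "nothing changed" cheaply, and none of the statefulness of push is needed.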
On Mon, Apr 13, 2020 at 7:06 PM ocket 8888 <[email protected]> wrote:

> For what it's worth, I'd be +1 on re-examining "push" vs "pull" for ORT.
>
> On Mon, Apr 13, 2020, 16:46 Rawlin Peters <[email protected]> wrote:
>
> > I'm generally +1 on redesigning ORT with the removal of the features
> > you mentioned, but the one thing that worries me is the number of
> > unique binaries/executables involved (potentially 11). Communicating
> > between 11 different processes via stdin/stdout and exit codes, even
> > if the processes themselves are relatively simple, is fairly complex
> > as a whole. IMO I don't really see a problem with implementing it as a
> > single well-designed binary -- if it's Go, each proposed binary could
> > just be its own package instead, with each package only exporting one
> > high-level function. The main func would then be the "Aggregator" that
> > simply calls each package's public function in turn, passing the
> > output of one into the input of the next, checking for errors at each
> > step. I think that would make it much easier to debug and test as a
> > whole.
> >
> > I would also like to bring up the idea that we really need to change
> > ORT's "pull" paradigm, or at least make the "pull" more efficient so
> > that we don't have thousands of ORT instances all making the same
> > requests to TO, with TO having to hit the DB for every request even
> > though nothing has actually changed. Since we control ORT we have
> > nearly 100% of control over all TO API requests made, yet we have a
> > design that self-DDOSes itself by default right now. Do we want to
> > tackle that problem as part of this redesign, or is that out of scope?
> >
> > - Rawlin
> >
> > On Thu, Apr 9, 2020 at 4:57 PM Robert O Butts <[email protected]> wrote:
> > >
> > > I've made a Blueprint proposing to rewrite ORT:
> > > https://github.com/apache/trafficcontrol/pull/4628
> > >
> > > If you have opinions on ORT, please read and provide feedback.
> > >
> > > In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
> > > Philosophy" of small, "do one thing" apps.
> > >
> > > Importantly, the proposal **removes** the following ORT features:
> > >
> > > chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
> > > default Profile runlevel is wrong and broken. But my knowledge of
> > > CentOS, SystemD, chkconfig, runlevels isn't perfect; if I'm mistaken about
> > > this and you're using ORT to set chkconfig, please let us know ASAP.
> > >
> > > ntpd - ORT today has code to set ntpd config and restart the ntpd service.
> > > I have no idea why it was ever in charge of this, but this clearly seems to
> > > be the system's job, not ORT or TC's.
> > >
> > > interactive mode - I asked around, and couldn't find anyone using this.
> > > Does anyone use it? And feel it's essential to keep in ORT? And also feel
> > > that the way this proposal breaks up the app so that it's easy to request
> > > and compare files before applying them isn't sufficient?
> > >
> > > reval mode - This was put in because ORT was slow. ORT in master now takes
> > > 10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
> > > faster than just applying everything. Does anyone feel otherwise?
> > >
> > > report mode - The functionality here is valuable. But the intention here is
> > > to replace "ORT report mode" with a pipelined set of app calls or a script
> > > to do the same thing. I.e. because it's "UNIX-Style" you can just
> > > "ort-to-get | ort-make-configs | ort-diff".
> > >
> > > package installation - This is the biggest feature the proposal removes,
> > > and probably the most controversial. The thought is: this isn't something
> > > ORT or Traffic Control should be doing. The same thing that manages the
> > > physical machine and/or operating system -- whether that's Ansible, Puppet,
> > > Chef, or a human System Administrator -- should be installing the OS
> > > packages for ATS and its plugins, just like it manages all the other
> > > packages on your system. ORT and TC should deploy configuration, not
> > > install things.
> > >
> > > So yeah, feedback welcome. Feel free to post it on the list here or the
> > > blueprint PR on github.
> > >
> > > Thanks,
> > >
