If all the binaries are compiled and shipped together all the time, I could go either way. The main gain I can see, though, would be in debugging and test scope with small binaries. It's easier to say affirmatively that change A's scope of effect is limited to one testable object of four than it is to wonder whether the monolith has some other dependent codepaths that have to be checked as well. Having one binary to rule them all, I think, fosters a human habit of scope creep: just make it do one more thing instead of focusing on a specific set of jobs. I'm a huge fan of putting more responsibility on the system operators to use their native toolsets for several of the jobs ORT has traditionally done. That lowers the project's overall maintenance obligation and provides greater flexibility, so it's easier to break into new environment configurations.

I'm also not a fan (-1) of push instead of pull. It trades the DDoS problem you mention for having to manage all the orchestration surrounding when things apply, plus a whole new set of error cases where a push message gets missed in the network somewhere. Even if you think a message bus of some kind makes it better, that just adds another layer of complexity and another fault domain to the overall solution. A fast-enough poll is also indistinguishable from push.

Instead, I think it's more worth looking at how to "take the mass out of the hammer". We're making significant strides to reduce our most expensive queries now, and that's only going to get better with flexible cachegroups. HTTP caching could get us a very long way for things like making ORT take a smaller resource hit or making TP more responsive. If the database queries are still too much, we could look at splitting read queries off onto a separate connection string for multiple RO replicas.

Jonathan G

On 4/13/20, 4:46 PM, "Rawlin Peters" <[email protected]> wrote:

I'm generally +1 on redesigning ORT with the removal of the features you mentioned, but the one thing that worries me is the number of unique binaries/executables involved (potentially 11). Communicating between 11 different processes via stdin/stdout and exit codes, even if the processes themselves are relatively simple, is fairly complex as a whole. IMO I don't really see a problem with implementing it as a single well-designed binary -- if it's Go, each proposed binary could just be its own package instead, with each package only exporting one high-level function. The main func would then be the "Aggregator" that simply calls each package's public function in turn, passing the output of one into the input of the next, checking for errors at each step. I think that would make it much easier to debug and test as a whole.
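
For illustration only, the "Aggregator" idea described above could look roughly like the sketch below. It is not taken from the blueprint: the stage names (getData, makeConfigs, diffConfigs), their string-in/string-out signature, and the placeholder bodies are all assumptions, and the stages are collapsed into one file where a real layout would give each its own package.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// stage is one step of the pipeline. In a real layout each stage would be a
// separate package exporting a single function of this shape (hypothetical).
type stage struct {
	name string
	run  func(input string) (string, error)
}

// getData stands in for a hypothetical "fetch data from Traffic Ops" package.
func getData(input string) (string, error) {
	host := strings.TrimSpace(input)
	if host == "" {
		return "", errors.New("no server name given")
	}
	return "server=" + host, nil
}

// makeConfigs stands in for a hypothetical "generate config files" package.
func makeConfigs(input string) (string, error) {
	return "# generated\n" + input + "\n", nil
}

// diffConfigs stands in for a hypothetical "compare against disk" package.
// Here it only counts the lines it was handed.
func diffConfigs(input string) (string, error) {
	return fmt.Sprintf("%d line(s) would change", strings.Count(input, "\n")), nil
}

// defaultStages lists the stages in the order the aggregator runs them.
func defaultStages() []stage {
	return []stage{
		{name: "get", run: getData},
		{name: "make-configs", run: makeConfigs},
		{name: "diff", run: diffConfigs},
	}
}

// aggregate calls each stage in turn, feeding the output of one into the
// next, and stops at the first error, naming the stage that failed.
func aggregate(input string, stages []stage) (string, error) {
	data := input
	for _, s := range stages {
		out, err := s.run(data)
		if err != nil {
			return "", fmt.Errorf("stage %s: %w", s.name, err)
		}
		data = out
	}
	return data, nil
}

func main() {
	out, err := aggregate("cache-01", defaultStages())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

Because every stage is an ordinary function, each one can be unit tested on its own, and the error wrapping in aggregate plays the role that exit codes would play between separate processes.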

I would also like to bring up the idea that we really need to change ORT's "pull" paradigm, or at least make the "pull" more efficient, so that we don't have thousands of ORT instances all making the same requests to TO, with TO having to hit the DB for every request even though nothing has actually changed. Since we control ORT, we control nearly 100% of the TO API requests made, yet right now we have a design that DDoSes itself by default. Do we want to tackle that problem as part of this redesign, or is that out of scope?

- Rawlin

On Thu, Apr 9, 2020 at 4:57 PM Robert O Butts <[email protected]> wrote:
>
> I've made a Blueprint proposing to rewrite ORT:
> https://urldefense.com/v3/__https://github.com/apache/trafficcontrol/pull/4628__;!!CQl3mcHX2A!WP8MIrdRGn9EvXJUOSFoKai78dFn2hTY6cWc-BQ29yg69KNi_bYeuPFZaKxRSgsU2s3r$
>
> If you have opinions on ORT, please read and provide feedback.
>
> In a nutshell, it's proposing to rewrite ORT in Go, in the "UNIX
> Philosophy" of small, "do one thing" apps.
>
> Importantly, the proposal **removes** the following ORT features:
>
> chkconfig - CentOS 7+ and SystemD don't use chkconfig, and moreover our
> default Profile runlevel is wrong and broken. But my knowledge of
> CentOS, SystemD, chkconfig, and runlevels isn't perfect. If I'm mistaken
> about this and you're using ORT to set chkconfig, please let us know ASAP.
>
> ntpd - ORT today has code to set ntpd config and restart the ntpd service.
> I have no idea why it was ever in charge of this, but this clearly seems to
> be the system's job, not ORT or TC's.
>
> interactive mode - I asked around, and couldn't find anyone using this.
> Does anyone use it? And feel it's essential to keep in ORT? And also feel
> that the way this proposal breaks up the app so that it's easy to request
> and compare files before applying them isn't sufficient?
>
> reval mode - This was put in because ORT was slow. ORT in master now takes
> 10-20s on our large CDN. Moreover, "reval" mode is no longer significantly
> faster than just applying everything. Does anyone feel otherwise?
>
> report mode - The functionality here is valuable. But the intention here is
> to replace "ORT report mode" with a pipelined set of app calls or a script
> to do the same thing. I.e. because it's "UNIX-Style" you can just
> "ort-to-get | ort-make-configs | ort-diff".
>
> package installation - This is the biggest feature the proposal removes,
> and probably the most controversial. The thought is: this isn't something
> ORT or Traffic Control should be doing. The same thing that manages the
> physical machine and/or operating system -- whether that's Ansible, Puppet,
> Chef, or a human System Administrator -- should be installing the OS
> packages for ATS and its plugins, just like it manages all the other
> packages on your system. ORT and TC should deploy configuration, not
> install things.
>
> So yeah, feedback welcome. Feel free to post it on the list here or the
> blueprint PR on github.
>
> Thanks,
