There’s an old saying that there are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. Caching of computations is basically a classic time-space tradeoff, but caching is also used for other purposes: for fault-tolerance, to trade off network bandwidth against storage, for privacy, and to improve scalability by minimizing the work done at a potential hotspot in a distributed system, among other things.
I propose that delegating almost all caching of computations to a dedicated computation-caching subsystem would have substantial benefits for computer system design, enabling transparent parallelism, network transparency, and fault tolerance, while improving performance and substantially simplifying application code.

Examples of caching of computation
----------------------------------

Current computer systems use many layers of caching, implemented in many different ways, many of them just to cache computations.

### MVC ###

One broad generalization is that views in MVC applications contain cached representations of the state of the models. That’s why they get notified when the model changes: so they can update their representation. (Rails programmers are probably confused here, because what Rails calls “MVC” doesn’t work this way.)

### In a terminal text editor ###

For example, I’m editing this text file in the vim editor, running inside a terminal emulator displaying in X-Windows.

#### Buffer contents are a cached computation on command history ####

As I edit, the vim editor executes a computation on the file’s initial contents and my sequence of commands to get the buffer’s new contents, and stores the result of this computation in the file every time I type `:w`. You could think of the file’s contents as being a cached result of executing the entire editing history.

#### Text screen contents are a cached computation on buffer contents ####

Also, my terminal emulator’s memory contains a cached representation of the visible part of the buffer’s contents. When the buffer’s contents change, this cache becomes invalid and needs to be updated; vim, using curses, carefully calculates a minimal sequence of edit operations to be applied to its local image of my screen contents to make them match the file, and encodes that sequence as a string and sends it to the terminal emulator.

If the two replicas of this cache get out of sync, for example because some other program wrote some text onto my terminal, the desynchronized state can last quite some time; vim provides an explicit command, `^L`, to request resynchronization by resending the entire screen contents instead of an incremental update.

#### Framebuffer contents are a cached computation on text and font ####

The framebuffer *also* contains a cached representation of those same screen contents, and the same kind of crazy synchronization dance ensues between the terminal emulator and the X-Windows server, which manages the framebuffer contents. However, the X server is a lot more competent at preventing desynchronization, because it speaks a more featureful protocol to its applications, such as the terminal emulator, than the terminal emulator does to its applications, such as vim.

#### X-Windows apps cache information about the X server ####

At a more detailed level of the X11 protocol, when the terminal emulator app starts up, it gets information about the X server’s screen configuration (available visuals, root window ID, and so on), which it then caches for the rest of its run and uses for all future drawing commands. There is in general no way to notify the app that these cached values have become invalid and need to be refetched. This is one reason that it’s relatively straightforward to detach an ANSI-character-terminal app from one ANSI terminal emulator and reattach it to another one, but essentially impossible to do the same thing for an X-Windows application.
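All of these layers instantiate the same basic pattern: a cached representation of some upstream state, plus (in the luckier cases) a notification channel for invalidating it. Here is a minimal sketch of that pattern in Python, with hypothetical names; the X-server-configuration case is what you get when the `subscribe` half is missing.

    def expensive_render(text):
        """Stand-in for real work: glyph rasterization, screen layout, etc."""
        return text.upper()

    class Model:
        """Upstream state; notifies observers when it changes."""
        def __init__(self, text):
            self._text = text
            self._observers = []

        def subscribe(self, callback):
            self._observers.append(callback)

        def set_text(self, text):
            self._text = text
            for notify in self._observers:
                notify()                      # invalidate downstream caches

        def get_text(self):
            return self._text

    class View:
        """Downstream cache of an expensive computation on the model."""
        def __init__(self, model):
            self._model = model
            self._rendered = None             # the cached representation
            model.subscribe(self.invalidate)  # the half the X server case lacks

        def invalidate(self):
            self._rendered = None             # mark stale; recompute on demand

        def render(self):
            if self._rendered is None:        # cache miss
                self._rendered = expensive_render(self._model.get_text())
            return self._rendered

    m = Model("some text")
    v = View(m)
    print(v.render())         # computed
    m.set_text("new text")    # notification marks the view stale
    print(v.render())         # recomputed on demand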
#### Font bitmaps are cached renderings of the font ####

When the terminal emulator sends update commands to the X server, it requests the drawing of strings of characters in a specified font at specified locations. The process of rendering a character from a font can be sufficiently time-consuming to be a performance bottleneck, especially when a screen frame can have tens of thousands of them at once; so the X server has a special-purpose cache for rendered glyphs from fonts.

But wait! The operations that use that special-purpose cache don’t support alpha-blending. And the terminal emulator I’m using at the moment uses alpha-blending to get antialiased rendering of the fonts it uses, so it uses the X Render Extension, driven by a library called Xft, to manage its very own cache of font glyphs in the X server. (I don’t know if this cache is shared between X applications.)
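At bottom, a glyph cache is just memoization keyed on something like (font, character, size). Here is a sketch, with hypothetical names standing in for the server-side machinery:

    from functools import lru_cache

    def rasterize(font, char, size):
        """Stand-in for the expensive outline-to-bitmap step."""
        return ("bitmap", font, char, size)

    @lru_cache(maxsize=4096)              # the glyph cache itself
    def glyph(font, char, size):
        return rasterize(font, char, size)

    def draw_string(font, size, s):
        # Each distinct (font, char, size) is rasterized only once,
        # so a screenful of text mostly hits the cache.
        return [glyph(font, c, size) for c in s]

The Xft/Render duplicate described above amounts to a second memo table with an extra key (antialiased, alpha-blended rendering), maintained only because the first table can’t hold that kind of entry.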
#### Compiled programs are cached results of a computation on source code ####

Also, the code for the terminal emulator, the X server, Xft, and vim is all written in C. Compiling this amount of C code into machine code, so you can execute it, used to be time-consuming in the 1980s, and you can maybe still amp up your compiler optimizations to the point where it’s time-consuming today, so the executable code for those programs is cached. In fact, this particular caching system is so complicated that it has a world-wide organization consisting of thousands of people largely to manage it, called “Debian”. It includes components such as `make`, object and executable file formats, binary packages for different architectures, the linker (as distinct from the compiler), the dynamic linker, the configuration for the dynamic linker, and sometimes the “derived-object wink-in” feature of ClearCase and the closely analogous features of the Vesta version control system.

(Debian actually spends a lot of time on things other than compilation and managing compilation output. They also archive and version source code, track and fix bugs, test software, establish interoperability standards, document undocumented software, and so on. But having to struggle with the build infrastructure costs them a lot of time that could be spent doing valuable work like the above.)

#### X11 apps also cache Compose key bindings ####

Another thing the terminal emulator caches is my [Compose key bindings][xcompose]. The X input method linked into it reads the `.XCompose` file at startup, apparently in order to build a table of compose-key sequences, and then never rereads it. This means that I have to kill and restart my terminals to find out if a new binding I’ve added to the file works properly, or start a different terminal process, such as xterm; and I can’t take advantage of new bindings until I restart the terminal. On the other hand, I can also test the new bindings without worrying that I’ll break my keyboard setup.

[xcompose]: https://github.com/kragen/xcompose

#### `gnome-terminal` caches GConf settings ####

The terminal emulator application, `gnome-terminal`, chooses the font based on a “profile” stored in GConf, which is managed by `gconfd`, but naturally cached within the terminal emulator process in a `TerminalProfile` object. (You can have several different profiles for different kinds of terminals.) This means that the font-glyph cache mentioned earlier is a cache of, among other things, configuration information stored in GConf.

GConf supports asynchronous event notification, so that if you change, say, `/apps/gnome-terminal/profiles/Default/font` using `gconf-editor`, a callback fires to update the `TerminalProfile` in each of your open terminals using the Default profile, and they will immediately redraw with the new font. However, this cache invalidation mechanism is limited; as explained in the comments:

    * It's somewhat tricky to manage the profiles/ dir since we need to track
    * the list of profiles, but gconf doesn't have a concept of notifying that
    * a directory has appeared or disappeared.

#### `gnome-terminal` shares cache by running many windows in a process ####

All of the computation and I/O that `gnome-terminal` does at startup, including reading profiles from GConf, reading my Compose key bindings, rendering fonts to images, and many other things, adds up to about two seconds of CPU time on my netbook. In order to cache all of this for future windows, `gnome-terminal` (like `dtterm`, the forgotten CDE terminal program) runs all terminal windows from a single process; at startup it uses D-BUS to check whether there’s an existing terminal emulator process running to which it can pass its command-line arguments, which takes less than 200ms. So D-BUS is here being used as part of a system for searching for existing cached computation results, in the form of a running process.

#### And that’s only the beginning ####

There are several other layers here that I’m not getting into at all: GDK, GTK+, Pango, Cairo, and D-BUS, among others. Each of these has its own state, most of which is cached results of previous computations that could be recomputed if necessary — either in principle or in fact.

Consequences of caching of computation
--------------------------------------

As illustrated above, cached computation results are pervasive in software at all scales, from single bytes to multi-gigabyte software distributions. This caching is essential to making software acceptably efficient, but the caches are managed by a mélange of different mechanisms; deficiencies in these mechanisms introduce a huge number of bugs and inflexibilities into the software; and the caching mechanisms themselves are a huge amount of code, with its attendant maintenance cost.

Something that may not be obvious is that this diversity of caching mechanisms inevitably leads to irrational caching tradeoffs and poor systemwide performance. This is because the time-space tradeoffs are being made in isolation from one another, and the time savings bought by a megabyte of RAM varies wildly from one part of the system to another. Consequently, we have some parts of the system that are unnecessarily slow because they don’t cache enough, and other parts of the system that are unnecessarily memory-hogging because they cache too much — causing everything else to run more slowly.

Worse, the value of each cached item — the amount of computation it saves, in the end — can’t be evaluated in isolation. It depends on how often that item is needed, which is much more common if the things computed *from* it aren’t cached effectively; and it depends on how much computation is needed to compute it, which depends on how effectively the things it’s computed from are cached. So picking the ideal set of computations to cache, even given perfect knowledge of future demands, is a global optimization problem that can’t be solved by composing local solutions to subproblems.
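To make that interdependence concrete, here is a toy calculation with made-up numbers: a slow parse feeding a cheaper render. The benefit of caching the parse tree depends entirely on whether the renders downstream of it are cached.

    # Hypothetical costs, purely illustrative.
    PARSE_COST = 100     # ms to parse the source
    RENDER_COST = 10     # ms to render one view from the parse tree
    DEMANDS_PER_SECOND = 5

    def compute_ms_per_second(cache_parse, cache_render):
        """Expected compute per second while the source data is unchanged."""
        renders = 0 if cache_render else DEMANDS_PER_SECOND
        # The parse is only demanded when a render actually runs.
        parses = 0 if cache_parse else renders
        return parses * PARSE_COST + renders * RENDER_COST

    for cache_parse in (False, True):
        for cache_render in (False, True):
            print(cache_parse, cache_render,
                  compute_ms_per_second(cache_parse, cache_render), "ms/s")

Caching the parse tree is worth 500ms of compute per second when renders are uncached, and nothing at all when they are cached; no local policy that looks at the parse cache alone can see this.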
The alternative: a unified cache system
---------------------------------------

So suppose that, instead of a mélange of different caching mechanisms, we have a single systemwide caching mechanism. All stored data is classified either as “source data” — which came from outside the computer system and thus cannot be regenerated if it is erased — or as “derived data”, also called “cached computation”, which is anything that could be regenerated by deterministic computation from the source data.

The systemwide caching mechanism knows every cached computational result in the system, what the inputs to the computation were (both source inputs and derived inputs), and how long it took. So it can use an intelligent systemwide policy to figure out which cached results are likely to save the most computation time in the future, taking into account how long they took to compute and what other previously-cached results would need to be recomputed. Like `make`, it can also automatically invalidate cached results when their inputs change — whether those results are compiled object code, font glyphs, or screen images.

### The caching system interface replaces the filesystem interface ###

To work reliably, such a caching system must mediate all access to persistently stored data on the computer system, or at least the part of the system it controls, whether that data is cached computation or received through I/O. All of the “application code” in the system is in effect side-effect-free, or purely functional, even if it uses side effects in its internal implementation. The caching system becomes a filesystem, like Vesta, but more than a filesystem, since it’s also responsible for I/O. And by virtue of mediating all persistent data access, like MapReduce, it can achieve transparent parallelism, transparent network distribution, and transparent fault-tolerance, by running the computations whose results are to be cached on whatever computer it wants, and possibly on multiple computers in case one of them crashes or runs slowly. (The caching system looks a lot like a software transactional memory system.)
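Here is a minimal sketch of the core mechanism, in Python with invented names. It tracks only source-data dependencies and keeps everything in RAM; a real system would persist its entries, track derived-on-derived dependencies, and mediate I/O as described above.

    import time

    class CacheSystem:
        def __init__(self):
            self.source = {}     # name -> source data (cannot be regenerated)
            self.versions = {}   # name -> version counter, bumped on change
            self.entries = {}    # key  -> (value, deps, seconds to compute)

        def set_source(self, name, value):
            """Source data changes imperatively; everything else is derived."""
            self.source[name] = value
            self.versions[name] = self.versions.get(name, 0) + 1

        def get(self, fn, *args):
            """Demand fn(*args), reusing the cached result if still valid."""
            key = (fn.__name__, args)
            if key in self.entries:
                value, deps, _ = self.entries[key]
                if all(self.versions[n] == v for n, v in deps.items()):
                    return value               # all inputs unchanged: a hit
            deps = {}
            def read(name):                    # the only access to source data
                deps[name] = self.versions[name]
                return self.source[name]
            start = time.time()
            value = fn(read, *args)            # fn must be pure given read()
            self.entries[key] = (value, deps, time.time() - start)
            return value

    cs = CacheSystem()
    cs.set_source("file.txt", "hello")

    def line_count(read, name):
        return read(name).count("\n") + 1

    print(cs.get(line_count, "file.txt"))      # computed; dependencies recorded
    print(cs.get(line_count, "file.txt"))      # cache hit
    cs.set_source("file.txt", "a\nb")          # bumps the version
    print(cs.get(line_count, "file.txt"))      # automatically recomputed

Because `fn` sees its inputs only through `read()`, the system is free to run it on another machine, or on several; and the recorded compute time per entry is exactly what a systemwide eviction policy (sketched below) would weigh against the entry’s size.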
### But the first step is to boil the ocean. ###

The biggest disadvantage is that it requires you to rewrite all of your software with a different fundamental design to get these benefits. (There are other disadvantages as well: worst-case performance would be under the control of the caching system, rather than user code. Whether such a system’s policies could be tuned for competitive performance across a wide range of domains remains to be seen. My intuition says yes, but we’d need real systems to prove it.) However, lots of existing program code could still be used inside such a system, just as you can use existing compilers in Vesta.

### There’s a mountain of related work. ###

I described a hypothetical [concrete example of such a system in a specific domain][compositing] in 2005. That post contains references to a lot of related ideas, which I won’t repeat here.

[compositing]: http://lists.canonical.org/pipermail/kragen-tol/2005-November/000810.html "Make for URLs, or dependency-directed compositing"

Another example of such a system in a specific domain is the polynomial greedy algorithm for OLAP materialized view selection. “Materialized views” of OLAP “data cubes” are cached computation results, and while they can be computed from the underlying data cube, they can be computed much more efficiently from other materialized views of the same cube. As with other database query results, though, there are a number of different possible cached values that could be taken advantage of, if only they existed.

`make` of course is such a dependency-driven system operating at large scale — computations that take seconds to hours, specifically source-code compilation. Various successor systems to `make`, like DSEE and Vesta, have taken advantage of the side-effect-free nature of compilation that `make` is built on to get some of the other benefits I’m touting for this approach: transparent network distribution and parallelism. `distcc` is probably currently the most popular such system. `ccache` is a C-compiler wrapper that uses a different management policy for compilation output than `make` does, one that can use more space and be more efficient.

### There’s a range of choices where the imperative nature of input ends. ###

Maybe, in such a system, my text editor would still interpret my key commands to imperatively update the state of a persistent buffer object and a persistent file. But all of the redisplay logic — going from the buffer contents, cursor position, scrollbar position, and terminal preferences, to the pixels on the screen, including font rendering — would be mediated by the caching subsystem; so it could be transparently distributed in a fault-tolerant manner, and it would automatically update when any of its dependencies changed.

Or maybe, as in [rumor-oriented programming][rop], even the current file contents could be merely a cached result of running a computation on the previous file contents and an editing command. However, making that practical probably requires sharing structure among different versions of the file contents, thus using something like a “rope” made from immutable nodes, and so might carry a significant performance penalty.

[rop]: http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html

### Unified caching dramatically simplifies eventual consistency. ###

As in rumor-oriented programming, another benefit of unified caching is that it eliminates most of the complexity of eventually-consistent database replication. The difficulty with eventually-consistent replication is that almost any state transition is possible: if your calendaring system is built on top of an eventually-consistent store, you can ensure that you don’t book a new meeting in the same time slot as an existing meeting, but you might end up with two meetings booked into the same time slot anyway, because you booked them on different computers that later synchronized with each other. This kind of unexpected transition tends to introduce subtle bugs into application code that expects its underlying data store to change, if at all, only in sensible ways.

### There are many different possible strategies for managing the cache. ###

The simplest strategy is probably the `make` approach: recompute chunks of cached data from scratch when any of their dependencies change, but only when they’re being demanded (“lazy”). Or you could use a “push” or “eager” version of that approach, where the recomputation happens as soon as the dependency changes, so that the cached result is instantly available when it’s demanded, but the computation may be wasted. You could save some previous cached values in case inputs change back to a previous state.

You could use a delta-application approach instead of recomputing from scratch, where the user code has access to the old input state, the difference between the old and new input state, and the old cached output, and can then use those three things to compute the new output. You could support multiple possible ways to compute the same requested result, so that in cases like database queries or OLAP materialized views, the system can choose the computational plan that requires the least computation, given the currently available cached results. You could compare the results of a recomputation on new inputs against the previous results in order to see whether recomputation of downstream results can be skipped.

Any such strategy should probably strive to evict from the cache the things with the lowest predicted future time-saved-to-space-used ratio. That is, any cache entry can be rated according to how much CPU time it’s expected to save in the future (the benefit) and how much space it has to use for how long in order to do so (the cost); the cache entries with the lowest benefit-to-cost ratio should be the first to be replaced, and only by new cache entries with a higher benefit-to-cost ratio. In effect, memory is being auctioned off to the different cache entries. This is complicated by the fact that, as mentioned earlier, the benefit of any given cache entry is contingent on the existence or nonexistence of other cache entries. (This applies to disk usage as well as RAM.)
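Here is a sketch of that scoring rule, with made-up entries; a real policy would also model how long each entry occupies its space, and the interdependence of benefits noted above.

    def eviction_order(entries):
        """entries: (name, expected hits/s, seconds saved per hit, bytes)."""
        scored = [(hits * saved / size, name)
                  for name, hits, saved, size in entries]
        return [name for score, name in sorted(scored)]  # evict lowest first

    entries = [
        ("glyph 'e' at 12px",  5000, 0.0001,       200),  # tiny and hot
        ("parse tree",            5, 0.1,    2_000_000),
        ("stale object file",  0.01, 2.0,    8_000_000),  # big and cold
    ]
    print(eviction_order(entries))
    # ['stale object file', 'parse tree', "glyph 'e' at 12px"]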
A tricky aspect here is that there is no limit to the amount of computation that could be applied, in principle, to reducing cache miss rates. But at some point, the cache manager starts consuming more compute time than it saves. The same is true of code complexity.

### The system should cache things at as fine a granularity as practical. ###

Practically speaking, a system like this can only be reasonably efficient down to some granularity — the granularity at which the overhead of dependency tracking and cache eviction decisions becomes significant compared to the time spent actually computing something from the data in application code. This is crucial, because caches that are invalidated at too large a granularity suffer from “false sharing”, where spurious recomputations happen because something changed in an input, but it’s something that a particular computation isn’t actually using.

Although this critical granularity depends on the computational intensity of the task, my intuition is that the right granularity is typically closer to the database record, line of text, or HTML element than to the text file or video frame. Coincidentally, this is close to the cache line size of modern CPUs. This suggests that an interface considerably more efficient than the standard filesystem interface — a library interface, which doesn’t require a context switch, rather than a system-call interface — is needed to reap the full benefits.

Acknowledgments
---------------

In addition to the people thanked in my 2005 post, I’d like to thank Michael P. Frank and Shae Erisson for helping develop these ideas; some of the above is theirs.