There’s an old saying that there are only two hard problems in computer
science: naming, cache invalidation, and off-by-one errors.  Caching of
computations is basically a classic time-space tradeoff, but caching is
also used for other purposes: for fault-tolerance, to trade off network
bandwidth against storage, for privacy, and to improve scalability by
minimizing the work done at a potential hotspot in a distributed system,
among other things.  

I propose that delegating almost all caching of computations to a
dedicated computation-caching subsystem would have substantial
benefits for computer system design, enabling transparent parallelism,
network transparency, and fault tolerance, while improving performance
and substantially simplifying application code.


Examples of caching of computation
----------------------------------

Current computer systems use many layers of caching, implemented in
many different ways, many of them just to cache computations.

### MVC ###

One broad generalization is that views in MVC applications contain
cached representations of the state of the models.  That’s why they
get notified when the model changes: so they can update their
representation.  (Rails programmers are probably confused here,
because what Rails calls “MVC” doesn’t work this way.)

### In a terminal text editor ###

For example, I’m editing this text file in the vim editor, running
inside a terminal emulator displaying in X-Windows.

#### Buffer contents are a cached computation on command history ####

As I edit, the vim editor executes a computation on the file’s initial
contents and my sequence of commands to get the buffer’s new contents,
and stores the result of this computation in the file every time I
type :w.  You could think of the file’s contents as being a cached
result of executing the entire editing history.
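
In sketch form, that computation looks like this (Python, with an
invented command encoding): the buffer is a pure fold over the initial
contents and the command history, which is exactly the kind of
computation a cache can memoize.

    from functools import reduce

    def apply_command(buffer, command):
        # (op, pos, text) is a hypothetical command encoding.
        op, pos, text = command
        if op == "insert":
            return buffer[:pos] + text + buffer[pos:]
        if op == "delete":
            return buffer[:pos] + buffer[pos + len(text):]
        raise ValueError(op)

    def buffer_contents(initial, history):
        # The buffer is a pure function of the initial file contents
        # and the command history; :w caches the result in the file.
        return reduce(apply_command, history, initial)

    history = [("insert", 0, "hello"), ("insert", 5, " world")]
    assert buffer_contents("", history) == "hello world"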

#### Text screen contents are a cached computation on buffer contents ####

Also, my terminal emulator’s memory contains a cached representation of
the visible part of the buffer’s contents.  When the buffer’s contents
change, this cache becomes invalid and needs to be updated; vim, using
curses, carefully calculates a minimal sequence of edit operations to be
applied to its local image of my screen contents to make them match
the buffer, and encodes that sequence as a string and sends it to the
terminal
emulator.  If the two replicas of this cache get out of sync, for
example because some other program wrote some text onto my terminal, the
desynchronized state can last quite some time; vim provides an explicit
command, ^L, to request resynchronization by resending the entire screen
contents instead of an incremental update.
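
In sketch form, the synchronization logic looks something like this
(Python, at whole-line granularity rather than the character-level
diffs curses actually computes):

    def redisplay(replica, desired):
        # Compare the local image of the screen against the desired
        # contents and return only the rows that need to be resent.
        # Assumes both screens have the same height.
        updates = []
        for row, new in enumerate(desired):
            if replica[row] != new:
                updates.append((row, new))
                replica[row] = new  # keep the local cache in sync
        return updates

    screen = ["hello", "world"]
    print(redisplay(screen, ["hello", "there"]))  # [(1, 'there')]
    print(redisplay(screen, ["hello", "there"]))  # [] -- replica already valid

Pressing ^L corresponds to discarding the replica and resending every
row.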

#### Framebuffer contents are a cached computation on text and font ####

The framebuffer *also* contains a cached representation of those same
screen contents, and the same kind of crazy synchronization dance ensues
between the terminal emulator and the X-Windows server, which manages
the framebuffer contents.  However, the X server is a lot more competent
at preventing desynchronization, because it speaks a more featureful
protocol to its applications, such as the terminal emulator, than the
terminal emulator does to its applications, such as vim.

#### X-Windows apps cache information about the X server ####

At a more detailed level of the X11 protocol, when the terminal
emulator app starts up, it gets information about the X server’s
screen configuration (available visuals, root window ID, and so on)
which it then caches for the rest of its run, using for all future
drawing commands.  There is in general no way to notify the app that
these cached values have become invalid and need to be refetched.
This is one reason that it’s relatively straightforward to detach an
ANSI-character-terminal app from one ANSI terminal emulator and
reattach it to another one, but essentially impossible to do the same
thing for an X-Windows application.

#### Font bitmaps are cached renderings of the font ####

When the terminal emulator sends update commands to the X server, it
requests the drawing of strings of characters in a specified font at
specified locations.  The process of rendering a character from a font
can be sufficiently time-consuming to be a performance bottleneck,
especially when a screen frame can have tens of thousands of them at
once; so the X server has a special-purpose cache for rendered glyphs
from fonts.  But wait!  The operations that use that special-purpose
cache don’t support alpha-blending.  And the terminal emulator I’m using
at the moment uses alpha-blending to get antialiased rendering of the
fonts it uses, so it uses the X Render Extension, driven by a library
called Xft, to manage its very own cache of font glyphs in the X server.
(I don’t know if this cache is shared between X applications.)
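
In outline, any such glyph cache is just memoization of the rasterizer
on a key like (font, size, character); here’s a sketch, with a
stand-in for the real renderer:

    glyph_cache = {}

    def render_glyph_slowly(font, size, char):
        # Stand-in for the expensive rasterization step.
        return "<bitmap of %r in %s at %dpx>" % (char, font, size)

    def get_glyph(font, size, char):
        key = (font, size, char)
        if key not in glyph_cache:
            glyph_cache[key] = render_glyph_slowly(font, size, char)
        return glyph_cache[key]

    get_glyph("Fixed", 13, "a")  # rendered once
    get_glyph("Fixed", 13, "a")  # served from the cache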

#### Compiled programs are cached results of a computation on source code ####

Also, the code for the terminal emulator, the X server, Xft, and vim is
all written in C.  Compiling this amount of C code into machine code, so
you can execute it, used to be time-consuming in the 1980s, and you can
maybe still amp up your compiler optimizations to the point where it’s
time-consuming today, so the executable code for those programs is
cached.  In fact, this particular caching system is so complicated that
it has a world-wide organization consisting of thousands of people
largely to manage it, called “Debian”.  It includes components such as
`make`, object and executable file formats, binary packages for different
architectures, the linker (as distinct from the compiler), the dynamic
linker, the configuration for the dynamic linker, and sometimes, the
“derived-object wink-in” feature of ClearCase, and the closely analogous
features of the Vesta version control system.

(Debian actually spends a lot of time on things other than compilation
and managing compilation output.  Its members also archive and version source
code, track and fix bugs, test software, establish interoperability
standards, document undocumented software, and so on.  But having to
struggle with the build infrastructure costs them a lot of time that
could be spent doing valuable work like the above.)

#### X11 apps also cache Compose key bindings ####

Another thing the terminal emulator caches is my [Compose key
bindings][xcompose].  The X input method linked into it reads the
`.XCompose` file at startup, apparently in order to build a table of
compose-key sequences, and then never rereads it.  This means that I
have to kill and restart my terminals to find out if a new binding I’ve
added to the file works properly, or start a different terminal process,
such as xterm; and I can’t take advantage of new bindings until I
restart the terminal.  On the other hand, I can also test the new
bindings without worrying that I’ll break my keyboard setup.

[xcompose]: https://github.com/kragen/xcompose

#### `gnome-terminal` caches GConf settings ####

The terminal emulator application, `gnome-terminal`, chooses the font
based on a “profile” stored in GConf, which is managed by `gconfd`,
but naturally cached within the terminal emulator process in a
`TerminalProfile` object.  (You can have several different profiles
for different kinds of terminals.)  This means that the font-glyph
cache mentioned earlier is a cache of, among other things,
configuration information stored in GConf.  GConf supports
asynchronous event notification, so that if you change, say,
`/apps/gnome-terminal/profiles/Default/font` using `gconf-editor`, a
callback fires to update the `TerminalProfile` in each of your open
terminals using the Default profile, and they will immediately redraw
with the new font.  However, this cache invalidation mechanism is
limited; as explained in the comments:

     * It's somewhat tricky to manage the profiles/ dir since we need to track
     * the list of profiles, but gconf doesn't have a concept of notifying that
     * a directory has appeared or disappeared.

#### `gnome-terminal` shares cache by running many windows in a process ####

All of the computation and I/O that `gnome-terminal` does at startup,
including reading profiles from GConf, reading my Compose key
bindings, rendering fonts to images, and many other things, adds up to
about two seconds of CPU time on my netbook.  In order to cache all of
this for future windows, `gnome-terminal` (like `dtterm`, the
forgotten CDE terminal program) runs all terminal windows from a
single process; at startup it uses D-BUS to check to see if there’s an
existing terminal emulator process running to which it can pass its
command-line arguments, which takes less than 200ms.  So D-BUS is here
being used as part of a system for searching for existing cached
computation results, in the form of a running process.

#### And that’s only the beginning ####

There are several other layers here that I’m not getting into at all:
GDK, GTK+, Pango, Cairo, and D-BUS, among others.  Each of these has
its own state, most of which is cached results of previous
computations that could be recomputed if necessary — either in
principle or in fact.


Consequences of caching of computation
--------------------------------------

As illustrated above, cached computation results are pervasive in
software at every scale, from single bytes to multi-gigabyte software
distributions, and this caching is essential to make software
acceptably efficient.  But the caches are managed by a mélange of
different mechanisms; deficiencies in these mechanisms introduce a
huge number of bugs and inflexibilities into the software; and the
mechanisms themselves are a huge amount of code, with its attendant
maintenance cost.

Something that may not be obvious is that this diversity of caching
mechanisms inevitably leads to irrational caching tradeoffs and poor
systemwide performance.  This is because these time-space tradeoffs
are being made in isolation from one another, and the time savings
being bought by a megabyte of RAM varies wildly from one part of the
system to another.  Consequently, we have some parts of the system
that are unnecessarily slow because they don’t cache sufficiently, and
other parts of the system that are unnecessarily memory-hogging
because they cache too much — causing everything else to run more
slowly.

Worse, the value of each cached item — the amount of computation it
saves, in the end — can’t be evaluated in isolation.  It depends on
how often that item is needed, which is much more common if the stuff
that is computed from it isn’t cached effectively; and it depends on
how much computation is needed to compute it, which depends on how
effectively the things it’s computed from are cached.  So picking the
ideal set of computations to cache, even given perfect knowledge of
future demands, is a global optimization problem which can’t be solved
by composing local solutions to subproblems.
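
Here’s a toy model of that non-composability; the dependency graph and
costs are invented, but they show how the value of caching one item
depends on what else is cached:

    deps = {"glyphs": ["font"], "screen": ["glyphs", "buffer"]}
    compute_cost = {"font": 0, "buffer": 0, "glyphs": 50, "screen": 5}

    def cost_to_produce(node, cached):
        # Cost of producing `node` given the set of cached items.
        if node in cached:
            return 0
        return compute_cost[node] + sum(
            cost_to_produce(d, cached) for d in deps.get(node, []))

    # In isolation, caching "screen" seems to save only 5 units; but if
    # "glyphs" is uncached, every miss on "screen" actually costs 55.
    print(cost_to_produce("screen", cached=set()))       # 55
    print(cost_to_produce("screen", cached={"glyphs"}))  # 5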


The alternative: a unified cache system
---------------------------------------

So suppose that, instead of a mélange of different caching mechanisms,
we have a single systemwide caching mechanism.  All stored data is
classified either as “source data”, which came from outside the
computer system and thus cannot be regenerated if it is erased, or as
“derived data” (“cached computation”), which is anything that could be
regenerated by deterministic computation from the source data.

The systemwide caching mechanism knows every cached computational
result in the system, what the inputs to the computation were (both
source inputs and derived inputs), and how long it took.  And it can
use an intelligent systemwide policy to figure out which cache results
are likely to save the most computation time in the future, taking
into account how long they took to compute and what other
previously-cached results would need to be recomputed.  Like `make`,
it can also automatically invalidate cached results when their inputs
change, whether those results are compiled object code, font glyphs,
or screen images.
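
In miniature, such a mechanism might look like the following sketch
(all the names are invented): each derived result records the function
that produced it, the inputs it consumed, and how long it took, so the
system can both invalidate stale results and inform its caching
policy.

    import time

    class CacheSystem:
        def __init__(self):
            self.source = {}   # source data: name -> value
            self.results = {}  # (function, input names) -> (value, elapsed, inputs seen)

        def set_source(self, name, value):
            self.source[name] = value

        def demand(self, fn, *names):
            inputs = tuple(self.source[n] for n in names)
            entry = self.results.get((fn.__name__, names))
            if entry is not None and entry[2] == inputs:
                return entry[0]                    # inputs unchanged: cache hit
            start = time.perf_counter()
            value = fn(*inputs)                    # fn must be deterministic
            elapsed = time.perf_counter() - start  # recorded for the caching policy
            self.results[(fn.__name__, names)] = (value, elapsed, inputs)
            return value

    def compile_c(src):
        return "compiled(%s)" % src

    cache = CacheSystem()
    cache.set_source("hello.c", "int main() { return 0; }")
    cache.demand(compile_c, "hello.c")  # computed
    cache.demand(compile_c, "hello.c")  # cache hit
    cache.set_source("hello.c", "int main() { return 1; }")
    cache.demand(compile_c, "hello.c")  # input changed: recomputed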

### The caching system interface replaces the filesystem interface ###

To work reliably, such a caching system must mediate all access to
persistently stored data on the computer system, or at least the part
of the system it controls, whether it’s cached computation or received
through I/O.  All of the “application code” in the system is in effect
side-effect-free, or pure functional, even if it uses side effects in
its internal implementation.  It becomes a filesystem, like Vesta, but
more than a filesystem, since it’s also responsible for I/O.  And by
virtue of mediating all persistent data access, like MapReduce, it can
achieve transparent parallelism, transparent network distribution, and
transparent fault-tolerance, by running the computations whose results
are to be cached on whatever computer it wants, and possibly on
multiple computers in case one of them crashes or runs slowly.

(The caching system looks a lot like a software transactional memory
system.)

### But the first step is to boil the ocean. ###

The biggest disadvantage is that it requires you to rewrite all of
your software with a different fundamental design to get these
benefits.  (There are other disadvantages as well: worst-case
performance would be under the control of the caching system, rather
than user code.  Whether such a system’s policies could be tuned for
competitive performance across a wide range of domains remains to be
seen.  My intuition says yes, but we’d need real systems to prove it.)

However, lots of existing program code could still be used inside such
a system, just as you can use existing compilers in Vesta.

### There’s a mountain of related work. ###

I described a hypothetical [concrete example of such a system in a
specific domain][compositing] in 2005.  That post contains references
to a lot of related ideas, which I won’t repeat here.

[compositing]: http://lists.canonical.org/pipermail/kragen-tol/2005-November/000810.html "Make for URLs, or dependency-directed compositing"

Another example of such a system in a specific domain is the
polynomial greedy algorithm for OLAP materialized view selection.
“Materialized views” of OLAP “data cubes” are cached computation
results, and while they can be computed from the underlying data cube,
they can be computed much more efficiently from other materialized
views of the same cube.  As with other database query results, though,
there are a number of different possible cached values that a query
could take advantage of if they existed, so the system must choose
which views to materialize.

`make`, of course, is such a dependency-driven system operating at a
large scale, on computations that take seconds to hours, specifically
source-code compilation.  Various successor systems to `make`, like
DSEE and Vesta, have taken advantage of the side-effect-free nature of
compilation that `make` is built on to get some of the other benefits
I’m touting for this approach: transparent network distribution and
parallelism. `distcc` is probably the currently most popular such
system.  `ccache` is a C-compiler wrapper that uses a different
management policy for compilation output than `make` does, one that
can use more space and be more efficient.

### There’s a range of choices for where the imperative nature of input ends. ###

Maybe, in such a system, my text editor would still interpret my key
commands to imperatively update the state of a persistent buffer
object and a persistent file.  But all of the redisplay logic — going
from the buffer contents, cursor position, scrollbar position, and
terminal preferences, to the pixels on the screen, including font
rendering — would be mediated by the caching subsystem; so it could be
transparently distributed in a fault-tolerant manner, and it would
automatically update when any of the dependencies changed.

Or maybe, as in [rumor-oriented programming][rop], even the current
file contents could be merely a cached result of running a computation
on the previous file contents and an editing command.  However, making
that practical probably requires sharing structure among different
versions of the file contents, thus using something like a “rope” made
from immutable nodes, and so might have a significant performance
penalty.

[rop]: http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html
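
Here’s a sketch of that structure sharing, using a two-level rope of
immutable nodes (invented for illustration): an edit rebuilds only the
spine from the root to the edit point, so each new version shares its
unchanged subtrees with the old one, and caching many historical
versions is cheap.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Leaf:
        text: str
        def __len__(self):
            return len(self.text)

    @dataclass(frozen=True)
    class Node:
        left: object
        right: object
        def __len__(self):
            return len(self.left) + len(self.right)

    def insert(rope, pos, text):
        # Rebuild only the path from the root to the edit point;
        # everything else is shared with the previous version.
        if isinstance(rope, Leaf):
            return Node(Leaf(rope.text[:pos] + text), Leaf(rope.text[pos:]))
        if pos < len(rope.left):
            return Node(insert(rope.left, pos, text), rope.right)
        return Node(rope.left, insert(rope.right, pos - len(rope.left), text))

    v1 = Node(Leaf("hello "), Leaf("world"))
    v2 = insert(v1, 6, "cruel ")
    assert v2.left is v1.left  # the unchanged half is shared between versions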

### Unified caching dramatically simplifies eventual consistency. ###

As in rumor-oriented programming, another benefit of unified caching
is that it eliminates most of the complexity of eventually-consistent
database replication.  The difficulty with eventually-consistent
replication is that almost any state transition is possible; if your
calendaring system is built on top of an eventually-consistent store,
you can refuse to book a new meeting in a time slot that already holds
an existing meeting, but you might end up with two meetings booked
into the same time slot anyway, because you booked them on different
computers that later synchronized with each other.  This kind of
unexpected transition tends to introduce subtle bugs into application
code that expects its underlying data store to change, if at all, only
in sensible ways.  With unified caching, by contrast, replicas need
only converge on the source data; everything derived from it is
regenerated deterministically by the cache, so application invariants
are recomputed rather than patched up after the fact.

### There are many different possible strategies for managing the cache. ###

The simplest strategy is probably the `make` approach: recompute
chunks of cached data from scratch when any of their dependencies
change, but only when they’re being demanded (“lazy”).
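
A sketch of that lazy strategy, with a single version counter standing
in for real dependency tracking:

    versions = {"source": 1}  # bumped whenever the source data changes
    cache = {}                # target -> (source version used, value)

    def demand(target, build):
        entry = cache.get(target)
        if entry is not None and entry[0] == versions["source"]:
            return entry[1]                  # still up to date: no recomputation
        value = build()                      # recompute only now, on demand
        cache[target] = (versions["source"], value)
        return value

    demand("output", lambda: "built from v1")
    versions["source"] = 2                     # dependency changed; nothing happens yet
    demand("output", lambda: "built from v2")  # recomputed lazily, on demand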

Or you could use a “push” or “eager” version of that approach, where
the recomputation happens as soon as the dependency changes, so that
the cached result is instantly available when it’s demanded, but the
computation may be wasted.

You could save some previous cached values in case inputs change back
to a previous state.

You could use a delta-application approach instead of recomputing from
scratch, where the user code has access to the old input state, the
difference between the old and new input state, and the old cached
output, and can then use those three things to compute the new output.
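
For example (a sketch, with a word count as the cached output and an
invented one-line-changed delta):

    def count_words(lines):
        # The from-scratch computation: O(size of the whole input).
        return sum(len(line.split()) for line in lines)

    def apply_delta(old_lines, delta, old_count):
        # delta = (row, new_line): adjust the cached output by
        # re-counting only the changed line.
        row, new_line = delta
        return old_count - len(old_lines[row].split()) + len(new_line.split())

    lines = ["one two", "three four five"]
    count = count_words(lines)                     # 5
    count = apply_delta(lines, (0, "one"), count)  # 4, without rescanning line 1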

You could support multiple possible ways to compute the same requested
result, so that in cases like database queries or OLAP materialized
views, the system can choose the computational plan that requires the
least computation, given the currently-available cached results.

You could compare results of a recomputation with new inputs against
the previous results in order to see whether recomputation of
downstream results can be skipped.
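
A sketch of that comparison step: if a recomputed stage turns out to
be identical to its previous cached value, its dependents keep their
cached results.

    def update(cache, name, new_value, downstream):
        # After recomputing a stage, compare against the previous
        # cached value; if it is identical, downstream stays valid.
        if cache.get(name) == new_value:
            return False               # unchanged: downstream recomputation skipped
        cache[name] = new_value
        for dep in downstream.get(name, []):
            cache.pop(dep, None)       # really changed: invalidate dependents
        return True

    cache = {"tokens": ["a", "b"], "parse": "tree"}
    # A whitespace-only edit retokenizes to the same tokens, so the
    # cached parse survives.
    changed = update(cache, "tokens", ["a", "b"], {"tokens": ["parse"]})
    assert not changed and "parse" in cache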

Any such strategy should probably strive to evict from the cache the
things with the lowest predicted future time-saved-to-space-used
ratio.  That is, any cache entry can be rated according to how much
CPU time it’s expected to save in the future (the benefit) and how
much space it has to use for how long in order to do so (the cost);
the cache entries with the lowest benefit-to-cost ratio should be the
first to be replaced, and only by new entries with a higher
benefit-to-cost ratio.  In effect, memory is being auctioned off to
the different cache entries.  This is complicated by the fact that, as
mentioned earlier, the benefit of any given cache entry is contingent
on the existence or nonexistence of other cache entries.

(This applies to disk usage as well as RAM.)
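
As a sketch of such a scoring rule (the prediction inputs, expected
hits and recomputation time, are assumed to be tracked by the cache
system; here they’re just given):

    def eviction_order(entries):
        # entries: {name: (expected future hits, recompute seconds, size in bytes)}
        # Score each entry by predicted seconds saved per byte of
        # storage; return names sorted worst-value-first, i.e. the
        # order in which they should be evicted.
        def score(item):
            hits, recompute_s, size = item[1]
            return hits * recompute_s / size
        return [name for name, _ in sorted(entries.items(), key=score)]

    entries = {
        "glyph:a": (10000, 0.001, 4096),        # hot, cheap to keep
        "old screen image": (1, 0.5, 8000000),  # cold, huge
    }
    print(eviction_order(entries))  # ['old screen image', 'glyph:a']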

A tricky aspect here is that there is no limit to the amount of
computation that could be applied, in principle, to reducing cache
miss rates.  However, at some point, the cache manager starts
consuming more compute time than it’s avoiding.  The same is true of
code complexity.

### The system should cache things at as fine a granularity as practical. ###

Practically speaking, a system like this can only be reasonably
efficient down to some granularity — the granularity where the
overhead of dependency tracking and cache eviction decisions becomes
significant compared to the time spent actually computing something
from the data in application code.  This is crucial, because caches
that are invalidated at a too-large granularity suffer from “false
sharing”, where spurious recomputations happen because something
changed in an input, but it’s something that a particular computation
isn’t actually using.

Although this critical granularity depends on the computational
intensity of the task, my intuition is that the right granularity for
this is typically closer to the database record, line of text, or HTML
element, rather than the text file or video frame.  Coincidentally,
this is close to the cache line size of modern CPUs.

This suggests that an interface considerably more efficient than the
standard filesystem interface — a library interface, which doesn’t
require a context switch, rather than a system-call interface — is
needed to reap the full benefits.

Acknowledgments
---------------

In addition to the people thanked in my 2005 post, I’d like to thank
Michael P. Frank and Shae Erisson for helping develop these ideas;
some of the above is theirs.