Database systems normally use one (or both) of two different schemes
to record transaction changes for later rollback or commit: "logical
logging", where the operation that made the change is recorded, and
"physical logging", where the before and after states of the affected
part of the table are
recorded.  For example, inserting a row into a table might result in a
single logical-log entry with the inserted values, or it might result
in one or more physical-log entries with the new contents of the table
page where it was inserted, and possibly other table pages that it
pushed other records into.
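
As a rough sketch (not the format of any particular database), the
same row insert might be logged either way like this:

    # Hypothetical sketch of the two logging styles; real databases
    # use denser binary formats, but the distinction is the same.

    # Logical log: record the operation itself.
    logical_entry = {"op": "insert", "table": "people",
                     "values": {"id": 42, "name": "Alice"}}

    # Physical log: record before and after images of the affected page.
    physical_entry = {"page": 7,
                      "before": b"old page bytes",
                      "after": b"old page bytes plus the new row"}

    # Rolling back the physical entry means writing the "before" image
    # back; rolling back the logical entry means running the inverse
    # operation (here, deleting the row with id 42).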

This distinction is not limited to database transaction recovery logs;
it applies to any kind of persistence at all.  For example, you can
save the contents of a text file either in the form of each of the
physical lines you want to see on the screen, in the order that you
want to see them, or in the form of a sequence of insertions and
deletions (and maybe moves) to get from an empty file to the current
contents of the text file.  That's roughly how darcs works.
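
Here's a toy illustration of the second form, in Python; this is not
darcs's actual patch format, just the general idea of reconstructing
the current contents by replaying edits:

    # Toy sketch: rebuild a text file from a history of edits rather
    # than storing its current lines.  Not darcs's real patch format.
    history = [
        ("insert", 0, "first line"),
        ("insert", 1, "second line"),
        ("insert", 1, "a line squeezed in between"),
        ("delete", 0),
    ]

    lines = []
    for op in history:
        if op[0] == "insert":
            _, index, text = op
            lines.insert(index, text)
        else:                                # ("delete", index)
            lines.pop(op[1])

    print("\n".join(lines))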

In the case of persistence, the tradeoff is fairly simple: storing
the current state of things is simpler and uses an amount of space
that does not grow over time, but it potentially uses more I/O
bandwidth for small changes (you have to rewrite the whole file to
insert a line at the beginning), and it doesn't allow you to go back
in time to recover from problems or to learn more about the history
of the file.

This distinction also applies to network protocols.  If there's some
piece of state that's maintained in one location, and other things are
accessing that state across the network to change it, you can either
transmit the new state, or the desired change to the state.  In this
case, the tradeoffs are somewhat more complex.

"High REST" comes down firmly on the side of transmitting the new
state, in an HTTP PUT or DELETE request.  The reason given is that, if
you never get a response (say, because you get out of range of the
Wi-Fi access point), it might be that your request was lost, or it
might be that the server handled it.  So it's nice to be able to
retransmit the request automatically, in order to deal with failures
like this, and you can do that with new-state-transmission requests,
because they are idempotent.
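
For example, a client can blindly retry a PUT after a timeout, which
would not be safe with a non-idempotent request.  A sketch using the
Python "requests" library and a made-up URL:

    # Sketch: retrying an idempotent PUT after a failure.  The URL and
    # body are invented; any HTTP client library would do.
    import requests

    def put_with_retries(url, body, attempts=3):
        for attempt in range(attempts):
            try:
                # Sending the same complete new state twice is harmless:
                # the second PUT just overwrites the state with itself.
                return requests.put(url, data=body, timeout=5)
            except requests.exceptions.RequestException:
                if attempt == attempts - 1:
                    raise

    put_with_retries("http://example.org/notes/7", "the whole new note")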

And, if you're the only person who might possibly be changing the
state of the resource, or anything else that the resource's state
might semantically depend on, that's all the concurrency control you
need, and that's really the right thing.

(In some non-HTTP contexts, there's another potential advantage to
this approach: it's possible for intermediaries to drop some update
messages if they know they have been superseded by other updates for
the same resource.  Linux's "epoll" family of system calls, for
example, uses a variant of this approach to ensure that it never has
to lose a notification.)
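
On Linux you can watch this coalescing through Python's select.epoll
wrapper: two writes to the same pipe yield one readiness
notification, not two.

    # Linux-only sketch: epoll keeps one readiness record per file
    # descriptor, so repeated events for the same fd coalesce instead
    # of piling up in a queue that could overflow.
    import os, select

    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)

    os.write(w, b"first update")
    os.write(w, b"second update")    # fd is already readable; no new event

    print(ep.poll(timeout=0))        # one (fd, EPOLLIN) entry, not two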

But if other things in the universe might change the state of the
resource --- for example, a timeout, or another person, or yourself on
another machine with a working network connection --- you need some
way to ensure that the change still makes sense at the time the server
applies it.

Such concurrency-control mechanisms essentially fall into the two
categories of optimistic and pessimistic concurrency control.
Optimistic concurrency control rejects your change request if its
preconditions are not met, probably because something changed before
the change request reached the server; pessimistic concurrency control
lets you "acquire locks" to ensure that the preconditions stay true
for some period of time, so that your change request is guaranteed to
be sensible.  Pure pessimistic concurrency control is
rarely sensible when the thing "holding the locks" might fail
independently of the thing the locks limit access to, which is why
it's usually backed up with some form of optimistic concurrency
control in distributed systems, for example, in the form of leases
that limit the lifetime of locks.  Java uses pure pessimistic
concurrency control within the VM, which is why Thread.stop(Throwable)
is now deprecated.
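
Here is a minimal sketch of the optimistic side, using a version
number as the precondition; the in-memory "store" stands in for
whatever the server actually keeps its state in:

    # Sketch of optimistic concurrency control: the change request
    # carries the version it was based on, and is rejected if the
    # resource has moved on in the meantime.
    store = {"version": 1, "text": "original"}

    def conditional_update(expected_version, new_text):
        if store["version"] != expected_version:
            return False             # precondition failed; client retries
        store["version"] += 1
        store["text"] = new_text
        return True

    assert conditional_update(1, "edited")            # succeeds
    assert not conditional_update(1, "a stale edit")  # based on old state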

There is really very little machinery for this kind of concurrency
control in HTTP.  WebDAV [RFC 2518] adds locks with timeouts to HTTP,
and it is not a coincidence that WebDAV is one of the few widespread
uses of HTTP PUT.
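
A WebDAV lock request looks roughly like the following sketch (the
URL is made up, and a real client would also have to keep the
Lock-Token from the response and present it with its later PUTs):

    # Sketch of an RFC 2518 LOCK request with a timeout, sent with the
    # Python "requests" library; error handling and the If: header on
    # subsequent requests are omitted.
    import requests

    lockinfo = """<?xml version="1.0" encoding="utf-8"?>
    <D:lockinfo xmlns:D="DAV:">
      <D:lockscope><D:exclusive/></D:lockscope>
      <D:locktype><D:write/></D:locktype>
    </D:lockinfo>"""

    response = requests.request(
        "LOCK", "http://example.org/docs/report.txt",
        headers={"Timeout": "Second-600", "Depth": "0"},
        data=lockinfo)
    print(response.status_code, response.headers.get("Lock-Token"))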

The more common practice on the Web, sometimes known as "Low REST", is
to send the logical change across the network instead of the new state
of the resource --- in a POST request.  This has the drawback that the
request cannot safely be retransmitted, since it may not be
idempotent.

However, it has the advantage that all the concurrency problems are
local to the server --- they don't appear in the network protocol at
all.  The server can handle the update requests in a purely serial
fashion (this used to be common in the days of mSQL, and is now coming
back into style with FastCGI-based sites, many of them in Rails,
although it's less relevant now), rely on PostgreSQL transactions, use
Java's threads and locks, or whatever; either way, you don't have to
deal with a heterogeneous distributed concurrency problem with high
latency and partial failure.  Usually this is a win, especially for
small web sites.  (If you need a server farm to run your web site, you
may still have to deal with some of these issues.)
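
For instance, the crudest version of keeping the problem server-local
is a single lock around the update code, something like this
hypothetical POST handler:

    # Sketch of the simplest server-side approach: serialize all
    # updates behind one lock.  The handler and storage are made up;
    # a real site might lean on database transactions instead.
    import threading

    comments = []                    # the state behind some resource
    update_lock = threading.Lock()

    def handle_post_comment(author, text):
        with update_lock:            # updates are applied one at a time
            comments.append({"author": author, "text": text})
        return "201 Created"

    handle_post_comment("alice", "first!")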

By the same token, it puts all the code to construct new versions of
resources on the server side.  Before about 2002, this was a big
advantage, since if you wanted to run something on the client side,
you had to choose between JavaScript, a language that didn't work, and
Java, a language that took a lot of work to do simple things in.  On
the server side, on the other hand, you could run anything you liked.
Most people ran Perl, but a lot of things were in Python, Tcl, Visual
Basic, PHP/FI (as PHP was called then), and the like.

Most resources on web sites can be modified by more than one person,
and many of them frequently are modified by somebody else.  On the other
hand, network failure is still fairly rare, and in the early days of
the web, before SLIP and PPP became widespread (let alone Wi-Fi), it
was very rare indeed.  So it's rare that using PUT instead of POST
buys you anything.

The information transmitted in a POST is a logical-log-like request to
perform some human-comprehensible operation: to add a comment to a
blog post, to set your rating for a user script, to update a certain
section of a Wikipedia page from a previous state to a new state.
(Notice the hybrid nature of this last one.)  Because these operations
often remain meaningful even if other things change before they are
carried out, they offer more scope for concurrency and conflict
resolution than a physical-log-style PUT.  More concretely, if two
people post comments on this blog post, while I add a category to it
and Ryan adds another category to it, those operations can all be sent
to the server without any of us seeing the other people's changes
ahead of time, and everything will just work.
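
A sketch of why those particular operations compose so easily:
applied to a post represented as a little dictionary, they touch
different parts of it, so it doesn't matter what order they arrive in.

    # Sketch: logical operations like "add a comment" and "add a
    # category" still make sense whatever order the server sees them in.
    post = {"body": "the post text", "comments": [], "categories": set()}

    def add_comment(post, text):
        post["comments"].append(text)

    def add_category(post, name):
        post["categories"].add(name)

    # Four clients send changes without seeing one another's; every
    # arrival order leaves the post with both comments and both
    # categories.
    add_comment(post, "nice article")
    add_category(post, "rest")
    add_comment(post, "I disagree")
    add_category(post, "concurrency")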

A few years ago, I wrote about a speculative architecture for
distributed programs that I called "rumor-oriented programming".  The
basic idea is that you record nothing persistently except for a
sequence of HTTP-POST-level changes, and then you define your program
in terms of SQL-like queries (perhaps with automatically-maintained
materialized views) on the entire history of all the user interface
actions; the back-end data store is append-only, so it demands little
in the way of sophisticated concurrency control.  In theory, this
could make disconnected operation and synchronization work
automatically with little hassle.  I've written one or two small
programs this way, and it does seem to work reasonably well; it
remains to be seen whether it would work well in larger programs
maintained over longer periods of time.
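
A tiny sketch of that idea: persist nothing but an append-only log of
user-level actions, and derive whatever views the program needs by
querying the whole history (the action names here are invented):

    # Sketch of the "rumor-oriented" idea: the only persistent store is
    # an append-only log of user-interface-level actions; current state
    # is just a query over that history.
    log = []

    def record(action, **details):
        log.append({"action": action, **details})

    record("add_comment", post=7, text="nice article")
    record("add_comment", post=7, text="I disagree")
    record("delete_comment", post=7, text="nice article")

    # "Materialized view": comments on post 7 that were never deleted.
    deleted = {e["text"] for e in log if e["action"] == "delete_comment"}
    comments = [e["text"] for e in log
                if e["action"] == "add_comment" and e["text"] not in deleted]
    print(comments)                  # ['I disagree']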
