One of the basic principles of the [FONC/STEPS system] [STEPS] is that
you can reduce the overall complexity of software by choosing powerful
primitives that simplify many parts of a system.  OMeta is sort of their
poster child: a generic rewriting system powerful enough to write not
only simple, efficient parsers but also compiler optimizers and code
generators.

[STEPS]: http://vpri.org/fonc_wiki/index.php/Glossary

Here are some candidates for other primitives which, in practice,
simplify other parts of a computing system by more than the complexity they
add.

LZ77 obsoletes curses and other things
--------------------------------------

Using zlib to compress your data streams and files greatly reduces the
performance penalty for not optimizing your file formats and protocols
to reduce file and message size.  Here are some examples.


### curses ###

At the Barracas [hacklab][] tonight, I was talking about why curses, the
programming library, is useless today.  Curses goes to a lot of effort
to minimize the number of bytes sent to your terminal, because at 300 or
2400 or 9600 baud, redrawing an entire 80×24 screen would take 66 or 8
or 2 seconds.  So it sought the minimum amount of change to your screen
to get it to match its internal screen image.

[hacklab]: http://lab.hackcoop.com.ar/

Fast-forwarding to 2012, if I’m receiving those bytes over a
connection at all, they’re being sent over an SSH connection
compressed with zlib.  If you send the same screen image twice with
small differences, zlib will notice that you’re sending a blob of
bytes you’ve sent before, and just transmit a reference to the earlier
index in the bytestream where you sent those bytes before.  What’s
more, it will also compress runs of whitespace, words that were
mentioned before, things that were on previous screens, and so on.  It
actually does a much better job than curses at reducing the bandwidth
needed.
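
Here's a rough sketch of that effect with Python's zlib module,
compressing two successive screen images through a single stream the
way an `ssh -C` session does (the screen contents are invented for the
example):

    import zlib

    # Two 80x24 "screens" that differ only in a clock on the status line.
    def screen(clock):
        lines = [('this is line %d of some program output' % i).ljust(80)
                 for i in range(23)]
        lines.append(('status bar    %s' % clock).ljust(80))
        return ''.join(lines).encode()

    # Compress both screens through one zlib stream, flushing after each
    # so the receiving end could repaint immediately.
    c = zlib.compressobj()
    first = c.compress(screen('12:00')) + c.flush(zlib.Z_SYNC_FLUSH)
    second = c.compress(screen('12:01')) + c.flush(zlib.Z_SYNC_FLUSH)
    print(len(first), len(second))  # the second screen costs only a few dozen bytes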

So basically all of the cleverness that went into the special-purpose
screen-update optimization algorithms in curses (and Emacs, and other
things) has been replaced by a more general-purpose algorithm that
actually works better.  It requires a bit more CPU, and can’t be
executed by a nonprogrammable terminal, but that’s okay.

(There’s a separate question of how important it is to conserve the
bandwidth of text login sessions when we have megabits instead of
kilobits.)


### SNMP and BER in general are ad-hoc compression that doesn’t work ###

Similarly, it turns out that if you represent the contents of a
typical SNMP packet as simple ASCII text with explicit name-value
pairs (with short names), and then compress it with either zlib LZ77
or the LZW implementation in `compress`, it ends up about the same
length as the BER-represented format that SNMP actually uses.

The following [sample packet][] is 45 bytes in BER (the outer SEQUENCE
declares 43 bytes of content plus its 2-byte header).

[sample packet]: http://www.pccl.demon.co.uk/papers/snmpdecode.html

    302B            SEQUENCE (LEN = 43)
    0201 00                 Version 1
    0404 55544D43           Community = UTMC
    A320                    SetRequestPDU (context constructed 3) (Len = 32)
    0204 023712FB                   REQID = 0X023712FB
    0201 00                         Error = NONE
    0201 00                         ErrIndex = 0
    3012                            SEQUENCE (LEN = 18)
    3010                                    SEQUENCE (LEN = 16)
    060B 512C010401010401010100                     OID = 2.1.44.1.4.1.1.4.1.1.1.0
    0201 01                                         Integer = 1

The following version is 45 bytes in ASCII; `compress` makes it 44 bytes,
`gzip -9c` makes it 54 bytes, and `bzip2` makes it 79 bytes.

    UTMC
    023712FB
    set 2.1.44.1.4.1.1.4.1.1.1.0=1

If we run it through a stupid expansion phase (“binary dump”) where we write
out two bytes per bit, it becomes 721 bytes, which gzip compresses down to 106
bytes, bzip2 down to 99 bytes, and LZW `compress` down to 138 bytes.  So even
fairly extreme inefficiencies in representational choice only cost a few dozen
bytes of overhead.
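
The zlib and bz2 numbers are easy to reproduce with Python's standard
modules (the LZW figures need the external `compress` tool, and the
exact byte counts shift a little with header overhead and compression
level):

    import bz2, zlib

    ascii_form = b'UTMC\n023712FB\nset 2.1.44.1.4.1.1.4.1.1.1.0=1\n'
    print(len(ascii_form))                       # 45 bytes uncompressed
    print(len(zlib.compress(ascii_form, 9)))     # LZ77 (deflate), minus gzip's header
    print(len(bz2.compress(ascii_form)))         # Burrows-Wheeler; its fixed
                                                 # overhead dominates tiny inputs

    # The stupid expansion phase: two bytes per bit of the ASCII form.
    bits = ''.join('%d\n' % (byte >> i & 1) for byte in ascii_form
                   for i in range(8))
    print(len(bits))                             # two ASCII bytes per original bit
    print(len(zlib.compress(bits.encode(), 9)))  # back down to around 100 bytes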

This is nearly a best case for BER.  SNMP packets in the wild commonly contain
multiple similar OIDs, long sequences of zero bytes, and similar
highly-compressible nonsense.  (To say nothing of endianness errors,
signedness errors, and integer truncation errors.)

The ad-hoc compression in SNMP is similar to the ad-hoc compression in
DNS, which would also be well-served by an ASCII text protocol with
optional compression.


### LBX was a waste of effort because LZ77 works better ###

Similarly, X11R6.3, one of the versions of X-Windows in the 1990s,
added a protocol extension called LBX, Low-Bandwidth X.  A great deal
of ingenuity went into reducing the size of messages in the X11
protocol, but LBX met the same fate as the cleverness in curses;
mostly quoting Wikipedia here:

> LBX was never widely deployed as it did not offer significant speed
> improvements. The slow links it was introduced to help were
> typically insecure, and RFB (VNC) over a SSH connection — which
> includes compression — proved faster than LBX, and also provided
> session resumption.
> 
> Finally, it was shown that greater speed improvements to X could be
> obtained for all networked environments with replacement of X’s
> antiquated font system...

Unfortunately, I don’t think ASN.1 BER or curses are likely to meet
the same fate as LBX any time soon.  But if you were designing a
system today, with the knowledge of LZW and LZ77, you could avoid them
entirely.


### Git uses compression instead of storing file diffs ###

Obsolete version-control systems like RCS and Subversion use the
worst-case O(N²) longest-common-subsequence algorithm to determine the
smallest possible edit to get from the current version of a file to its
previous version, or vice versa, so that storing all the past versions
of a file takes only slightly more space than storing the current
version of it.  This complicates file renaming and fails to save any
space when you're using the wrong newline convention.  However, it means
that the diffs are already computed and, in theory, available for use in
merging branches.

Git, instead, concatenates many different versions of many different
files into a “packfile”, sorted by a crude measure of similarity, and
compresses it with LZ77.  This turns out to produce a smaller total
filesize, and it eliminates the coupling between the storage format and
merge algorithms that has plagued many obsolete version control systems.
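
The payoff from concatenating similar things before compressing is
easy to see with zlib alone; this is only a sketch of the principle,
not of Git's actual packfile format:

    import zlib

    # Three versions of a "file" that differ by a line or two each.
    base = '\n'.join('line %d of some source file' % i for i in range(200))
    versions = [base,
                base.replace('line 17', 'line seventeen'),
                base + '\nand one more line at the end']

    separately = sum(len(zlib.compress(v.encode(), 9)) for v in versions)
    together = len(zlib.compress('\0'.join(versions).encode(), 9))
    # Compressing the concatenation costs little more than compressing
    # one version, because the later copies become back-references.
    print(separately, together)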


### Complexity and adaptability to pathological cases ###

LZW is only about 50 lines of code, which is a lot less than LZ77, and
it has less overhead than `gzip`, but neither it nor Burrows-Wheeler
does as well as LZ77 compression in extreme cases; the Gnumeric file
mentioned below is 9kB compressed with `compress`.  Some screen updates
I tested --- concatenating slightly modified copies of a 2.5K screen
image --- were an average of 140 bytes each with `gzip`, but 1200 bytes
each with `compress` and 270 bytes each with `bzip2`.


### Summary ###

So, to a great extent, you can forget about the space-efficiency of your file
formats and wire formats if you run them through a generic compression
algorithm as a last step, and optimize them entirely for readability,
extensibility, and simplicity.  Gnumeric takes this approach.  A particular
Gnumeric spreadsheet I have handy here, representing a screenful of text, is
2.7kB on disk; `gzip -d` on it produces 32kB of XML.


SAT solvers and theorem provers
-------------------------------

I assert without proof that a substantial chunk of compiler
optimizations could be solved with a general-purpose SAT solver, and
even more with a general-purpose theorem prover.
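
For a taste of what the theorem-prover half of that claim might look
like, here is a sketch using the Z3 solver's Python bindings (my
choice of tool, not something asserted above) to check a peephole
rewrite an optimizer might want to make: replacing `x * 2` with
`x << 1` on 32-bit integers.

    from z3 import BitVec, Solver, unsat

    x = BitVec('x', 32)
    s = Solver()
    # Ask the solver for a counterexample to the rewrite; "unsat" means
    # no 32-bit x distinguishes the two expressions, so the rewrite is safe.
    s.add(x * 2 != x << 1)
    print('rewrite is sound' if s.check() == unsat
          else 'counterexample: %s' % s.model())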


Suffix arrays provide searching, sorting, indexing, and compression
-------------------------------------------------------------------

Computer systems have a lot of code in them to do search.  LZ77
implementations must search for previous occurrences of the current
input string within their window.  Compilers and linkers must search
for symbols in symbol tables.  Databases must search by various keys,
including occasionally finding keys within ranges.  Full-text search
engines must search for terms that occur anywhere within documents.
Filesystems must search for filenames within a directory.


### Full-text indices subsume single-column RDBMS indices ###

There’s a small software company in Florida called askSam, which sells
a piece of software also called “askSam”, billed as a “free-form
database”.  Basically you divide your data into records.  Your records
are basically strings of text, but there’s a convention for “fields”:
a field named FOO is the text following the string “FOO[” up to the
next “]”.  This sounds kind of hokey, but with a little bit of
full-text indexing, an awful lot of queries get performance comparable
to the performance they’d get in a traditional relational database
with, you know, columns and shit.  (There are some queries that don’t,
of course, like multicolumn queries.)
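
The field convention is simple enough that a couple of lines of Python
can pull fields out of a record; this is just my reading of the
convention described above, not askSam's actual parser, and the record
is invented:

    import re

    def fields(record):
        # A field named FOO is the text between "FOO[" and the next "]".
        return dict(re.findall(r'(\w+)\[([^\]]*)\]', record))

    record = 'DATE[1997-03-04] FROM[someone] Free-form text about anything at all.'
    print(fields(record)['FROM'])   # -> 'someone'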

So, in a sense, full-text indices subsume the more normal kinds of
database indices as a special case.  (Except for multicolumn indices,
and even for single-column queries, they’re probably worse by constant
factors.)


### Suffix arrays ###

One particularly interesting kind of full-text index is the suffix
array.  In a suffix array, you consider your entire corpus (database)
as one long string, and the index is an array of integers, one integer
per index point, and that integer is the offset into the string of
that index point.  (In the usual case of indexing a document full of
words, you might have an index point at the beginning of each word; if
you want to support arbitrary substring search, you might instead have
an index point at every character.)  The tricky part is that the
integers are sorted in the *lexicographical order of the suffixes of
the corpus starting at the points they point to*.  Which means that
all the index points followed by, say, “walnuts” are in a single run
in the array.  So you can binary-search in the array to find all the
occurrences of a word, and the first thing alphabetically following
it.  (There’s a trick called an LCP-LR array that stores some extra
information so that you can find arbitrarily long search strings in
time essentially linear in their length.)
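
A toy version is only a few lines of Python if you give up the
linear-time construction and just sort the suffixes directly; the
binary search falls out of the standard library, using the `key=`
parameter that `bisect` grew in Python 3.10:

    import bisect

    def suffix_array(corpus):
        # Naive construction: sort every offset by the suffix starting there.
        # (Fine for a demo; the skew algorithm does this in linear time.)
        return sorted(range(len(corpus)), key=lambda i: corpus[i:])

    def occurrences(corpus, sa, term):
        # All suffixes beginning with `term` form one contiguous run in sa,
        # so binary-search for the start of the run and walk to its end.
        start = bisect.bisect_left(sa, term, key=lambda i: corpus[i:])
        end = start
        while end < len(sa) and corpus[sa[end]:].startswith(term):
            end += 1
        return sa[start:end]         # in lexicographical order of the suffixes

    corpus = 'we sell walnuts, walnut shells, and wallpaper'
    sa = suffix_array(corpus)
    print(occurrences(corpus, sa, 'walnut'))   # the two offsets where it occurs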

It turns out that there’s a linear-time algorithm for constructing a
suffix array in about 50 lines of code, called the “skew algorithm”,
discovered by Kärkkäinen and Sanders.  (This might sound like it’s
asymptotically better than the O(N lg N) needed to construct a regular
database column index, but that’s sort of an illusion.  I’d be very
surprised if it turned out to be faster in practice.  But I don’t
expect it to be many times slower or asymptotically slower, either.)

Another feature of suffix arrays is that they give you the suffixes
beginning with a given string in *sorted* order.  That means that if
you have a big text file, you can find the range of its suffix array
that begins with '\n' and iterate over it, and you have the lines of
the text file in sorted order.  (Except for the first.)  And if you
want your emails sorted by subject, you can get most of the way there
by looking at the suffixes of your email file that begin with
'\nSubject:' --- then you just need to search from there to the
beginning of each email, which can be done relatively efficiently.
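
Here's what the sorted-lines trick looks like with the same naive
suffix sort, prepending a '\n' so that the first line gets an index
point too:

    def sorted_lines(text):
        # Every suffix beginning with '\n' starts a line of the (prefixed)
        # text, and those suffixes appear in the suffix array in sorted order.
        corpus = '\n' + text
        sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
        return [corpus[i + 1:].split('\n', 1)[0]
                for i in sa if corpus[i] == '\n']

    print(sorted_lines('pear\napple\nbanana'))  # ['apple', 'banana', 'pear']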

So what would it mean if you could replace most sorting and searching
code with the use of a single suffix-array primitive?  Things would be
a constant factor slower, but how much slower?


### Burrows-Wheeler compression is easy once you have a suffix array ###

Also, revisiting the subject of generic compression algorithms, there's
Burrows-Wheeler compression, which commonly compresses substantially
better than LZW or LZ77; the hard part of implementing it is generating
a sort of suffix array.  Perhaps if you already have an efficient
suffix array
generation algorithm, implementing Burrows-Wheeler compression and
decompression might be easier than implementing LZ77 and almost as easy
as implementing LZW.  (You still need the other algorithms, though,
because Burrows-Wheeler isn't suitable for compressing streams such as
remote desktop sessions, and it does much worse than LZ77 at compressing
extremely repetitive things like Git packfiles or sequences of screen
images.)
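
Here's the forward transform built on the same kind of naive suffix
sort, using a sentinel character so that sorting suffixes is the same
as sorting rotations; the move-to-front and entropy-coding stages that
make it an actual compressor are omitted:

    def bwt(text, sentinel='\0'):
        # With a unique smallest character appended, the character just
        # before each suffix, taken in suffix-array order, is the BWT.
        s = text + sentinel
        sa = sorted(range(len(s)), key=lambda i: s[i:])
        return ''.join(s[i - 1] for i in sa)

    print(repr(bwt('banana')))   # 'annb\x00aa' -- equal characters cluster into runs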

There are “compressed full-text indices” or “self-indexes” based on the
Burrows-Wheeler transform that simultaneously provide the search functionality
of suffix arrays and the compressed-text-retrieval functionality of BWT
compression — in effect, the overhead of indexing your text is negative 60%.
So far I’ve been too intimidated by Navarro and Mäkinen's highly-cited 2006
[survey paper on the subject][Navarro] (67 pages long, including a five-page
bibliography) to actually read it, so I don’t know if they have the kind of
power-to-complexity ratio that the other things in here have.

[Navarro]: ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survcompr.ps.gz


### Replacing hash tables ###

It may be a little hard to see how you could replace hash tables in
general with suffix arrays, since hash tables often contain references
to garbage-collected objects in memory, not just text strings.  You
could generalize your strings to contain object references as well as
characters, which doesn’t cause any problems for the skew algorithm as
long as there’s a well-defined ordering for object references; you could
end up building strings like this:

    main=*
    usage=#
    strerror=&

where the punctuation characters are actually references to objects.
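
To make that concrete: all the naive suffix-array sketch above needs
is a total order on tokens, where a token is either a character or an
object reference.  Here is one arbitrary choice of ordering
(characters first, then objects by `id()`, which is only stable within
a single process):

    def token_key(token):
        # Characters sort among themselves, object references by identity;
        # any total order works as long as it stays stable while the index lives.
        return (0, token) if isinstance(token, str) else (1, id(token))

    def suffix_array(tokens):
        keyed = [token_key(t) for t in tokens]
        return sorted(range(len(tokens)), key=lambda i: keyed[i:])

    main, usage, strerror = object(), object(), object()
    corpus = (list('main=') + [main] + list('\nusage=') + [usage] +
              list('\nstrerror=') + [strerror] + ['\n'])
    sa = suffix_array(corpus)
    # Every suffix beginning with the tokens of "usage=" is now one
    # contiguous run in sa, just as with a plain text corpus.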