One of the basic principles of the [FONC/STEPS system][STEPS] is that
you can reduce the overall complexity of software by choosing powerful
primitives that simplify many parts of a system.  OMeta is sort of
their poster child: a generic rewriting system powerful enough to
write simple, efficient parsers with, but also compiler optimizers and
code generators.
[STEPS]: http://vpri.org/fonc_wiki/index.php/Glossary

Here are some candidates for other primitives which, in practice,
simplify other parts of a computing system by more than the complexity
they add.

LZ77 obsoletes curses and other things
--------------------------------------

Using zlib to compress your data streams and files greatly reduces the
performance penalty for not optimizing your file formats and protocols
to reduce file and message size.  Here are some examples.

### curses ###

At the Barracas [hacklab][] tonight, I was talking about why curses,
the programming library, is useless today.  Curses goes to a lot of
effort to minimize the number of bytes sent to your terminal, because
at 300 or 2400 or 9600 baud, redrawing an entire 80×24 screen would
take 66 or 8 or 2 seconds.  So it seeks the minimum amount of change
to your screen to get it to match its internal screen image.

[hacklab]: http://lab.hackcoop.com.ar/

Fast-forwarding to 2012, if I’m receiving those bytes over a
connection at all, they’re being sent over an SSH connection
compressed with zlib.  If you send the same screen image twice with
small differences, zlib will notice that you’re sending a blob of
bytes you’ve sent before, and just transmit a reference to the earlier
index in the bytestream where you sent those bytes before.  What’s
more, it will also compress runs of whitespace, words that were
mentioned before, things that were on previous screens, and so on.  It
actually does a much better job than curses at reducing the bandwidth
needed.

So basically all of the cleverness that went into the special-purpose
screen-update optimization algorithms in curses (and Emacs, and other
things) has been replaced by a more general-purpose algorithm that
actually works better.  It requires a bit more CPU, and can’t be
executed by a nonprogrammable terminal, but that’s okay.

(There’s a separate question of how important it is to conserve the
bandwidth of text login sessions when we have megabits instead of
kilobits.)

### SNMP and BER in general are ad-hoc compression that doesn’t work ###

Similarly, it turns out that if you represent the contents of a
typical SNMP packet as simple ASCII text with explicit name-value
pairs (with short names), and then compress it with either zlib’s LZ77
or the LZW implementation in `compress`, it ends up about the same
length as the BER-represented format that SNMP actually uses.

The following [sample packet][] is 53 bytes in BER.

[sample packet]: http://www.pccl.demon.co.uk/papers/snmpdecode.html

    302B  SEQUENCE (LEN = 43)
    0201 00  Version 1
    0404 55544D43  Community = UTMC
    A320  SetRequestPDU (context constructed 3) (Len = 32)
    0204 023712FB  REQID = 0X023712FB
    0201 00  Error = NONE
    0201 00  ErrIndex = 0
    3012  SEQUENCE (LEN = 18)
    3010  SEQUENCE (LEN = 16)
    060B 512C010401010401010100  OID = 2.1.44.1.4.1.1.4.1.1.1.0
    0201 01  Integer = 1

The following version is 45 bytes in ASCII; `compress` makes it 44
bytes, `gzip -9c` makes it 54 bytes; `bzip2` makes it 79 bytes.

    UTMC 023712FB set 2.1.44.1.4.1.1.4.1.1.1.0=1

If we run it through a stupid expansion phase (“binary dump”) where we
write out two bytes per bit, it becomes 721 bytes, which gzips down to
106 bytes, bzip2s down to 99 bytes, or LZW-compresses down to 138
bytes.  So even fairly extreme inefficiencies in representational
choice only cost a few dozen bytes of overhead.

This is nearly a best case for BER.  SNMP packets in the wild commonly
contain multiple similar OIDs, long sequences of zero bytes, and
similar highly-compressible nonsense.  (To say nothing of endianness
errors, signedness errors, and integer truncation errors.)
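If you want to try this kind of measurement yourself, here’s a rough
sketch in Python (my own, using only the standard `zlib` module; the
exact byte counts won’t match the command-line `compress` and
`gzip -9c` figures above, since each container format carries its own
header overhead, but the comparison comes out the same way):

    # Rough sketch of the measurement above: compress the ASCII
    # rendering of the sample SetRequest and compare with the 53-byte
    # BER encoding.
    import zlib

    ascii_packet = b"UTMC 023712FB set 2.1.44.1.4.1.1.4.1.1.1.0=1\n"
    print("ASCII:        ", len(ascii_packet), "bytes (BER: 53 bytes)")
    print("zlib level 9: ", len(zlib.compress(ascii_packet, 9)), "bytes")

    # Raw DEFLATE, with no zlib header or checksum: roughly what an SSH
    # channel with compression enabled adds around each message.
    deflater = zlib.compressobj(9, zlib.DEFLATED, -15)
    raw = deflater.compress(ascii_packet) + deflater.flush()
    print("raw DEFLATE:  ", len(raw), "bytes")

    # The "stupid expansion" experiment: write out two bytes per bit,
    # then see how little the compressed size grows.
    bits = b"".join(b"1 " if byte & (1 << k) else b"0 "
                    for byte in ascii_packet for k in range(7, -1, -1))
    print("bit dump:     ", len(bits), "bytes, compresses to",
          len(zlib.compress(bits, 9)), "bytes")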
The ad-hoc compression in SNMP is similar to the ad-hoc compression in
DNS, which would also be well-served by an ASCII text protocol with
optional compression.

### LBX was a waste of effort because LZ77 works better ###

Similarly, X11R6.3, one of the versions of X-Windows in the 1990s,
added a protocol extension called LBX, Low-Bandwidth X.  A great deal
of ingenuity went into reducing the size of messages in the X11
protocol, and into the extension added to the X server to support it.
It met the same fate as the cleverness in curses; mostly quoting
Wikipedia here:

> LBX was never widely deployed as it did not offer significant speed
> improvements.  The slow links it was introduced to help were
> typically insecure, and RFB (VNC) over an SSH connection — which
> includes compression — proved faster than LBX, and also provided
> session resumption.
>
> Finally, it was shown that greater speed improvements to X could be
> obtained for all networked environments with replacement of X’s
> antiquated font system...

Unfortunately, I don’t think ASN.1 BER or curses are likely to meet
the same fate as LBX any time soon.  But if you were designing a
system today, with the knowledge of LZW and LZ77, you could avoid them
entirely.

### Git uses compression instead of storing file diffs ###

Obsolete version-control systems like RCS and Subversion use the
worst-case O(N²) longest-common-subsequence algorithm to determine the
smallest possible edit to get from the current version of a file to
its previous version, or vice versa, so that storing all the past
versions of a file takes only slightly more space than storing the
current version of it.  This complicates file renaming and fails to
save any space when you're using the wrong newline convention.
However, it means that the diffs are already computed and, in theory,
available for use in merging branches.

Git, instead, concatenates many different versions of many different
files into a “packfile”, sorted by a crude measure of similarity, and
compresses it with LZ77.  This turns out to produce a smaller total
filesize, and it eliminates the coupling between the storage format
and merge algorithms that has plagued many obsolete version control
systems.

### Complexity and adaptability to pathological cases ###

LZW is only about 50 lines of code, which is a lot less than LZ77, and
it has less overhead than `gzip`, but neither it nor Burrows-Wheeler
does as well as LZ77 compression in extreme cases; the Gnumeric file
mentioned below is 9kB compressed with `compress`.  Some screen
updates I tested --- concatenating slightly modified copies of a 2.5K
screen image --- were an average of 140 bytes each with `gzip`, but
1200 bytes each with `compress` and 270 bytes each with `bzip2`.

### Summary ###

So, to a great extent, you can forget about the space-efficiency of
your file formats and wire formats if you run them through a generic
compression algorithm as a last step, and optimize them entirely for
readability, extensibility, and simplicity.  Gnumeric takes this
approach.  A particular Gnumeric spreadsheet I have handy here,
representing a screenful of text, is 2.7kB on disk; `gzip -d` on it
produces 32kB of XML.

SAT solvers and theorem provers
-------------------------------

I assert without proof that a substantial chunk of compiler
optimizations could be solved with a general-purpose SAT solver, and
even more with a general-purpose theorem prover.
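As one hedged illustration of what that might look like, here is a
sketch using the Z3 SMT solver’s Python bindings (my choice of tool,
not something asserted above) to check whether a few peephole rewrites
on 32-bit machine words are sound:

    # Letting a general-purpose solver justify peephole optimizations:
    # a rewrite is sound iff the solver can find no input on which the
    # "before" and "after" expressions differ.
    from z3 import BitVec, Solver, URem, unsat

    def check_rewrite(name, before, after):
        s = Solver()
        s.add(before != after)        # ask the solver for a counterexample
        if s.check() == unsat:
            print(name, ": sound")
        else:
            print(name, ": UNSOUND, counterexample:", s.model())

    x = BitVec("x", 32)

    check_rewrite("x * 8  =>  x << 3", x * 8, x << 3)
    # sound: multiplication and shift both wrap mod 2**32

    check_rewrite("(x << 1) >> 1  =>  x", (x << 1) >> 1, x)
    # unsound: the left shift discards the top bit, and Z3 reports an x
    # that exposes the difference

    check_rewrite("unsigned x % 16  =>  x & 15", URem(x, 16), x & 15)
    # sound for unsigned remainder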
Suffix arrays provide searching, sorting, indexing, and compression
-------------------------------------------------------------------

Computer systems have a lot of code in them to do search.  LZ77
implementations must search for previous occurrences of the current
input string within their window.  Compilers and linkers must search
for symbols in symbol tables.  Databases must search by various keys,
including occasionally finding keys within ranges.  Full-text search
engines must search for terms that occur anywhere within documents.
Filesystems must search for filenames within a directory.

### Full-text indices subsume single-column RDBMS indices ###

There’s a small software company in Florida called askSam, which
sells a piece of software also called “askSam”, billed as a
“free-form database”.  Basically you divide your data into records.
Your records are basically strings of text, but there’s a convention
for “fields”: a field named FOO is the text following the string
“FOO[” up to the next “]”.

This sounds kind of hokey, but with a little bit of full-text
indexing, an awful lot of queries get performance comparable to the
performance they’d get in a traditional relational database with, you
know, columns and shit.  (There are some queries that don’t, of
course, like multicolumn queries.)

So, in a sense, full-text indices subsume the more normal kinds of
database indices as a special case.  (Except for multicolumn indices;
and even for single-column queries, they’re probably worse by constant
factors.)

### Suffix arrays ###

One particularly interesting kind of full-text index is the suffix
array.  In a suffix array, you consider your entire corpus (database)
as one long string, and the index is an array of integers, one integer
per index point; each integer is the offset into the string of its
index point.  (In the usual case of indexing a document full of words,
you might have an index point at the beginning of each word; if you
want to support arbitrary substring search, you might instead have an
index point at every character.)

The tricky part is that the integers are sorted in the
*lexicographical order of the suffixes of the corpus starting at the
points they point to*.  Which means that all the index points followed
by, say, “walnuts” are in a single run in the array.  So you can
binary-search in the array to find all the occurrences of a word, and
the first thing alphabetically following it.  (There’s a trick called
an LCP-LR array that stores some extra information so that you can
find arbitrarily long strings in essentially linear time.)

It turns out that there’s a linear-time algorithm for constructing a
suffix array in about 50 lines of code, called the “skew algorithm”,
discovered by Kärkkäinen and Sanders.  (This might sound like it’s
asymptotically better than the O(N lg N) needed to construct a regular
database column index, but that’s sort of an illusion.  I’d be very
surprised if it turned out to be faster in practice.  But I don’t
expect it to be many times slower or asymptotically slower, either.)

Another feature of suffix arrays is that they give you the suffixes
beginning with a given string in *sorted* order.  That means that if
you have a big text file, you can find the range of its suffix array
that begins with '\n' and iterate over it, and you have the lines of
the text file in sorted order.  (Except for the first line, which no
'\n' precedes.)  And if you want your emails sorted by subject, you
can get most of the way there by looking at the suffixes of your email
file that begin with '\nSubject:' --- then you just need to search
backwards from there to the beginning of each email, which can be done
relatively efficiently.
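Here’s a naive sketch of those ideas in Python: suffix-array
construction by plain sorting (a real implementation would use the
skew algorithm mentioned above), binary search for every occurrence of
a word, and enumeration of a file’s lines in sorted order via its
'\n' suffixes.  The little corpus and the function names are just for
illustration.

    def suffix_array(text):
        """Offsets of all suffixes of text, sorted lexicographically.
        This is O(n^2 log n); the skew algorithm does it in O(n)."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_range(text, sa, query):
        """Half-open range [lo, hi) of sa whose suffixes start with query."""
        def first(pred):              # index of first suffix satisfying pred
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                if pred(text[sa[mid]:sa[mid] + len(query)]):
                    hi = mid
                else:
                    lo = mid + 1
            return lo
        return (first(lambda prefix: prefix >= query),
                first(lambda prefix: prefix > query))

    corpus = "the walnuts\nthe cashews\nthe other walnuts\n"
    sa = suffix_array(corpus)

    # All index points followed by "walnuts" form one contiguous run.
    lo, hi = find_range(corpus, sa, "walnuts")
    print(sorted(sa[lo:hi]))          # byte offsets of every occurrence

    # Suffixes beginning with '\n' give the lines in sorted order
    # (except the first line, which no '\n' precedes).
    lo, hi = find_range(corpus, sa, "\n")
    for i in sa[lo:hi]:
        end = corpus.find("\n", i + 1)
        if end != -1:
            print(corpus[i + 1:end])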
So what would it mean if you could replace most sorting and searching
code with the use of a single suffix-array primitive?  Things would be
a constant factor slower, but how much slower?

### Burrows-Wheeler compression is easy once you have a suffix array ###

Also, revisiting the subject of generic compression algorithms,
there's Burrows-Wheeler compression, which is commonly substantially
more efficient than LZW or LZ77; the hard part of implementing it is
generating a sort of suffix array.  Perhaps if you already have an
efficient suffix array generation algorithm, implementing
Burrows-Wheeler compression and decompression might be easier than
implementing LZ77 and almost as easy as implementing LZW; a sketch of
this appears at the end of this note.  (You still need the other
algorithms, though, because Burrows-Wheeler isn't suitable for
compressing streams such as remote desktop sessions, and it does much
worse than LZ77 at compressing extremely repetitive things like Git
packfiles or sequences of screen images.)

There are “compressed full-text indices” or “self-indexes” based on
the Burrows-Wheeler transform that simultaneously provide the search
functionality of suffix arrays and the compressed-text-retrieval
functionality of BWT compression — in effect, the overhead of indexing
your text is negative 60%.  So far I’ve been too intimidated by
Navarro and Mäkinen's highly-cited 2006 [survey paper on the
subject][Navarro] (67 pages long, including a five-page bibliography)
to actually read it, so I don’t know if they have the kind of
power-to-complexity ratio that the other things in here have.

[Navarro]: ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survcompr.ps.gz

### Replacing hash tables ###

It may be a little hard to see how you could replace hash tables in
general with suffix arrays, since hash tables often contain references
to garbage-collected objects in memory, not just text strings.  You
could generalize your strings to contain object references as well as
characters, which doesn’t cause any problems for the skew algorithm as
long as there’s a well-defined ordering for object references; you
could end up building strings like this:

    main=* usage=# strerror=&

where the punctuation characters are actually references to objects.
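Finally, here is the promised sketch of how the Burrows-Wheeler
transform falls out of a suffix array.  The construction below is the
naive sort rather than the skew algorithm, and the inverse is the slow
textbook reconstruction; both are for illustration only, not for
compressing anything large.

    def suffix_array(s):
        """Offsets of all suffixes of s, sorted lexicographically."""
        return sorted(range(len(s)), key=lambda i: s[i:])

    def bwt(s, sentinel="\0"):
        """BWT of s; the sentinel must not occur in s and must sort first."""
        s += sentinel
        # The transform is the character just *before* each suffix,
        # taken in suffix-array order (s[-1] handles the wrap-around).
        return "".join(s[i - 1] for i in suffix_array(s))

    def inverse_bwt(t, sentinel="\0"):
        """Invert the transform by repeatedly prepending and sorting."""
        rows = [""] * len(t)
        for _ in range(len(t)):
            rows = sorted(t[i] + rows[i] for i in range(len(t)))
        return next(row for row in rows if row.endswith(sentinel))[:-1]

    text = "banana bandana banana"
    print(repr(bwt(text)))        # similar characters cluster into runs
    assert inverse_bwt(bwt(text)) == text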