In other news, my Hutter prize preprocessor plus a custom ZPAQ model compresses enwik9 from 1000 MB to 145 MB in 13 minutes using 4.5 GB of memory, which places it near the Pareto frontier on the large text benchmark. https://encode.su/threads/4467-enwik9-preprocessor#post86938
There is only a minor change to the preprocessor in step C. The steps are:

A - article sorting by topic.

B - basic XML decoding to extract text and headers into separate streams.

C - capitalization and space modeling and escape coding of rare characters. The idea is to split the stream into tokens with independent semantics. Capital letters are coded as a special character followed by lower case. Then the first letter after a space is coded as upper case and the space is removed.

D - dictionary encoding. Each of 256 byte values decodes to a common group of letters found by byte pair encoding, restricted to parts of a word, a single digit, common punctuation, a space (not all spaces are removed), or a newline. This finds common suffixes like -s, -ed, and -ing, which are tokens in themselves.

These steps reduce enwik9 from 1000 MB to 580 MB in about 2 minutes, which speeds up the downstream context model and reduces memory usage. The output is then compressed with zpaqd, a ZPAQ development tool I wrote in 2014. It takes a config file describing the context mixing architecture, with the contexts computed by code in ZPAQL, a virtual machine byte code. I wrote an order 0-1-2-3-4-6 byte ICM-ISSE chain, an order 0-1 word chain, a match model, and a final order 0 mixer, whose output is arithmetic coded.

An ICM is an indirect context model. It maps a context to an 8 bit state representing a count of 0s and 1s and the most recent bit seen in that context. That state is mapped through a table of predictions, which is updated to reduce the prediction error by 0.1%. An order n context means the last n whole bytes plus the bits coded so far in the current byte.

An ISSE is an indirect secondary symbol estimator. It mixes the stretched prediction from the next lower order context with the constant 1 by weighted averaging and squashes the output to a prediction in the range 0 to 1.
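Going back to step C, the capitalization and space transform can be sketched as below. The escape byte 0x01 is an assumption for illustration, and the escape coding of rare characters is omitted.

```python
CAP = '\x01'  # assumed escape byte marking an explicit capital letter

def encode(text):
    # Code each capital letter as the escape byte plus lower case.
    s = ''.join(CAP + c.lower() if c.isupper() else c for c in text)
    # Remove a space before a lower case letter and code that letter
    # as upper case, so most spaces become implicit.
    out, i = [], 0
    while i < len(s):
        if s[i] == ' ' and i + 1 < len(s) and s[i + 1].islower():
            out.append(s[i + 1].upper())
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

def decode(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == CAP:          # escaped: a real capital letter
            out.append(s[i + 1].upper())
            i += 2
        elif s[i].isupper():     # implicit: a removed space + lower case
            out.append(' ' + s[i].lower())
            i += 1
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)
```

So "the cat" encodes to "theCat", and capitals and spaces each become single tokens with independent semantics.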
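Step D amounts to a 256 entry table lookup on decode and a greedy longest match on encode. The toy table below is an assumed stand-in; the real table is built by byte pair encoding over the corpus, restricted to the token classes listed above.

```python
# Toy stand-in for the 256-entry dictionary (assumed, for illustration):
# a few common letter groups plus the single letters a-z.
TABLE = ['ing', 'ed', 'th', 'er', 'he', ' '] + [chr(i) for i in range(97, 123)]

def dict_encode(text):
    # Greedy longest match against the table.
    by_len = sorted(range(len(TABLE)), key=lambda i: -len(TABLE[i]))
    out, i = [], 0
    while i < len(text):
        for j in by_len:
            if text.startswith(TABLE[j], i):
                out.append(j)
                i += len(TABLE[j])
                break
        else:
            raise ValueError('no table entry matches')  # real coder escapes these
    return bytes(out)

def dict_decode(codes):
    # Each byte value decodes to its group of letters.
    return ''.join(TABLE[b] for b in codes)
```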
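The prediction machinery (stretch, squash, the ICM table update, the ISSE step, and the final mixer) can be sketched as follows. The learning rates and the plain floating point weights are simplifying assumptions; the actual ZPAQ arithmetic uses fixed point tables, and predictions are kept strictly between 0 and 1.

```python
import math

def stretch(p):
    # x = ln(p/(1-p)); p must be strictly between 0 and 1
    return math.log(p / (1.0 - p))

def squash(x):
    # p = 1/(1 + e^-x), the inverse of stretch
    return 1.0 / (1.0 + math.exp(-x))

def icm_update(p, bit, rate=0.001):
    # ICM-style table update: move the stored prediction toward the
    # observed bit by about 0.1% of the error (rate is assumed).
    return p + (bit - p) * rate

def isse(p_lower, w, b):
    # One ISSE step: weighted average of the stretched lower order
    # prediction and the constant 1, then squash. The weight pair
    # (w, b) would be selected by a hash of the current context.
    return squash(w * stretch(p_lower) + b)

def mix(ps, weights):
    # Final mixer: squashed weighted sum of the stretched component
    # predictions, a 2 layer network with no hidden layer.
    return squash(sum(w * stretch(p) for w, p in zip(weights, ps)))

def mix_update(ps, weights, bit, lr=0.01):
    # Gradient-style weight update to reduce the prediction error
    # (the learning rate is an assumed placeholder).
    err = bit - mix(ps, weights)
    return [w + lr * err * stretch(p) for w, p in zip(weights, ps)]
```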
The ISSE weight is selected by a hash of the current context and is adjusted to reduce the prediction error by 0.1%. A prediction is stretched by x = ln(p/(1-p)) and squashed by the inverse, p = 1/(1 + e^-x). This makes a mixer a neural network with no hidden layer.

In a word chain, the context is a hash of the previous word (for order 1) and the bits of the partial word coded so far, skipping any non-letters. A match model searches for earlier long context matches using a hash table and predicts whatever bit came next, weighted by the length of the match.

A mixer is a 2 layer neural network (no hidden layer) that weights the stretched predictions from all the other components and outputs the squashed weighted sum as the final bit prediction. The weights are updated to reduce the prediction error. In an order 0 mixer, the weight vector is selected by the order 0 context, including the partial current byte.

The current leader on the large text benchmark is nncp, at 110 MB in 60 hours on an RTX 3090 GPU using a transformer network. The Hutter prize winner is fx2-cmix at 113 MB including the decompressor executable, limited to 70 hours and 10 GB in a single thread with no GPU. It is a context mixing algorithm like mine (using some of my PAQ code) but mixing many hundreds of models instead of just 10. https://mattmahoney.net/dc/text.html

-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T0518db1e3a0c25c5-Me78ccf134be65e0b2445bf3c
