In other news, my Hutter prize preprocessor plus a custom ZPAQ model
compresses enwik9 from 1000 MB to 145 MB in 13 minutes using 4.5 GB of
memory, which places it near the Pareto frontier on the large text
benchmark.
https://encode.su/threads/4467-enwik9-preprocessor#post86938

There is only a minor change to the preprocessor, in step C. The steps are:
A - article sorting by topic.
B - basic XML decoding to extract text and headers into separate streams.
C - capitalization and space modeling, and escape coding of rare characters.
The idea is to split the stream into tokens with independent semantics.
A capital letter is coded as a special escape character followed by the
lower case letter. Then the first letter after a space is coded as upper
case and the space is removed, so that a capital letter now implies a
preceding space (see the sketch after this list).
D - dictionary encoding. Each of the 256 byte values decodes to a common
group of letters found by byte pair encoding (also sketched below),
restricted to part of a word, a single digit, common punctuation, a space
(not all spaces are removed in step C), or a newline.
This finds common suffixes like -s, -ed, -ing, etc., which are tokens in
themselves.
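
To make step C concrete, here is a rough round trip sketch in C++. The
escape byte ESC = 0x01 is a hypothetical choice, only ASCII letters are
handled, and the escape coding of rare characters is omitted.

  #include <cctype>
  #include <iostream>
  #include <string>

  const char ESC = 0x01;  // hypothetical escape byte marking a true capital

  std::string encode(const std::string& in) {
    std::string out;
    bool after_space = false;
    for (unsigned char c : in) {
      if (c == ' ') { after_space = true; continue; }  // defer the space
      if (after_space) {
        after_space = false;
        if (islower(c)) { out += (char)toupper(c); continue; }  // space -> capital
        out += ' ';  // keep a space not followed by a lower case letter
      }
      if (isupper(c)) { out += ESC; out += (char)tolower(c); continue; }
      out += (char)c;
    }
    if (after_space) out += ' ';
    return out;
  }

  std::string decode(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
      unsigned char c = in[i];
      if (c == ESC && i + 1 < in.size()) {        // escaped true capital
        out += (char)toupper(in[++i]);
      } else if (isupper(c)) {                    // capital implies a space
        out += ' ';
        out += (char)tolower(c);
      } else {
        out += (char)c;
      }
    }
    return out;
  }

  int main() {
    std::string s = "The cat. The Cat sat";
    std::string e = encode(s);
    std::cout << (decode(e) == s ? "round trip ok" : "mismatch") << "\n";
  }

And a minimal sketch of how a step D dictionary might be built by byte
pair encoding: repeatedly replace the most frequent adjacent symbol pair
with a new code. The token restrictions (word parts, digits, punctuation)
and the reuse of freed byte values are omitted here.

  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  int main() {
    std::string text = "the theme then there";  // toy training text
    std::vector<int> sym(text.begin(), text.end());
    std::vector<std::pair<int, int>> rules;     // new code -> pair it replaces
    for (int code = 256; code < 260; ++code) {  // real version reuses byte values
      std::map<std::pair<int, int>, int> freq;  // count adjacent pairs
      for (size_t i = 0; i + 1 < sym.size(); ++i) ++freq[{sym[i], sym[i + 1]}];
      if (freq.empty()) break;
      auto best = freq.begin();
      for (auto it = freq.begin(); it != freq.end(); ++it)
        if (it->second > best->second) best = it;
      if (best->second < 2) break;              // nothing repeats; stop
      rules.push_back(best->first);
      std::vector<int> out;                     // apply the replacement
      for (size_t i = 0; i < sym.size(); ++i) {
        if (i + 1 < sym.size() && sym[i] == best->first.first &&
            sym[i + 1] == best->first.second) { out.push_back(code); ++i; }
        else out.push_back(sym[i]);
      }
      sym = out;
    }
    // sym now holds the tokenized text; rules is the decode dictionary
  }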

These steps reduce enwik9 from 1000 MB to 580 MB in about 2 minutes, which
speeds up the downstream context model and reduces memory usage. The output
is then compressed with zpaqd, a ZPAQ development tool I wrote in 2014. It
takes a config file describing the context mixing architecture, with the
contexts computed by code in ZPAQL, a virtual machine byte code. I wrote an
order 0-1-2-3-4-6 byte ICM-ISSE chain, an order 0-1 word chain, a match
model, and a final order 0 mixer, whose output is arithmetic coded.

An ICM is an indirect context model. It maps a context to an 8 bit state
representing a count of 0s and 1s and the most recent bit seen in that
context. That state indexes a table of predictions, and the selected
prediction is updated to reduce the prediction error by 0.1%. An order n
context means the last n
whole bytes plus the bits coded so far in the current byte.
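
For illustration, here is a simplified ICM in C++. The 8 bit state below
is just a pair of bounded counts plus the last bit, not ZPAQ's full bit
history state machine, the table size is arbitrary, and the 1/1024 step
stands in for the 0.1% adjustment.

  #include <cstdint>
  #include <vector>

  struct ICM {
    std::vector<uint8_t> state;  // context hash -> 8 bit bit history state
    std::vector<uint16_t> pr;    // state -> P(next bit = 1), 16 bit fixed point
    uint32_t cx = 0;             // index of the current context's state

    explicit ICM(int bits) : state(1u << bits, 0), pr(256, 32768) {}

    // prediction for the next bit, in 0..65535 (32768 = 1/2)
    int predict(uint32_t context_hash) {
      cx = context_hash & (state.size() - 1);
      return pr[state[cx]];
    }

    void update(int bit) {
      uint8_t& s = state[cx];
      // nudge this state's prediction ~0.1% of the way toward the bit
      pr[s] += ((bit << 16) - (int)pr[s]) >> 10;
      // simplified transition: bump one count, decay the other, keep the bit
      int n0 = s >> 5, n1 = (s >> 2) & 7;
      if (bit) { if (n1 < 7) ++n1; if (n0 > 2) n0 = 2; }
      else     { if (n0 < 7) ++n0; if (n1 > 2) n1 = 2; }
      s = (uint8_t)((n0 << 5) | (n1 << 2) | bit);
    }
  };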

An ISSE is an indirect secondary symbol estimator. It mixes the stretched
prediction from the next lower order context with the constant 1 by
weighted averaging and squashes the output to a prediction in the range 0
to 1. The weights are selected by a hash of the current context and are
adjusted to reduce the prediction error by 0.1%. A prediction is stretched
by x = ln(p/(1-p)) and squashed by the inverse, p = 1/(1 + e^-x). This
makes an ISSE a small mixer, a neural network with no hidden layer. In the
word chain, the context is a hash of the previous word (for order 1) and
the bits of the partial word coded so far, skipping any non-letters.
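
A floating point sketch of stretch, squash, and one ISSE step (zpaqd uses
fixed point tables; the learning rate here is only a stand-in for the 0.1%
adjustment, and the table size is arbitrary):

  #include <cmath>
  #include <cstdint>
  #include <utility>
  #include <vector>

  double stretch(double p) { return std::log(p / (1 - p)); }
  double squash(double x) { return 1 / (1 + std::exp(-x)); }

  struct ISSE {
    // each context selects a weight pair: w0 for the constant 1 input,
    // w1 for the stretched lower order prediction
    std::vector<std::pair<double, double>> w;
    uint32_t cx = 0;  // current weight index
    double st = 0;    // saved stretched input for the update

    explicit ISSE(int bits) : w(1u << bits, {0.0, 1.0}) {}  // identity at start

    double predict(uint32_t context_hash, double p_lower) {
      cx = context_hash & (w.size() - 1);
      st = stretch(p_lower);
      return squash(w[cx].first + w[cx].second * st);
    }

    void update(int bit, double p_out, double rate = 0.001) {
      double err = bit - p_out;         // move a fraction of the error
      w[cx].first += rate * err;        // gradient wrt the constant 1
      w[cx].second += rate * err * st;  // gradient wrt the stretched input
    }
  };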

A match model searches for earlier long context matches using a hash table
and predicts whatever bit came next, weighted by the length of the match.
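
A minimal sketch of the idea, with made up sizes: a hash of the last 8
bytes (the real match order is different) indexes a table of positions in
the history, and the confidence formula is only illustrative.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct MatchModel {
    std::vector<uint8_t> buf;          // history of all bytes seen
    std::vector<uint32_t> table;       // context hash -> position after it
    uint32_t ptr = 0, len = 0;         // match pointer and current length

    explicit MatchModel(int bits) : table(1u << bits, 0) {}

    // call once per whole byte to maintain the current match
    void add_byte(uint8_t c) {
      if (len && ptr < buf.size() && buf[ptr] == c) { ++len; ++ptr; }
      else len = 0;                    // prediction failed; drop the match
      buf.push_back(c);
      uint32_t h = 0;                  // hash of the last 8 bytes
      size_t n = buf.size();
      for (size_t j = (n > 8 ? n - 8 : 0); j < n; ++j) h = h * 773 + buf[j];
      uint32_t idx = h & (uint32_t)(table.size() - 1);
      if (!len && table[idx]) { ptr = table[idx]; len = 1; }  // new candidate
      table[idx] = (uint32_t)n;        // position of the byte to come next
    }

    // stretched prediction for the next bit: the sign is the predicted
    // bit, the magnitude grows with the match length, 0 means no opinion
    double predict_stretched(int bits_seen, int partial_byte) {
      if (!len || ptr >= buf.size()) return 0;
      int expected = buf[ptr];         // byte that followed the match before
      if ((expected >> (8 - bits_seen)) != partial_byte) return 0;
      int bit = (expected >> (7 - bits_seen)) & 1;
      double conf = std::min<uint32_t>(len, 32) / 4.0;
      return bit ? conf : -conf;
    }
  };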

A mixer is a 2 layer neural network (no hidden layer) that weights the
stretched predictions from all the other components and outputs the
squashed weighted sum as the final bit prediction. The weights are updated
to reduce the prediction error. In an order 0 mixer, the weight vector is
selected by the order 0 context including the partial current byte.
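
A floating point sketch of such a mixer (zpaqd uses fixed point); the
initial weights and learning rate are illustrative.

  #include <cmath>
  #include <cstdint>
  #include <vector>

  struct Mixer {
    int n;                               // number of input models
    std::vector<std::vector<double>> w;  // one weight vector per context
    std::vector<double> in;              // stretched inputs for this bit
    uint32_t cx = 0;                     // selected weight vector

    Mixer(int n_inputs, int n_contexts)
        : n(n_inputs), w(n_contexts, std::vector<double>(n_inputs, 0.3)) {}

    // inputs are the stretched predictions of the other components
    double mix(uint32_t context, const std::vector<double>& stretched) {
      cx = context % w.size();
      in = stretched;
      double dot = 0;
      for (int i = 0; i < n; ++i) dot += w[cx][i] * in[i];
      return 1 / (1 + std::exp(-dot));   // squash to the final probability
    }

    // online gradient step after the coded bit is known
    void update(int bit, double p, double rate = 0.002) {
      double err = bit - p;
      for (int i = 0; i < n; ++i) w[cx][i] += rate * err * in[i];
    }
  };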

The current leader on the large text benchmark is nncp, at 110 MB in 60
hours on an RTX-3090 GPU using a transformer network. The Hutter prize
winner is fx2-cmix at 113 MB including the decompressor executable, limited
to 70 hours and 10 GB in a single thread with no GPU. It is a context
mixing algorithm like mine (using some of my PAQ code) but mixing many
hundreds of models instead of just 10.
https://mattmahoney.net/dc/text.html

-- Matt Mahoney, [email protected]
