The new update compresses enwik9 to 134 MB in 55 minutes using 7 GB of memory. This is the top result on the large text benchmark that doesn't use a full-size dictionary trained on enwik9. https://encode.su/threads/4467-enwik9-preprocessor#post87460
Mostly it is a speed vs. size tradeoff. Mixing more components improves compression but adds time, and each component needs memory. I extended the tiny dictionary to match strings in the 36K boilerplate articles on U.S. places that make up about 15% of enwik9. That improved compression by 1% and took 1.5 days to analyze, code, test, and tune. The rest of the improvement is from the tedious process of making thousands of tiny changes that either make tiny improvements or get discarded. Just today I spent a few hours fiddling with component data representations and learning rates, only to make compression a bit worse and throw out the code.

You might be wondering why I'm not using AI to write the code for my AI program. That's because the purpose is not to produce a product. It is to understand how AI works. I can't do that if I don't write the code myself. It's like learning arithmetic. Using a calculator instead of a pencil is faster but defeats the purpose.

-- Matt Mahoney, [email protected]

On Tue, Feb 3, 2026, 6:38 PM Matt Mahoney <[email protected]> wrote:

> On Tue, Feb 3, 2026, 10:15 AM James Bowery <[email protected]> wrote:
>
>> On Mon, Feb 2, 2026 at 8:56 PM Matt Mahoney <[email protected]> wrote:
>>
>>> I released another update to my Hutter prize entry.
>>> https://encode.su/threads/4467-enwik9-preprocessor#post87076
>>>
>>> ...This is doable on a neural network with 10^9 parameters because the learning rate is only 5 bits per token for 200M tokens.
>>
>> "Token" refers to the residual byte pairs from your byte pair coding approach?
>
> No, I mean independent semantic units, mostly base words and suffixes. So "rotations" would be 3 tokens "rotate", "-tion" and "-s". The tiny 256 word dictionary I currently use codes most of the common suffixes as single bytes but base words are usually more than one byte.
> So I either need to use a larger dictionary coding tokens as either 1 or 2 bytes and hard code stemming rules, or model tokens in the context models. Most of the top compressors use the first approach but both improve compression in my experiments.
>
> The current release doesn't use either approach and only uses a general purpose context model, but even the tiny dictionary helps. I used the second approach in ZPAQ because it's simpler. This works by forming one or two word contexts by hashing whole words, ignoring case and folding any character sequence other than letters into single token boundaries.
>
> -- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tefdd3e588dd95259-Mac3c01f54bc3c4c2ef322412
Delivery options: https://agi.topicbox.com/groups/agi/subscription
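
For readers who want to see the word-context idea from the quoted message in concrete form, here is a minimal Python sketch. It is not ZPAQ's actual code: the function name `word_contexts`, the choice of a 32-bit FNV-1a hash, and the rule for combining two word hashes are all assumptions made for illustration. It lowercases ASCII letters, treats any run of non-letters as a single boundary, and reports a one-word and a two-word context hash after each word.

```python
FNV_OFFSET = 2166136261  # 32-bit FNV-1a offset basis
FNV_PRIME = 16777619     # 32-bit FNV prime
MASK = 0xFFFFFFFF


def word_contexts(data: bytes):
    """Return a list of (word, h1, h2) tuples, one per word in data.

    h1 hashes the current word alone; h2 mixes in the previous word's hash.
    Hypothetical helper for illustration, not taken from ZPAQ.
    """
    out = []
    word = bytearray()   # letters of the word being read, lowercased
    h = FNV_OFFSET       # running hash of the current word
    prev = 0             # hash of the previous completed word (0 = none)

    def flush():
        nonlocal h, prev
        if word:
            h2 = ((prev * FNV_PRIME) ^ h) & MASK  # assumed combining rule
            out.append((bytes(word), h, h2))
            prev = h
            word.clear()
            h = FNV_OFFSET

    for b in data:
        if 65 <= b <= 90:        # 'A'..'Z' -> 'a'..'z' (ignore case)
            b += 32
        if 97 <= b <= 122:       # a letter extends the current word
            word.append(b)
            h = ((h ^ b) * FNV_PRIME) & MASK
        else:                    # any run of non-letters is one boundary
            flush()
    flush()                      # emit a word that ends at end of input
    return out
```

With this sketch, `word_contexts(b"The  cat")` and `word_contexts(b"the, CAT!")` return the same list, since case and the exact separator characters are discarded before hashing. In a real context-mixing compressor the hashes would index prediction tables rather than be returned in a list; that part is left out here.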
