The new update compresses enwik9 to 134 MB in 55 minutes using 7 GB of
memory. This is the top result on the large text benchmark among entries
that don't use a full-size dictionary trained on enwik9.
https://encode.su/threads/4467-enwik9-preprocessor#post87460

Mostly it is a speed vs. size tradeoff. Mixing more components improves
compression but adds time, and each component needs memory. I extended the
tiny dictionary to match strings in the 36K boilerplate articles on U.S.
places, which make up about 15% of enwik9. That improved compression by 1%
and took 1.5 days to analyze, code, test, and tune. The rest of the
improvement is from the tedious process of making thousands of tiny changes
that either make tiny improvements or get discarded. Just today I spent a
few hours fiddling with component data representations and learning rates,
only to make compression a bit worse and throw out the code.
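
To make the mixing tradeoff concrete, here is a minimal sketch, assuming
PAQ-style logistic mixing of bit predictions. The post does not spell out
how this entry mixes its components, so the Mixer class, the two toy
components, and the learning rate are all hypothetical. Real compressors
typically use fixed-point integer arithmetic rather than floats.

```python
import math

def stretch(p):
    """Map a probability to the logistic domain: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Inverse of stretch: map a logistic-domain value back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

class Mixer:
    """Hypothetical mixer: weighted sum of stretched predictions."""

    def __init__(self, n, lr=0.02):
        self.w = [0.0] * n   # one weight per component (memory cost)
        self.lr = lr         # learning rate, an illustrative value
        self.x = [0.0] * n
        self.p = 0.5

    def predict(self, probs):
        # One stretch and one multiply per component (time cost).
        self.x = [stretch(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit):
        # Gradient step that reduces coding cost for the bit just seen.
        err = bit - self.p
        for i, x in enumerate(self.x):
            self.w[i] += self.lr * err * x

def coded_bits(bits, lr=0.02):
    """Return (total coding cost in bits, final weights) for a bit list."""
    mixer = Mixer(2, lr)
    total = 0.0
    prev = 0
    for bit in bits:
        p_a = 0.9 if prev == 0 else 0.1   # toy component A: order-1 guess
        p_b = 0.5                          # toy component B: no information
        p1 = mixer.predict([p_a, p_b])     # mixed probability that bit == 1
        total += -math.log2(p1 if bit else 1.0 - p1)
        mixer.update(bit)
        prev = bit
    return total, mixer.w

bits = [i % 2 for i in range(1000)]        # alternating toy input
total, weights = coded_bits(bits)
print(f"{total:.1f} bits for {len(bits)} input bits, weights {weights}")
```

The useful component earns a positive weight and the uninformative one
stays at zero, yet both cost a stretch, a multiply, and a weight update
for every bit coded. That per-component cost is the time and memory side
of the tradeoff.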

You might be wondering why I'm not using AI to write the code for my AI
program. That's because the purpose is not to produce a product. It is to
understand how AI works. I can't do that if I don't write the code myself.
It's like learning arithmetic. Using a calculator instead of a pencil is
faster but defeats the purpose.

-- Matt Mahoney, [email protected]

On Tue, Feb 3, 2026, 6:38 PM Matt Mahoney <[email protected]> wrote:

> On Tue, Feb 3, 2026, 10:15 AM James Bowery <[email protected]> wrote:
>
>>
>> On Mon, Feb 2, 2026 at 8:56 PM Matt Mahoney <[email protected]>
>> wrote:
>>
>>> I released another update to my Hutter prize entry.
>>> https://encode.su/threads/4467-enwik9-preprocessor#post87076
>>>
>>> ...This is doable on a neural network with 10^9 parameters because the
>>> learning rate is only 5 bits per token for 200M tokens.
>>>
>>
>> "Token" refers to the residual byte pairs from your byte pair coding
>> approach?
>>
>
> No, I mean independent semantic units, mostly base words and suffixes. So
> "rotations" would be 3 tokens "rotate", "-tion" and "-s". The tiny 256 word
> dictionary I currently use codes most of the common suffixes as single
> bytes but base words are usually more than one byte. So I either need to
> use a larger dictionary coding tokens as either 1 or 2 bytes and hard code
> stemming rules, or model tokens in the context models. Most of the top
> compressors use the first approach but both improve compression in my
> experiments.
>
> The current release doesn't use either approach and only uses a general
> purpose context model, but even the tiny dictionary helps. I used the
> second approach in ZPAQ because it's simpler. This works by forming one or
> two word contexts by hashing whole words, ignoring case and folding any
> character sequence other than letters into single token boundaries.
>
> -- Matt Mahoney, [email protected]
>

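The word contexts described in the quoted message (hash whole words,
ignore case, fold runs of non-letters into a single boundary) could be
sketched as follows. This is an illustration under stated assumptions, not
ZPAQ's actual code: the function names, the hash constants, and the choice
to treat only ASCII letters as word characters are all hypothetical.

```python
MASK = 0xFFFFFFFF  # keep hashes in 32 bits

def is_letter(c: int) -> bool:
    """ASCII letters only (an assumption; other bytes act as boundaries)."""
    return 65 <= c <= 90 or 97 <= c <= 122

def word_contexts(data: bytes):
    """Return one (ctx1, ctx2) pair per input byte.

    ctx1 hashes the letters of the current word seen so far.
    ctx2 combines ctx1 with the hash of the previous whole word.
    """
    cur = 0          # hash of the current (possibly partial) word
    prev = 0         # hash of the previous complete word
    in_word = False
    out = []
    for c in data:
        if is_letter(c):
            c |= 0x20                                # fold case to lower
            cur = ((cur + c) * 0x9E3779B1) & MASK    # arbitrary odd multiplier
            in_word = True
        elif in_word:
            # First non-letter after a word: one token boundary.
            prev, cur, in_word = cur, 0, False
        # Further non-letters change nothing, so "," and ",   " are equal.
        ctx1 = cur
        ctx2 = ((prev * 0x85EBCA6B) ^ cur) & MASK    # arbitrary odd multiplier
        out.append((ctx1, ctx2))
    return out

a = word_contexts(b"Hello,   WORLD")[-1]
b = word_contexts(b"hello world")[-1]
print(a == b)
```

In a context-mixing compressor, each hash would index a table of bit
predictions, so the same word pair gets the same prediction slot however
it is capitalized or punctuated.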
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tefdd3e588dd95259-Mac3c01f54bc3c4c2ef322412