On Tue, Feb 3, 2026, 10:15 AM James Bowery <[email protected]> wrote:

>
> On Mon, Feb 2, 2026 at 8:56 PM Matt Mahoney <[email protected]>
> wrote:
>
>> I released another update to my Hutter prize entry.
>> https://encode.su/threads/4467-enwik9-preprocessor#post87076
>>
>> ...This is doable on a neural network with 10^9 parameters because the
>> learning rate is only 5 bits per token for 200M tokens.
>>
>
> "Token" refers to the residual byte pairs from your byte pair coding
> approach?
>

No, I mean independent semantic units, mostly base words and suffixes. So
"rotations" would be 3 tokens: "rotate", "-tion", and "-s". The tiny
256-word dictionary I currently use codes most of the common suffixes as
single bytes, but base words are usually more than one byte. So I either
need to use a larger dictionary, coding tokens as 1 or 2 bytes with
hard-coded stemming rules, or else model tokens in the context models. Most
of the top compressors use the first approach, but both improve compression
in my experiments.
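
As a rough sketch of the suffix-splitting idea, here is a toy tokenizer in Python. The suffix list, the minimum-stem-length check, and the longest-match order are all hypothetical choices for illustration, not the actual dictionary; a real implementation would also apply the hard-coded stemming rules mentioned above to restore the base word (e.g. map the residual stem "rota" back to "rotate").

```python
# Toy suffix tokenizer -- illustrative only. The suffix list and the
# length threshold below are hypothetical, not the real dictionary.
SUFFIXES = ["tion", "ing", "ed", "ly", "s"]  # checked longest-first

def tokenize(word):
    """Split a word into a residual stem plus suffix tokens."""
    tokens = []
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:
            # Require a stem of at least 3 letters to avoid over-splitting.
            if len(word) > len(suf) + 2 and word.endswith(suf):
                tokens.insert(0, "-" + suf)   # suffixes become "-xxx" tokens
                word = word[:-len(suf)]
                stripped = True
                break
    # A real implementation would apply stemming rules here,
    # e.g. mapping the stem "rota" back to the base word "rotate".
    tokens.insert(0, word)
    return tokens

print(tokenize("rotations"))  # -> ['rota', '-tion', '-s']
```

The point is only that a word decomposes into one base token plus zero or more single-byte suffix codes; the compression win comes from the dictionary assigning short codes to the frequent suffixes.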

The current release uses neither approach and relies only on a general
purpose context model, but even the tiny dictionary helps. I used the
second approach in ZPAQ because it's simpler. It works by forming one- or
two-word contexts by hashing whole words, ignoring case and folding any
sequence of non-letter characters into a single token boundary.
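
A minimal sketch of that word-context hashing, assuming a rolling multiplicative hash (the multiplier, mask, and hash combination below are hypothetical, not ZPAQ's actual constants):

```python
# Sketch of whole-word context hashing as described above.
# MULT and the mixing scheme are illustrative, not ZPAQ's real constants.
MULT = 0x9E3779B1
MASK = 0xFFFFFFFF

def word_contexts(data: bytes):
    """Yield (one_word_ctx, two_word_ctx) hashes after each input byte."""
    h = 0       # rolling hash of the word in progress
    prev = 0    # hash of the previous complete word
    for c in data:
        if 65 <= c <= 90:          # fold upper case to lower (ignore case)
            c += 32
        if 97 <= c <= 122:         # a letter extends the current word hash
            h = (h * MULT + c) & MASK
        elif h:                    # first non-letter after a word: one boundary,
            prev, h = h, 0         # so runs of non-letters flush only once
        yield h, (h ^ (prev * MULT)) & MASK  # one- and two-word contexts

for ctx1, ctx2 in word_contexts(b"Hello, WORLD"):
    print(hex(ctx1), hex(ctx2))
```

Because case is folded and non-letter runs collapse to a single boundary, "Hello," and "hello " produce identical word hashes, which is what lets the word contexts generalize across punctuation and capitalization.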

-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tefdd3e588dd95259-M74c789abe929ee3067954e0b
Delivery options: https://agi.topicbox.com/groups/agi/subscription
