On Tue, Feb 3, 2026, 10:15 AM James Bowery <[email protected]> wrote:
> On Mon, Feb 2, 2026 at 8:56 PM Matt Mahoney <[email protected]> wrote:
>
>> I released another update to my Hutter prize entry.
>> https://encode.su/threads/4467-enwik9-preprocessor#post87076
>>
>> ...This is doable on a neural network with 10^9 parameters because the
>> learning rate is only 5 bits per token for 200M tokens.
>
> "Token" refers to the residual byte pairs from your byte pair coding
> approach?

No, I mean independent semantic units, mostly base words and suffixes. So
"rotations" would be 3 tokens: "rotate", "-tion", and "-s".

The tiny 256-word dictionary I currently use codes most of the common
suffixes as single bytes, but base words are usually more than one byte. So
I either need to use a larger dictionary coding tokens as either 1 or 2
bytes and hard-code stemming rules, or model tokens in the context models.
Most of the top compressors use the first approach, but both improve
compression in my experiments. The current release doesn't use either
approach and only uses a general purpose context model, but even the tiny
dictionary helps.

I used the second approach in ZPAQ because it's simpler. It works by
forming one- or two-word contexts by hashing whole words, ignoring case and
folding any character sequence other than letters into single token
boundaries.

-- 
Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tefdd3e588dd95259-M74c789abe929ee3067954e0b
Delivery options: https://agi.topicbox.com/groups/agi/subscription
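The suffix-splitting idea ("rotations" -> "rotate", "-tion", "-s") can be sketched roughly as below. This is a minimal illustration, not the rules from the actual entry: the suffix list, the minimum-base-length check, and the BASE_FIX stemming repair are all assumptions made up for the example.

```python
# Illustrative suffix tokenizer: peel known suffixes off the end of a word,
# then repair the remaining base with a (hypothetical) stemming table.
SUFFIXES = ["s", "tion", "ing", "ed", "ly", "er"]  # assumed list, not the real one
BASE_FIX = {"rota": "rotate"}  # illustrative stemming repair only

def tokenize(word):
    """Split a word into one base token plus zero or more suffix tokens."""
    tokens = []
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            # require the base to stay longer than 2 letters (arbitrary cutoff)
            if word.endswith(suf) and len(word) > len(suf) + 2:
                tokens.insert(0, "-" + suf)   # suffixes keep their order
                word = word[: -len(suf)]
                changed = True
                break
    word = BASE_FIX.get(word, word)           # restore the dictionary base form
    return [word] + tokens
```

For example, tokenize("rotations") yields the three tokens from the message, and tokenize("walking") yields a base plus one suffix.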
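The ZPAQ-style word-context model described above (hash whole words, ignore case, treat any non-letter run as a token boundary) can be sketched as follows. The multiplier constants and 32-bit mask are illustrative assumptions; ZPAQ's actual context model differs in detail.

```python
# Sketch of one- and two-word context hashing: letters are folded to lower
# case and accumulated into a rolling hash; any non-letter character ends
# the current word, at which point both context hashes are emitted.
def word_contexts(text):
    """Return a list of (one_word_hash, two_word_hash) per completed word."""
    h = 0          # hash of the word being accumulated
    prev = 0       # hash of the previous word (0 before the first word)
    contexts = []
    for ch in text:
        if ch.isalpha():
            # case-insensitive rolling hash (constants are illustrative)
            h = (h * 271 + ord(ch.lower())) & 0xFFFFFFFF
        elif h:
            # non-letter run ends the word: emit 1- and 2-word contexts
            contexts.append((h, (prev * 617 + h) & 0xFFFFFFFF))
            prev, h = h, 0
    if h:  # flush a trailing word with no terminator
        contexts.append((h, (prev * 617 + h) & 0xFFFFFFFF))
    return contexts
```

Because case is folded and non-letters only act as boundaries, "Hello, World!" and "hello world" produce identical context sequences.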
