What are you using for training data? What algorithms? How many parameters?

I have been playing around with a Hutter prize entry. I am on the committee
so I am not eligible for prize money, but this could be a baseline for
others to modify and make improvements. If it is successful, I will license
it under GPL to make sure the source code remains free and open. The Hutter
prize requires decompressing 1 GB of text within 50 thread hours on my
Lenovo laptop (Core i7-1165, 2.8 GHz, 16 GB, Win11 or Ubuntu). All of the
current entries are incremental improvements of CMIX, which is based on
PAQ. I would like to see a parallel development track.

So far I have written some preprocessors and am testing whether they
improve compression with other compressors. These consist of:

1. Article reordering by topic. I had some success reordering by matching
common tokens or substrings, but I wasn't able to improve on the ordering
used in the current winning submission with anything that ran reasonably
quickly.
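As an illustration of the kind of token matching described above (an
assumption about the approach, not the actual method tested), a greedy
nearest-neighbor ordering by shared tokens looks like this:

```python
# Sketch: greedy article reordering by shared-token overlap.
# Place each next article adjacent to the one it shares the most tokens with.
# This is O(n^2) in the number of articles, which is exactly why it is hard
# to make something like this run reasonably quickly on a 1 GB corpus.
def reorder(articles):
    toks = [set(a.split()) for a in articles]   # crude tokenization by whitespace
    remaining = list(range(len(articles)))
    order = [remaining.pop(0)]                  # start from the first article
    while remaining:
        last = toks[order[-1]]
        best = max(remaining, key=lambda i: len(last & toks[i]))
        remaining.remove(best)
        order.append(best)
    return [articles[i] for i in order]
```

The payoff is that similar articles land near each other, so a compressor
with a limited context window sees more nearby matches.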

2. XML decoding the text and header fields (ID, author, timestamp, comment,
etc) into separate streams.
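A minimal sketch of the stream separation idea, assuming the field tags in
the Wikipedia dump format (a real preprocessor must also record enough
structure to invert the transform, which this omits):

```python
import re

# Hypothetical field list; the actual preprocessor's streams may differ.
FIELDS = ('id', 'timestamp', 'username', 'comment', 'text')

def split_streams(xml: str):
    """Collect the contents of each XML field into its own stream.
    Grouping like with like (all timestamps together, all IDs together)
    lets each stream be modeled with a context suited to its contents."""
    streams = {f: [] for f in FIELDS}
    for field in FIELDS:
        pat = rf'<{field}>(.*?)</{field}>'
        for m in re.finditer(pat, xml, re.S):
            streams[field].append(m.group(1))
    return {f: '\n'.join(v) for f, v in streams.items()}
```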

3. Capitalization modeling, coding uppercase letters and words as lower
case and a special symbol.
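A round-trip sketch of that transform, assuming two escape symbols (one for
a single capital, one for an all-caps word) and input that contains neither
byte:

```python
import re

CAP = '\x01'    # next letter was uppercase
WORD = '\x02'   # next run of letters was all uppercase

def encode_caps(text):
    out = []
    for tok in re.findall(r'[A-Za-z]+|[^A-Za-z]+', text):
        if tok.isupper() and tok.isalpha() and len(tok) > 1:
            out.append(WORD + tok.lower())          # all-caps word, one symbol
        else:
            out.append(''.join(CAP + c.lower() if c.isupper() else c
                               for c in tok))       # escape individual capitals
    return ''.join(out)

def decode_caps(text):
    out, i = [], 0
    while i < len(text):
        c = text[i]
        if c == CAP:
            out.append(text[i + 1].upper()); i += 2
        elif c == WORD:
            m = re.match(r'[a-z]+', text[i + 1:])
            out.append(m.group().upper()); i += 1 + m.end()
        else:
            out.append(c); i += 1
    return ''.join(out)
```

The point of the transform is that "The" and "the" now share the same
letter context, at the cost of one low-entropy escape symbol.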

4. Tiny dictionary encoding using byte pair encoding, replacing the least
frequent bytes with codes for the most frequent byte pairs until there is
no further size reduction. This takes about 6 passes when pairing is
restricted to groups of letters or groups of repeated punctuation symbols
used in the XML, HTML, and Wikipedia markup.
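One pass of the byte pair step can be sketched as follows (a simplified
illustration that assigns unused byte values as codes and omits the
restriction to letter groups and punctuation runs):

```python
from collections import Counter

def bpe_pass(data: bytes):
    """One byte-pair-encoding pass: replace the most frequent byte pair
    with a byte value unused in the data. Returns (new_data, (code, pair))
    or (data, None) when no worthwhile replacement exists."""
    free = [b for b in range(256) if b not in set(data)]
    if not free or len(data) < 2:
        return data, None
    pairs = Counter(zip(data, data[1:]))
    (a, b), n = pairs.most_common(1)[0]
    if n < 3:                       # pair must occur enough to pay for itself
        return data, None
    code = free[0]
    out, i = bytearray(), 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(code); i += 2
        else:
            out.append(data[i]); i += 1
    return bytes(out), (code, (a, b))
```

Repeating this until it returns None gives the "no more size reduction"
stopping rule; the replacement rules themselves become the tiny dictionary.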

These all improve compression when the output is fed to zip (LZ77), 7zip
(LZ77 + context modeling), BSC (BWT), PPMD (PPM), and some ZPAQ context
mixing formats. I am also writing my own LZ77 compressor for rapid testing.
I will probably incorporate some LibZPAQ code for context mixing (which I
released public domain), but I also have some ideas for more
memory-efficient indirect context modeling, which currently uses most of
the memory. It works by mapping a bitwise context to an 8-bit state
representing a bit history, and then to a probability. The first table is
very large and sparse and arranged to minimize cache misses, which is the
speed bottleneck.
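To make the two-level mapping concrete, here is a toy version assuming a
simplified bit-history state (4-bit saturating counts of 0s and 1s packed
into one byte); real implementations like ZPAQ use a carefully designed
state table rather than this packing:

```python
TABLE_BITS = 20   # first table: context hash -> state (large and sparse)

class IndirectModel:
    def __init__(self):
        self.state = bytearray(1 << TABLE_BITS)   # context -> 8-bit bit history
        self.prob = [2048] * 256                  # history -> p(1), 12-bit scale

    def predict(self, ctx_hash):
        """Two lookups: context -> history state -> probability of a 1 bit."""
        return self.prob[self.state[ctx_hash & ((1 << TABLE_BITS) - 1)]]

    def update(self, ctx_hash, bit):
        i = ctx_hash & ((1 << TABLE_BITS) - 1)
        n0, n1 = self.state[i] >> 4, self.state[i] & 15
        if bit:
            n1 = min(n1 + 1, 15); n0 //= 2        # discount the opposite count
        else:
            n0 = min(n0 + 1, 15); n1 //= 2
        self.state[i] = (n0 << 4) | n1
        # adapt the probability attached to the new history state
        p = self.prob[self.state[i]]
        self.prob[self.state[i]] = p + ((bit << 12) - p) // 32
```

Because the probability table is indexed by the 256 history states rather
than by the context itself, contexts with similar histories share adaptive
statistics; the large first table is the one that has to be arranged for
cache locality.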

I have some ideas for tokenization and for modeling a semantic network
with an attention mechanism like a transformer's, but one that doesn't
require a GPU to run reasonably fast. It will be a while before I have any
code ready to release.

-- Matt Mahoney, [email protected]

On Wed, Oct 15, 2025, 3:49 AM Basile Starynkevitch <[email protected]>
wrote:

> On Thu, 2025-10-09 at 19:44 -0400, [email protected] wrote:
> > I took a break for 4 years, but I'm researching where I left off at in
> my code I was working on.
> >
> > I'm solving many problems fast and have a clear and full roadmap.
> 
> 
> Is your code open source (and testable on Linux)? Where?
> 
> Thanks
> --
> 
> Basile STARYNKEVITCH                            <[email protected]>
> 8 rue de la Faïencerie
> http://starynkevitch.net/Basile/
> 92340 Bourg-la-Reine                         https://github.com/bstarynk
> France
> https://github.com/RefPerSys/RefPerSys
>                   https://orcid.org/0000-0003-0908-5250

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T6cf3be509c7cd2f2-M3ce8a55f9fdaea5f050521ea
Delivery options: https://agi.topicbox.com/groups/agi/subscription
