What are you using for training data? What algorithms? How many parameters?
I have been playing around with a Hutter prize entry. I am on the committee so I am not eligible for prize money, but this could be a baseline for others to modify and make improvements. If it is successful, I will license it under GPL to make sure the source code remains free and open.

The Hutter prize limits decompression of the 1 GB text to 50 thread hours on my Lenovo laptop (core i7-1165, 2.8 GHz, 16 GB, Win11 or Ubuntu). All of the current entries are incremental improvements of CMIX, which is based on PAQ. I would like to see a parallel development track.

So far I have written some preprocessors and am testing whether they improve compression with other compressors. These consist of:

1. Article reordering by topic. I had some success reordering by matching common tokens or substrings, but I wasn't able to improve on the ordering used in the current winning submission with anything that ran reasonably quickly.
2. XML decoding of the text and header fields (ID, author, timestamp, comment, etc.) into separate streams.
3. Capitalization modeling, coding uppercase letters and words as lower case plus a special symbol.
4. Tiny dictionary encoding using byte pair encoding, replacing the least frequent bytes with codes for the most frequent byte pairs until there is no more size reduction. This takes about 6 passes when the pairing is restricted to groups of letters or groups of repeated punctuation symbols that are used in the XML, HTML, and Wikipedia markup.

These all improve compression when the output is fed to zip (LZ77), 7zip (LZ77 + context modeling), BSC (BWT), PPMD (PPM), and some ZPAQ context mixing formats.

I am also writing my own LZ77 compressor for rapid testing. I will probably incorporate some LibZPAQ code for context mixing (which I released public domain), but I also have some ideas for more memory efficient indirect context modeling. Indirect context models are what use up most of the memory currently.
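To make the byte pair encoding step (item 4) concrete, here is a minimal Python sketch. It is an illustration, not my preprocessor. It makes several simplifying assumptions: it only uses byte values that never occur in the input as codes (so no escape codes are needed, unlike replacing the least frequent bytes), it merges one pair per iteration rather than working in passes, and it charges an assumed 3 bytes of table cost per new code. The names bpe_encode, bpe_decode, and max_codes are hypothetical.

```python
from collections import Counter
import string

LETTER = -1  # kind tag shared by all letters and letter-group codes
PUNCT = set(string.punctuation.encode())

def initial_kinds():
    """Map each byte to a 'kind': LETTER, or itself for punctuation."""
    kinds = {}
    for b in range(256):
        if chr(b) in string.ascii_letters:
            kinds[b] = LETTER
        elif b in PUNCT:
            kinds[b] = b  # runs of the same symbol share a kind
    return kinds

def replace_pair(seq, a, b, code):
    """Replace non-overlapping occurrences of (a, b), left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(code)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_encode(data: bytes, max_codes: int = 64):
    """Return (encoded bytes, table mapping code -> (left, right))."""
    seq = list(data)
    present = set(seq)
    unused = [b for b in range(256) if b not in present]
    kinds = initial_kinds()
    table = {}
    while unused and len(table) < max_codes:
        # Count only pairs of letters, or pairs within a run of one symbol.
        pairs = Counter(
            (a, b) for a, b in zip(seq, seq[1:])
            if a in kinds and kinds[a] == kinds.get(b)
        )
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        code = unused[-1]
        new_seq = replace_pair(seq, a, b, code)
        # Assumed cost model: each table entry costs 3 bytes.
        if len(seq) - len(new_seq) <= 3:
            break  # no more size reduction
        unused.pop()
        table[code] = (a, b)
        kinds[code] = kinds[a]  # the new code joins the same group
        seq = new_seq
    return bytes(seq), table

def bpe_decode(encoded: bytes, table) -> bytes:
    """Expand codes back to the original bytes using an explicit stack."""
    out = bytearray()
    stack = list(reversed(encoded))
    while stack:
        b = stack.pop()
        if b in table:
            left, right = table[b]
            stack.append(right)
            stack.append(left)
        else:
            out.append(b)
    return bytes(out)
```

For example, calling bpe_encode on a short string with repeated words and runs of "=" should merge "th", "==", and then the "th" code followed by "e", and bpe_decode with the returned table restores the input exactly.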
These work by mapping a bitwise context to an 8 bit state representing a bit history, and then mapping that state to a probability. The first table is very large and sparse and is arranged to minimize cache misses, which are the speed bottleneck.

I have some ideas for tokenization and for modeling a semantic network with an attention mechanism like in a transformer, but one that doesn't require a GPU to run reasonably fast. But it will be a while before I have any code ready to release.

-- Matt Mahoney, [email protected]

On Wed, Oct 15, 2025, 3:49 AM Basile Starynkevitch <[email protected]> wrote:

> On Thu, 2025-10-09 at 19:44 -0400, [email protected] wrote:
> > I took a break for 4 years, but I'm researching where I left off at in
> > my code I was working on.
> >
> > I'm solving many problems fast and have a clear and full roadmap.
>
> Is your code open source (and testable on Linux)? Where?
>
> Thanks
> --
> Basile STARYNKEVITCH <[email protected]>
> 8 rue de la Faïencerie
> http://starynkevitch.net/Basile/
> 92340 Bourg-la-Reine https://github.com/bstarynk
> France
> https://github.com/RefPerSys/RefPerSys
> https://orcid.org/0000-0003-0908-5250

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T6cf3be509c7cd2f2-M3ce8a55f9fdaea5f050521ea
