Here are some results I have for enwik9 on my Lenovo laptop (Core
i7-1165G7, 2.80 GHz, 16 GB, Win11). The table shows size and
compression time, with CPU time in parentheses when more than one core
is used. The second number is decompression time in seconds. All of
the zpaq models are multi-threaded by dividing up the input into 64 MB
blocks and compressing in separate threads. The -m510 option is a
single 1 GB block and 1 thread. This is a large context mixing
algorithm which takes an hour to compress and an hour to decompress.
That is still 50 times faster than the Hutter prize limit, but nowhere
close to the prize threshold, which would be 1% better than
110,351,665 by the current leader, fx2_cmix. There are two better
programs (nncp at 107,261,318 and cmix at 108,244,767), but they don't
meet the CPU time and memory requirements.
https://mattmahoney.net/dc/text.html

             enwik9 size, compress time (CPU time), decompress time in seconds
zip -9       322592218 74
zip -6       323743105 62 9
7zip         230135777 207 (994) 13
zpaq -m1     314117967 32 (78)
zpaq -m2     268460216 237 (472)
zpaq -m3     189875989 224 (442)
zpaq -m4     179455248 720 (1387)
zpaq64 -m5   168590740 1000 (7114)
zpaq64 -m510 153626112 3971
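The block-parallel scheme is easy to sketch. This is a minimal
illustration only: the per-block stage here is a placeholder that just
records block sizes, where zpaq would run its model on each block in
its own thread.

```cpp
#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Split the input into fixed-size blocks and process each block in its
// own thread.  The per-block work is a placeholder (it records the
// block size); a real compressor would model each block independently.
std::vector<size_t> compress_blocks(const std::string& in,
                                    size_t block = size_t(64) << 20) {
  size_t nblocks = (in.size() + block - 1) / block;
  std::vector<size_t> sizes(nblocks);
  std::vector<std::thread> pool;
  for (size_t b = 0; b < nblocks; ++b)
    pool.emplace_back([&, b] {
      size_t begin = b * block;
      size_t end = std::min(in.size(), begin + block);
      sizes[b] = end - begin;  // placeholder for the compressed size
    });
  for (std::thread& t : pool) t.join();
  return sizes;
}
```

Independent blocks cost some compression (matches and statistics cannot
cross block boundaries), which is why the single-block -m510 model
compresses smaller than -m5.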

Here is an older version (wp version 31) of my current compressor on
enwik8 (100 MB) and enwik9 (1 GB), showing size and compression time
in seconds. Using all stages compresses faster than LZ77 alone because
the earlier stages reduce the input to the LZ77 stage.

wp31       enwik8       enwik9
  l        37165333 42  281990999 895
  al       36568260 45
  abl      36424390 41
  abdl     35835346 38
  abcdl    35366083 45  274155976 537 25

The command options are:

a = article reordering.
b = basic XML decoding.
c = capitalization modeling.
d = dictionary encoding.
l = LZ77 byte aligned compression.

The preprocessing is fast and it reduces the input to the LZ77
compressor, and is thus faster than using LZ77 alone. A more recent
version improved the compression time for enwik9 to 461 seconds:

wp38 l enwik8  37066549 25
wp38 l enwik9 281282986 379
wp38 abcdl enwik8  34735186 45
wp38 abcdl enwik9 268485627 461

Decompression for enwik9 is 10 seconds for LZ77 alone and 25 seconds
for all stages. The LZ77 format is a simple one byte length code of
1..64 for literals and 1..192 for matches, followed by a 1 to 4 byte
match offset. Literals are not compressed. To find matches, I use
suffix sorting and search forward and backward from the current point
in the suffix array for the nearest suffix pointing to an earlier
position in the input, taking the longer of the two matches. If coding
a match doesn't save space, then I code a literal and move on to the
next suffix. I spent most of my effort optimizing the suffix sorting,
which takes about 1000 seconds using std::stable_sort(), or 1500
seconds using std::sort() or qsort() with memcmp(). I wrote a radix
sort that was faster, but I abandoned it after developing my own group
sort algorithm, which looks like this:

#include <cstdint>
#include <cassert>
#include <utility>

// Suffix sort s into suffix array [begin, end), where the prior level bytes
// of s are equal. The elements of the suffix array are
// unique and point to s. At level 0, sa[i] = i for all elements.
template<class T>
void sa_sort(const unsigned char* s, T begin, T end, int level) {
  if (end<=begin+2) return;  // fewer than 3 elements: no sort needed
  if (level>192) return;     // longest codable match

  // Fill the high 32 bits of each suffix pointer with the 4 bytes
  // that it points to.
  if (level%4==0) {
    for (T p=begin; p<end; ++p) {
      const unsigned char* s4=s+(*p&0xffffffff);
      uint32_t cxt=s4[0]<<24|s4[1]<<16|s4[2]<<8|s4[3];
      *p=(*p&0xffffffff)|uint64_t(cxt)<<32;
    }
  }

  // group sort: gather elements whose current byte matches the group
  // leader's, then sort each group at the next level
  for (T p=begin; p<end;) {
    T e=p+1;
    for (T q=e; q<end; ++q)
      if (((*p^*q)&255ull<<(56-level%4*8))==0) std::swap(*e++, *q);  // byte matches?
    assert(e>p);
    if (e-p>=3) sa_sort(s+1, p, e, level+1);
    p=e;
  }
}

where T is a pointer into the suffix array of 64 bit unsigned
integers. The low 32 bits are the index into the string s to be
sorted, and the upper 32 bits are the first 4 bytes that it points to.
These are filled every 4th level of recursion to reduce the number of
random memory accesses, which are the bottleneck in any algorithm that
uses lots of memory. The array is 8 GB for enwik9. To sort the range
from begin to end, we examine the next element, scan the rest of the
range for pointers to the same byte value, and move them to the front.
Then we sort that group at the next recursion level. This works
because, unlike suffix sorting for a Burrows-Wheeler transform, the
alphabet order is irrelevant for just finding the longest match. If
there are fewer than 3 elements, sorting is not needed. Also, the
recursion level is limited to the longest match that can be coded,
192.
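For concreteness, the byte aligned token format described earlier can
be decoded like this. This is only a sketch: the split of the code
byte (0..63 for literal runs of 1..64, 64..255 for match lengths of
1..192) and the little-endian varint offset are illustrative choices,
not necessarily the exact encoding of the 1 to 4 offset bytes in wp.

```cpp
#include <cstdint>
#include <cstddef>
#include <string>

// Sketch of a decoder for the byte aligned LZ77 stream. Assumed layout
// (illustrative): a code byte 0..63 means a literal run of code+1
// bytes follows uncompressed; 64..255 means a match of length code-63
// (1..192), followed by the match offset as a little-endian varint of
// 1 to 4 bytes (7 data bits per byte, high bit = continuation).
std::string lz77_decode(const uint8_t* in, size_t n) {
  std::string out;
  size_t i = 0;
  while (i < n) {
    unsigned c = in[i++];
    if (c < 64) {                       // literal run, c+1 bytes
      out.append(reinterpret_cast<const char*>(in + i), c + 1);
      i += c + 1;
    } else {                            // match, length c-63
      unsigned len = c - 63;
      uint32_t off = 0;
      int shift = 0;
      uint8_t b;
      do {                              // 1..4 offset bytes
        b = in[i++];
        off |= uint32_t(b & 0x7f) << shift;
        shift += 7;
      } while (b & 0x80);
      size_t start = out.size() - off;  // copy may overlap the output
      for (unsigned k = 0; k < len; ++k) out.push_back(out[start + k]);
    }
  }
  return out;
}
```

The byte-by-byte match copy handles the overlapping case, where the
offset is smaller than the match length and the match repeats its own
output.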

Of course all of this is a distraction because I won't be using LZ77
in the final version. This is just for testing preprocessing
algorithms, like a tokenizer for semantic and syntactic models I have
yet to write.

In case you are curious, here is the dictionary created by 6 passes of
byte pair encoding with the command "abcd", showing the code, the
number of tokens, and the value. Tokens are restricted to letters,
"&", ";", and repeated punctuation characters. The codes A, B, C are
used for capitalization modeling (string, binary, next letter).

0 11921214 "
"
1 121844970 " "
2 530859 "#"
3 970827 "%"
4 486859 "&"
5 916351 "&amp;"
6 1149961 "&gt;"
7 1144472 "&lt;"
8 1874198 "&quot;"
9 972354 "'"
10 2488470 "''"
11 958496 "'''"
12 2521806 "("
13 2523537 ")"
14 2490470 "*"
15 7978922 ","
16 3679133 "-"
17 9578832 "."
18 1759848 "/"
19 415628 "//"
20 5754062 "0"
21 6037325 "1"
22 4721386 "2"
23 2736971 "3"
24 2579245 "4"
25 2596182 "5"
26 2372995 "6"
27 2005136 "7"
28 2443325 "8"
29 3239688 "9"
30 3293044 ":"
31 1114289 ";"
32 1290011 "="
33 1359774 "=="
34 1909184 "A"
35 2643909 "B"
36 35435937 "C"
37 423229 "["
38 9895488 "[["
39 423068 "]"
40 9896578 "]]"
41 494564 "_"
42 6337670 "a"
43 510528 "ab"
44 1239657 "ac"
45 970760 "ad"
46 1029485 "age"
47 526600 "ai"
48 3234112 "al"
49 1010016 "am"
50 3423346 "an"
51 3307299 "and"
52 687181 "ap"
53 1967998 "ar"
54 1196668 "are"
55 1943023 "as"
56 2032760 "at"
57 460948 "atio"
58 1551694 "b"
59 1117759 "ba"
60 2327268 "be"
61 771346 "bi"
62 816773 "bl"
63 986911 "bo"
64 894688 "br"
65 838299 "bu"
66 830598 "by"
67 2479230 "c"
68 2365778 "ca"
69 422182 "category"
70 2292423 "ce"
71 426076 "census"
72 2151588 "ch"
73 1328554 "ci"
74 540743 "ck"
75 880189 "cl"
76 2364568 "co"
77 1168046 "com"
78 419013 "coun"
79 597209 "cr"
80 1115051 "ct"
81 5638844 "d"
82 801049 "da"
83 2531462 "de"
84 641041 "der"
85 2031112 "di"
86 675722 "do"
87 7209729 "e"
88 1253557 "ea"
89 682224 "ec"
90 2290660 "ed"
91 830792 "el"
92 745627 "em"
93 1772293 "en"
94 825141 "ent"
95 2602737 "er"
96 2559169 "es"
97 621367 "et"
98 497845 "ev"
99 676929 "ex"
100 2537656 "f"
101 854098 "fa"
102 1106322 "fe"
103 1398530 "fi"
104 495359 "fo"
105 1737219 "for"
106 672579 "fr"
107 666211 "from"
108 2429840 "g"
109 860301 "ga"
110 1573976 "ge"
111 914737 "gh"
112 705125 "gi"
113 615387 "go"
114 1075888 "gr"
115 2550495 "h"
116 1626422 "ha"
117 1075365 "he"
118 741868 "her"
119 859363 "hi"
120 651537 "his"
121 1417299 "ho"
122 657811 "ht"
123 3089841 "i"
124 912901 "ia"
125 1936695 "ic"
126 672179 "id"
127 662988 "ie"
128 522251 "ig"
129 865653 "il"
130 732039 "im"
131 5753220 "in"
132 1329083 "ing"
133 787761 "ion"
134 3234763 "is"
135 2239717 "it"
136 2474959 "j"
137 3369403 "k"
138 940510 "ke"
139 732162 "ki"
140 4317422 "l"
141 1996103 "la"
142 469636 "land"
143 722481 "ld"
144 2266518 "le"
145 537623 "les"
146 2667515 "li"
147 1900339 "ll"
148 1543979 "lo"
149 1231844 "ly"
150 4067532 "m"
151 2690664 "ma"
152 713025 "man"
153 503599 "md"
154 2527240 "me"
155 432550 "ment"
156 1808968 "mi"
157 1694419 "mo"
158 6450068 "n"
159 1942882 "na"
160 677610 "nc"
161 1144410 "nd"
162 2388254 "ne"
163 2541916 "ng"
164 1262135 "ni"
165 1776463 "no"
166 1121992 "ns"
167 1629690 "nt"
168 4161887 "o"
169 498398 "od"
170 5213601 "of"
171 822113 "ol"
172 2628237 "on"
173 673430 "op"
174 2068485 "or"
175 529638 "os"
176 730539 "ot"
177 1297161 "ou"
178 2557799 "p"
179 1557855 "pa"
180 1518546 "pe"
181 469852 "ph"
182 699336 "pi"
183 1051542 "pl"
184 1700105 "po"
185 878257 "pr"
186 800321 "pro"
187 778880 "pu"
188 95207 "q"
189 643156 "qu"
190 5661716 "r"
191 1798833 "ra"
192 684235 "rd"
193 4372533 "re"
194 2013961 "ri"
195 1273981 "ro"
196 431453 "rr"
197 1043914 "rs"
198 1228329 "rt"
199 478320 "ry"
200 10171663 "s"
201 795985 "sa"
202 751138 "sc"
203 2297350 "se"
204 1042029 "sh"
205 1887637 "si"
206 1608421 "so"
207 910282 "sp"
208 836136 "ss"
209 3832328 "st"
210 939254 "su"
211 569484 "sup"
212 5860644 "t"
213 1432969 "ta"
214 1733646 "te"
215 610201 "ted"
216 1534691 "ter"
217 1944525 "th"
218 621742 "that"
219 8682988 "the"
220 547563 "ther"
221 2142005 "ti"
222 1380450 "tion"
223 3920552 "to"
224 429199 "tp"
225 1369850 "tr"
226 532323 "ts"
227 555993 "tu"
228 984030 "ty"
229 4055499 "u"
230 916736 "ul"
231 714434 "um"
232 1790939 "un"
233 1378362 "ur"
234 1420865 "us"
235 812229 "use"
236 725266 "ut"
237 758481 "v"
238 686680 "va"
239 1982437 "ve"
240 1379217 "ver"
241 1532799 "vi"
242 3107800 "w"
243 986844 "wa"
244 866664 "was"
245 1311170 "we"
246 1207604 "wh"
247 833994 "wi"
248 809371 "with"
249 865580 "wo"
250 992094 "x"
251 6381496 "y"
252 1306032 "z"
253 901242 "{"
254 4907106 "|"
255 901320 "}"
Output = 636650052
19: "//" (415628) -> "w"+"w" (663347)
Size 636650052 -> 636731859
Dictionary decoding 636650052 -> 979768178
979768178 -> 636650052 -> 979768178 Decode test passed
After d: size 636650052
Writing 636650052 bytes
230.02 sec.
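The merge step behind such a dictionary can be sketched as follows.
This is a toy version for illustration only: it drops the token
restrictions above and simply merges the most frequent adjacent pair
on each pass.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy byte pair encoder.  Each pass finds the most frequent adjacent
// token pair, assigns it a new token id, records the merged string in
// the dictionary, and rewrites the token stream.
struct BPE {
  std::vector<std::string> dict;  // token id -> string
  std::vector<int> toks;          // current token stream

  explicit BPE(const std::string& s) {
    for (int c = 0; c < 256; ++c) dict.push_back(std::string(1, char(c)));
    for (unsigned char c : s) toks.push_back(c);
  }

  // One pass: merge the most frequent adjacent pair; returns its count,
  // or 0 if no pair occurs at least twice.
  int pass() {
    std::map<std::pair<int,int>, int> freq;
    for (size_t i = 0; i + 1 < toks.size(); ++i)
      ++freq[{toks[i], toks[i+1]}];
    std::pair<int,int> best{-1, -1};
    int n = 0;
    for (const auto& f : freq)
      if (f.second > n) { n = f.second; best = f.first; }
    if (n < 2) return 0;
    int id = dict.size();
    dict.push_back(dict[best.first] + dict[best.second]);
    std::vector<int> out;
    for (size_t i = 0; i < toks.size(); ++i) {
      if (i + 1 < toks.size() && toks[i] == best.first
          && toks[i+1] == best.second) {
        out.push_back(id);
        ++i;  // consume both tokens of the merged pair
      } else out.push_back(toks[i]);
    }
    toks.swap(out);
    return n;
  }
};
```

Running 6 passes over enwik9 with the token restrictions is what
produces the 256-entry table above, with the pair counts shown next to
each code.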

On Fri, Oct 17, 2025 at 1:36 PM <[email protected]> wrote:
>
> Yes it will be open source. I will share a cleaner and better version soon.
>
> Enwik8/9 dataset, but my plan is to use the first 100,000 bytes for faster 
> testing, until it matches Byron Knoll's scores for it, which is about 
> 20,000 or less basically. I use the pre-processed version for now, which 
> feeds in actually about 65,000 bytes. With my plain algorithm of 15 order 
> matches (training on letters it pops out from lossless decompression), 
> without recency priming, I score ~24,950, but ~25,052 if I use a 
> self-amplifier instead of controlling which are mixed first (because that 
> could hurt parallelism). Also, with only a tiny amount (3 possible 
> arrangements and about 3 others helped only slightly, half help big 
> therefore) of the entire holes and delay mechanism, I already score 24,252, 
> and there was 3 different keys to this I realized this month that made it 
> work better - and they are not very obvious if you were to go and try this 
> without heavy/smart thinking and experimenting, and if I add all of it I might 
> score 22,500. If I add back my priming, it's about 500 down basically also, 
> so maybe 22,000. Related words should in theory bring it down to 20,000 or 
> something. And 1 other mechanism might be Mirroring the abilities using a 
> command matches with an example and then half the unseen example, because you 
> can ask someone to say a very related word to any word, but that's impossible 
> with related matches even with holes and delays. Of course now we know 
> thinking too (generate data on demand) allows better prediction, and tool use.
>
> The last 2 days were good, I've learned a lot again fast, I think I could 
> possibly have a plan that might actually run on GPU. No backprop. It turns 
> out, it almost seems as if we don't need to use 10,000 gpus to train gpt3/4 
> for a month, I think the network can be "constructed" first on CPU and 
> literally with plain yes plain strings hierarchically (not trie/tree) in 1 
> day, and then ran on GPU for inference using the special algorithm I have in 
> mind so that these exact string also match non-exact inputs. Possibly maybe 
> even all on GPU or all on CPU, but either way I believe it costs "5 dollars" 
> - and not 5 million, to make gpt3/4.
>
> Hierarchy structure is what allows the half activated memory to be continued 
> to be matched once you get to the end of a long sentence with a injected 
> middle part like "the cat that was on the mat on the bed in the room is 
> sleeping", so now the node "the cat is sleeping" is fully matched. Of course 
> with delay.
>
> Keep in mind, delays in the many matches looks for pattern too, if in 1 match 
> the items are all stretched or rotated, it's still "a cat", because the 
> delays all cancel out mathematically (or mostly, with some remaining penalty).
>
> These numbers below are probably slightly off that I made this month if code 
> is lagging in smaller data due to optimizations for bigger data or if simply 
> possibly does better with more data, but it's still helpful to look at!:
> Byron's is the bottom 2 scores (he recently has a better score yes, outdated, 
> but it's usable for now):
> if i score this i should be able to score this:
> 25,000 > 18,497,970
> 24,000 > 17,758,051
> 23,000 > 17,018,132
> 22,000 > 16,278,214
> 21,000 > 15,538,295
> 20,054 > 14,838,332   (0.2600811808118081 % down. Byron's 2 scores for 
> 100,000 bytes and 100,000,000).
>
> Byron's scores as 10x more data is fed in; note I already inflated the upper 
> by 0s at the end for comparison only
> 20,054,000   ## for 100,000 bytes, inflated only for show
> 17,638,800   (0.1204348259698813 % drop)
> 16,514,210   (0.0637566047576933 % drop)
> 14,838,332   (0.1014809669975131 % drop)



-- 
-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T6cf3be509c7cd2f2-M4ff332ec5d5fee02f08a229a
Delivery options: https://agi.topicbox.com/groups/agi/subscription
