[agi] Re: Why is AGI so hard?

Matt Mahoney Sun, 22 Mar 2026 15:50:56 -0700

I made a bit of progress on my Hutter prize entry. I added token
parsing and now compress enwik9 to 136 MB in 50 minutes using 5 GB of
memory. It is still off the Pareto frontier set in 2009 by Dmitry
Shkarin's durilca'kingsize, at 127 MB in 25-30 minutes using 13 GB of
memory on a 3.8 GHz Q9650. Like all the top entrants, durilca'kingsize
uses a dictionary built from enwik9 and organized to group related
words together to encode the text before compression, although unlike
the others it uses PPM rather than context mixing. Many of the others
combine PPM with context mixing to save memory.


I haven't implemented such a dictionary yet, which puts my entry in
the top ranked position among programs that don't use one, behind 15
that do. What I did was mix a bunch of token based contexts, where a
token is either a group of letters or a single non-letter, possibly
repeated. It uses 3 ICM-ISSE chains, a 4 head match model, and a chain
of 4 mixers in descending context order, for a total of 23 components.
I could add more to improve compression at the cost of time and
memory, so I had to stop with something reasonable.
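
The token rule described above can be sketched in C++ roughly as
follows. This is only an illustration, not the code from the entry:
the names is_letter and tokenize are hypothetical, "letter" is assumed
to mean ASCII A-Z and a-z, and "possibly repeated" is assumed to mean
a run of the same non-letter byte.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: only ASCII letters count as letters.
static bool is_letter(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Split text into tokens. A token is either a run of letters or a run
// of one repeated non-letter byte (for example "   " or "==").
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t j = i + 1;
        if (is_letter(static_cast<unsigned char>(text[i]))) {
            // Extend over the whole group of letters.
            while (j < text.size() &&
                   is_letter(static_cast<unsigned char>(text[j]))) ++j;
        } else {
            // Extend only while the same non-letter byte repeats.
            while (j < text.size() && text[j] == text[i]) ++j;
        }
        tokens.push_back(text.substr(i, j - i));
        i = j;
    }
    return tokens;
}
```

Under these assumptions, tokenize("Hello,   world!!") yields the five
tokens "Hello", ",", "   ", "world", and "!!".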

The 3 ICM-ISSE chains are byte aligned order 1-2-3-4-5, token aligned
order 0-1-2-3-4-5-6-7, and sparse token pair order with gaps of
6-5-4-3-2-1 between the current and previous token. I modified the
match model to find up to 4 matches of length 6 or more and predict
the next bit with probability 1/length. (Adding more than 4 doesn't
help.) The 4 mixers are chained with contexts of 2 tokens, 2 bytes, 1
byte, and 0 bytes, including the current partial byte or token, to select
the mixing weights for all prior components including the earlier
mixers. There is one constant component used to bias the mixers.
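
The post does not give the mixer equations. The sketch below shows one
common form of context-selected logistic mixing, of the kind used in
paq and zpaq style compressors, written in floating point for clarity.
The class name Mixer, the learning rate, and the initial weights are
assumptions made for illustration, and the actual entry may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// stretch(p) = ln(p / (1 - p)); squash is its inverse.
inline double stretch(double p) { return std::log(p / (1.0 - p)); }
inline double squash(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Illustrative mixer: one weight vector per context, selected by ctx.
class Mixer {
public:
    Mixer(std::size_t n_inputs, std::size_t n_contexts, double rate = 0.02)
        : n_(n_inputs), rate_(rate),
          w_(n_inputs * n_contexts, 1.0 / n_inputs),  // assumed start
          x_(n_inputs, 0.0) {}

    // Combine input probabilities p[i] (each strictly between 0 and 1)
    // using the weight set chosen by ctx. Returns P(next bit = 1).
    double predict(const std::vector<double>& p, std::size_t ctx) {
        ctx_ = ctx;
        double dot = 0.0;
        for (std::size_t i = 0; i < n_; ++i) {
            x_[i] = stretch(p[i]);
            dot += w_[ctx_ * n_ + i] * x_[i];
        }
        pr_ = squash(dot);
        return pr_;
    }

    // Once the bit is known, adjust only the selected weight set:
    // w[i] += rate * (bit - prediction) * stretch(p[i]).
    void update(int bit) {
        const double err = bit - pr_;
        for (std::size_t i = 0; i < n_; ++i)
            w_[ctx_ * n_ + i] += rate_ * err * x_[i];
    }

private:
    std::size_t n_;
    double rate_;
    std::vector<double> w_, x_;
    std::size_t ctx_ = 0;
    double pr_ = 0.5;
};
```

In a chain like the one described above, each mixer's output would be
appended to the input vector of the next mixer, with ctx computed from
2 tokens, 2 bytes, 1 byte, or nothing. A constant input with a fixed
probability other than 0.5 (whose stretch is nonzero) would act as the
bias term.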

The ICMs and ISSEs all share a common 4 GB hash table that maps a
context hash to a bit history (an 8 bit state, as in zpaq), which in
turn maps to a prediction. The order 2 token mixer needs about 1 GB to
store the mixer weights. I did some experiments with normalizing the
mixer weights to add to 1 in each context, and with modifying the bit
history state table to favor stationary or nonstationary sources, but
neither improved on the zpaq model. The match model uses a 0.5 GB hash table to
find matches. All the processing is done internally on strings up to 1
GB long with no temporary files.
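
As a rough illustration of how a match model can use a hash table to
find matches, here is a minimal single-head sketch. It is not the code
from the entry, which uses 4 heads and a 0.5 GB table. The class name,
the hash function, the table size, and the verification cap are all
assumptions; only the minimum match length of 6 comes from the post.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMinLen = 6;      // minimum match length (from the post)
constexpr std::size_t kMaxLen = 65535;  // assumed cap on verification work

class MatchModel {
public:
    explicit MatchModel(unsigned table_bits = 16)
        : table_(std::size_t(1) << table_bits, 0),
          mask_((std::size_t(1) << table_bits) - 1) {}

    // Append one byte of history and refresh the match state.
    void update(unsigned char c) {
        // Extend the current match if it predicted c, otherwise drop it.
        if (len_ > 0 && buf_[ptr_] == c) { ++len_; ++ptr_; }
        else len_ = 0;
        buf_.push_back(c);
        if (buf_.size() < kMinLen) return;

        const std::size_t slot = hash_last();
        if (len_ == 0) {
            // Candidate: position just after an earlier occurrence of a
            // context with the same hash (0 means the slot is empty).
            const std::size_t cand = table_[slot];
            std::size_t n = 0;  // verified length, counted backwards
            while (n < cand && n < kMaxLen &&
                   buf_[cand - 1 - n] == buf_[buf_.size() - 1 - n]) ++n;
            if (n >= kMinLen) { len_ = n; ptr_ = cand; }
        }
        table_[slot] = buf_.size();  // remember where this context ended
    }

    std::size_t length() const { return len_; }

    // Byte that followed the earlier occurrence, or -1 if no match.
    int predicted_byte() const { return len_ > 0 ? int(buf_[ptr_]) : -1; }

private:
    // Hash of the last kMinLen bytes (an arbitrary multiplicative hash).
    std::size_t hash_last() const {
        std::uint32_t h = 0;
        for (std::size_t i = buf_.size() - kMinLen; i < buf_.size(); ++i)
            h = h * 773u + buf_[i] + 1u;
        return (h ^ (h >> 16)) & mask_;
    }

    std::vector<unsigned char> buf_;
    std::vector<std::size_t> table_;
    std::size_t mask_;
    std::size_t ptr_ = 0;
    std::size_t len_ = 0;
};
```

After feeding text such as "abcdefgh_abcdef" one byte at a time, the
model is expected to report a match of length 6 and predict 'g',
unless a hash collision has overwritten the slot in the meantime. A
bitwise predictor would then compare the bits of predicted_byte() with
the partial byte seen so far and weight its prediction by length().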

I do use a small fixed byte code dictionary (256 entries, listed
below) trained on enwik9 by byte pair
encoding. The characters "^" and "@" indicate the next letter or word
is capitalized. Brackets [ ] enclose web links and [[ ]] enclose
article links. It uses '' and ''' to enclose italics and bold, and ==
for level 2 headers. The text is already sorted by article topic and
the XML is unwrapped prior to capitalization and dictionary byte
encoding. Rare characters including UTF8 are escaped.

  "","",  // 0-1 reserved for literals, UTF8 codes
  "\n"," ","^","@",  // 2-5 whitespace and cap marks

  // 6-33 other punctuation
  ".",",","\"","#",
  "%","&","'","''","'''","(",")","*","-","/",
  ":",";","<","=","==",">","[","[[","]","]]",
  "_","{","|","}",

  // 34-43 digits
  "0","1","2","3","4","5","6","7","8","9",

  // 44-255 common lower case letter groups
  "a","ac","ad","age","al","also","am","american","an",
  "and","ar","are","as","at","ation","b","ba","be","bl",
  "bo","br","bu","by","c","ca","category","ce","census","cent",
  "ch","ci","city","cl","co","com","comp","cons","cont","county",
  "ct","ction","d","da","de","der","di","do","e","ea",
  "ec","ed","el","em","en","ent","er","es","ex","f",
  "fa","fe","fi","for","fr","from","g","ga","ge","ght",
  "gr","h","ha","have","he","hi","his","ho","household","http",
  "i","ia","ic","ie","il","image","in","ing","inte","ion",
  "is","it","j","k","ke","king","l","la","land","le",
  "li","living","ll","lo","ly","m","ma","males","man","mber",
  "me","ment","mi","mo","n","na","national","nce","nd","ne",
  "new","ng","ni","no","ns","nt","o","of","ol","on",
  "one","op","or","other","ou","over","p","pa","part","pe",
  "pl","po","population","port","pr","pres","pro","q","qu","r",
  "ra","rd","re","ri","ro","rs","rt","ry","s","sa",
  "sc","se","sh","si","so","some","sp","ss","st","states",
  "su","sup","t","ta","te","ted","ter","th","that","the",
  "there","this","ti","tion","to","tr","tt","tur","u","ul",
  "un","under","united","ur","us","use","v","ve","ver","vi",
  "w","wa","was","we","wh","which","wi","with","wo","www",
  "x","y","z"};

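The post does not say how text is segmented against this table. One
plausible strategy, shown here purely as an assumption, is greedy
longest match: at each position take the longest entry that matches.
The names kExcerpt and segment are hypothetical, and the sketch
returns the matched strings rather than the real byte codes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A few entries copied from the table above; the real table has 256.
const std::vector<std::string> kExcerpt = {
    " ", "c", "ci", "city", "e", "h", "i", "is", "r", "re", "s",
    "t", "th", "the", "there", "y"};

// Greedy longest-match segmentation (an assumed strategy). A byte with
// no matching entry is emitted on its own, standing in for a literal.
std::vector<std::string> segment(const std::string& text,
                                 const std::vector<std::string>& dict) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t best = 0;
        for (const std::string& w : dict) {
            // Empty entries never win because best starts at 0.
            if (w.size() > best && text.compare(i, w.size(), w) == 0)
                best = w.size();
        }
        if (best == 0) best = 1;  // no entry matched: single-byte literal
        out.push_back(text.substr(i, best));
        i += best;
    }
    return out;
}
```

With this excerpt, segment("there is the city", kExcerpt) produces
"there", " ", "is", " ", "the", " ", "city", so 17 bytes of text
become 7 symbols before the context model sees them.
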
On Sat, Mar 21, 2026 at 1:27 PM Matt Mahoney <[email protected]> wrote:
>
> Yes, I know I wrote a paper in 2013 estimating that automating the
> economy with AGI would cost $1 quadrillion, mostly to collect 10^17
> bits of human knowledge. This proved correct when it took companies
> with trillion dollar market caps to produce LLMs. Those are the ones
> that have access to your emails, texts, and social media posts that go
> far beyond the 10^13 bits you can suck off the public internet. Even
> so, we are less than 1% of the way there, which is why AI has not put
> a dent in the employment rate yet.
>
> But that's not what I'm trying to build. I'm building a human level
> small language model (SLM). It's not a probabilistic logic knowledge
> base like the ones that Ben Goertzel, YKY, and Pei Wang were
> developing before they left the group when LLMs proved in 2023 that
> all you need to pass the Turing test is text prediction, like I
> predicted in a 1999 paper. That's basically the Hutter prize. I do
> appreciate that 2 of the 3 Hutter prize committee members (me and
> James Bowery) are still active here, and others (Immortal Discoveries,
> or submerge on encode.su) are pursuing this approach as well.
>
> My math mostly agrees with Turing's 1950 prediction that a computer
> with 10^9 bits of memory, but no faster than current technology
> (mechanical relays are as fast as neurons) would win the imitation
> game (now known as the Turing test) by 2000. His forecast of Moore's
> law was remarkably prescient, given that Gordon Moore didn't state it
> until 1965. Turing's paper was published just after Shannon invented
> information theory and estimated the entropy of English at about 1 bit
> per character, consistent with the top results on my large text
> benchmark. It also predates Landauer's 1986 estimate of 10^9 bits of
> human long term memory capacity, although Turing could have easily
> estimated how many words we process in a lifetime.
>
> My math says an SLM can be implemented on a single CPU at 10,000 x real
> time, compressing a lifetime of learning into a day. You have a
> vocabulary of about 50K tokens with a Zipf distribution, where the
> n'th most frequent word has a frequency of about 0.1/n. You have a
> short term memory of about 7 tokens, where low frequency tokens
> persist longer. You have a 50K by 50K matrix mapping short term memory
> to the predicted token, with the sparse parts of the matrix
> implemented as hidden layers in a neural network to cut the parameter
> space to 10^9. Updates should be fast because the learning rate is
> only about 4 bits per token, so only a small number of parameters need
> to be updated. Predictions should likewise be fast if we implement an
> attention mechanism in the hidden layer (like in transformers), where
> all but the few most active neurons are set to 0.
>
> But it is still hard. I suppose if it wasn't, we would have solved AI
> 23 years earlier. Two months ago I released a version that compressed
> enwik9 to 145 MB in 10 minutes using article sorting by topic, XML
> unwrapping, capitalization encoding, a tiny dictionary, and a pure
> linear context model. The plan is to mix these predictions with the
> language model, which I have yet to write. Instead I spent the last 2
> months refining the context model. I had a bunch of ideas to
> dramatically improve speed or memory usage, but ended up spending days
> to implement and debug them, only to see it either didn't work or the
> improvement was so marginal it wasn't worth the effort. As the program
> grows, each update is like brain surgery, carefully changing 1 or 2
> lines and testing in case I broke something and have to go back. In 2
> months, all I have to show for it is 142 MB in 20 minutes, a tiny
> movement along the Pareto frontier that isn't even worth releasing. I
> need to get to 110 MB, but as I do, testing times will go from minutes
> to hours to days.
>
> There's something I'm not getting. Why does the brain need 10^15
> synapses to store 10^9 bits? Maybe it's a speed optimization, like how
> a server farm has a million copies of Linux, or your body has 10^13
> copies of your DNA. Or is it something else? Is it the reason we
> didn't solve AI in 2000?
>
> --
> -- Matt Mahoney, [email protected]



-- 
-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tc9fe35df94409188-M4a882493486b42919667b6ca
