If there were a mathematical way to derive the data, then that equation would be the compressed representation and the result would be much smaller. I don't believe there is one, because the data comes from human brains. We can compress the XML, HTML, wiki markup, and the automatically generated articles about places from the US census data, but that's about it.
I don't understand the rest of your questions. The model I proposed models the language capabilities of a 3-year-old child. To model more complex grammar like sentences, paragraphs, and math, I would need a multi-layer neural network that might exceed the Hutter Prize hardware limits.

-- Matt Mahoney, [email protected]

On Sat, Jan 24, 2026, 3:10 AM Quan Tesla <[email protected]> wrote:

> Just thinking here...
>
> Is the notion that words could be mathematically constructed/derived too
> far-fetched for this type of application?
>
> Classification seems to be a major overhead.
>
> Where is the boundary between compressor and interpreter?
>
> Could compression overburden be outsourced in real time to the
> compressor-interpreter staging phase, perhaps in working memory?
>
> With regard to the notion of meaning being mathematically derivable, as
> opposed to being hard-coded or constructed, I'm referring here to symbolic
> systems.
>
> As a simplistic example, it reminds me conceptually of employing the power
> of data definitions, lookup tables, and code-reuse 'include' calls in
> structured programming, only not formally constructed.
>
> Is preconstruction acceptable in principle?
>
> On Sat, 24 Jan 2026, 09:52 Matt Mahoney, <[email protected]> wrote:
>
>> I think what you are doing with syntax and semantic trees is using a more
>> efficient representation of sparse matrices. This is useful because the
>> vocabulary has a Zipf distribution, where the n'th most frequent word has
>> a frequency of around 0.1/n, so you avoid having a huge 2-D array filled
>> with mostly zeros. The problem with this, and also with the context models
>> I am using, is that you can't easily transpose the matrix (reflect it
>> across the diagonal). If the next-word matrix is A, where A_ij is the
>> frequency of word i followed by word j, then the transpose (reading
>> columns instead of rows) gives you the previous-word distribution.
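Matt's matrix picture (the next-word matrix A, its transpose, and the products AA^t and A^tA discussed just below) can be sketched on a toy corpus. The corpus and code here are my own illustration, not code from the thread; a dense matrix is used for clarity, which is exactly the representation the sparse schemes avoid:

```python
# Toy sketch (my illustration, not code from this thread) of the
# next-word count matrix A and the products A*A^t and A^t*A.

corpus = "the cat ate the fish the dog ate the bone".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# A[i][j] = how often word i is immediately followed by word j.
A = [[0] * n for _ in range(n)]
for w1, w2 in zip(corpus, corpus[1:]):
    A[idx[w1]][idx[w2]] += 1

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

At = transpose(A)          # row i of A^t is word i's previous-word counts

sim_next = matmul(A, At)   # large where two words share next words
sim_prev = matmul(At, A)   # large where two words share previous words

print(sim_next[idx["cat"]][idx["dog"]])  # 1: "cat ate" and "dog ate"
print(sim_prev[idx["cat"]][idx["dog"]])  # 1: "the cat" and "the dog"
```

On this toy corpus, "cat" and "dog" score nonzero in both products because they share the next word "ate" and the previous word "the", which is the sense in which AA^t + A^tA groups nouns with nouns.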
>> Then AA^t gives you words that have the same next word, like "cat ate"
>> and "dog ate", and A^tA gives you words with the same previous word, like
>> "the cat" and "the dog". Together, AA^t + A^tA maps nouns to nouns and
>> verbs to verbs, which is useful for predicting unseen sequences: if you
>> see "the blorg ate", you already know that you can add an "s" to make the
>> plural "blorgs", assuming your tokenizer recognizes common suffixes like
>> -s as separate tokens.
>>
>> What I would probably do is use a pair of small rectangular 2-D arrays
>> for high-frequency words along the top and left edges, and a context
>> model or array of lists for lower-frequency words, to save space and time.
>>
>> You can see the actual word distribution at
>> https://mattmahoney.net/dc/textdata.html
>>
>> It should be possible to compress enwik9 from 1000 MB to 370 MB with just
>> a lexical model, without any next-token prediction. That is just a
>> dictionary with word frequencies and the rules for parsing the text into
>> tokens. enwik9 has around 200M tokens and a vocabulary of 1.4M types,
>> half of which occur only once, but I would probably use only the most
>> frequent 30K to 50K types in my model and spell out the rare words.
>>
>> -- Matt Mahoney, [email protected]
>>
>> On Fri, Jan 23, 2026, 1:36 AM <[email protected]> wrote:
>>
>>> https://agi.topicbox.com/groups/agi/T0518db1e3a0c25c5/preprocessor-for-hutter-prize
>>> Good messages here. I just saw them now. Still need to read them.
>>> Yes (lol), "water water" is rare. I also have mine lowering there right
>>> after it sees a word, to prime it stronger.
>>>
>>> Matt, here is how my semantic model works, below. Is mine better or
>>> worse than yours?
>>>
>>> My current implementation just makes a 2-byte-depth tree. The first
>>> breadth (the first root bytes) is the right-hand word of each word pair
>>> (e.g. "dog sleeps"): I take every 2 words in the dataset and switch their
>>> order, just so I can store them like "sleeps dog". This makes building
>>> the word relations faster, because the root starts with the shared proofs
>>> first. After the tree is built, I have a list of 256 lists, each holding
>>> 256 zeros. This is the relation score, 0.0-1.0, for every byte to every
>>> byte. Then I check every root byte. Say the root is "sleeps", with dog,
>>> cat, horse, and human after it in the mini tree: I go to the 256 lists
>>> and look up dog, cat, horse, and human, since they are what follows this
>>> root, and give each pair a score. The score is calculated this way. How
>>> many counts does dog have in the dataset? Say 550. How many does cat
>>> have? 1000. If dog is lower, I normalize it; in this example dog is about
>>> 2x rarer. Then I check how many they share in the mini tree: say "sleeps
>>> dog" x2 and "sleeps cat" x10. Since dog is lower in total counts, I raise
>>> "sleeps dog" artificially from 2 to 2*2=4, so dog and cat share 4. I
>>> store this in dog's list as 4 / total counts, because if dog and cat
>>> share 4 but dog has 1 billion counts in the dataset (and so does cat,
>>> after normalization), then while they share 4, they really share almost
>>> nothing; if they were 100% related they would share 1 billion as well. So
>>> it is stored as a fraction of the final amount. We come back to dog's
>>> list on the NEXT proof: if dog and cat share more proofs, we add more
>>> score on top of the score we just saved; so far we have only done one
>>> part. Lastly, I also downrank proofs that are too common, which improved
>>> the score.
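The normalization and scoring step described above can be sketched as follows. This is my reading of the description, not the author's code: the function name is hypothetical, and I use the exact ratio of total counts where the thread rounds it to "about 2x".

```python
# Rough sketch (my interpretation, not the author's code) of scoring how
# related two words are via one shared context word such as "sleeps".

def shared_score(ctx_a, ctx_b, total_a, total_b):
    """Score the relation of words a and b for one shared context.

    ctx_a, ctx_b: counts of a and b after the shared context word,
        e.g. "sleeps dog" x2 and "sleeps cat" x10.
    total_a, total_b: total dataset counts of a and b, e.g. 550 and 1000.
    """
    # Boost the rarer word's context count by the frequency ratio
    # (the thread rounds 1000/550 to "about 2x", turning 2 into 4).
    if total_a < total_b:
        ctx_a *= total_b / total_a
    else:
        ctx_b *= total_a / total_b
    shared = min(ctx_a, ctx_b)          # what the two words "share" here
    # Store as a fraction of the total count: sharing 4 out of a billion
    # occurrences means the words barely overlap.
    return shared / max(total_a, total_b)

score = shared_score(2, 10, 550, 1000)  # the dog/cat "sleeps" example
print(score)
```

Summing such contributions over many shared contexts ("proofs"), and downweighting very common contexts, would accumulate the 0.0-1.0 relation scores kept in the 256x256 table.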
>>> Rare words help prove 2 words are related, while common words have much
>>> less effect.
>>>
>>> I just realized something: I have read a lot about Transformers, but no
>>> one seems to have explained something I only understood tonight. The
>>> token embedding step (after Byte Pair Encoding produces token IDs) does
>>> not store only related-word dimensional vectors; the vectors also store
>>> the word's syntax. So "cat" can be in a "space" near animal (semantic)
>>> words, but the same vector can simultaneously hold "info" placing "cat"
>>> near words with the same syntactic role (not related words). Then, as
>>> these embeddings flow up the Transformer, information is added, making
>>> the last vector of the user's prompt (apparently only that vector is
>>> used to get the next words after all computation is done) more and more
>>> specialized and clear, until at the last layer it is unembedded and
>>> checked against every word vector in the vocabulary. So if it had 100
>>> animal names to its left in the prompt, it will predict one of those -
>>> not a next word but a related word - while if it mostly had syntactic
>>> sentence-flow words next to it, it will predict a word of that type,
>>> corresponding of course to the sentence. So it is not doing priming
>>> "and" syntax conditioning "and" related-word "mechanisms" as separate
>>> steps; it is just using dimensional vectors that already carry both
>>> kinds of information.
>>>
>>> Hmm, let's compare.
>>>
>>> My way:
>>> Build: 1. build a syntax tree, 2. build a semantic tree.
>>> Use: 1. translate the last word(s) to get next words (the tree can also
>>> be searched without translation), 2. translate the last words to vote on
>>> next words.
>>>
>>> Their way:
>>> Build: 1. build dimensional syntax vocab vectors, 2. also build the
>>> semantics into those same vectors, still using a separate method.
>>> Use: steps 1 and 2 are not really separate here. It does a few more
>>> things: it uses self-attention (each word looks at each word), which
>>> lets the last token's vector "become" either a semantic thing (if all
>>> the words to its left are related words, this makes sense and is OK) or
>>> a syntactic one (if the words to its left are not related). So it does
>>> priming and next-word prediction in one go, but after this
>>> self-attention they also send the vectors up into an FFN and the like,
>>> which is another step. I don't know exactly why.
>>
> *Artificial General Intelligence List <https://agi.topicbox.com/latest>*
> / AGI / see discussions <https://agi.topicbox.com/groups/agi> +
> participants <https://agi.topicbox.com/groups/agi/members> +
> delivery options <https://agi.topicbox.com/groups/agi/subscription>
> Permalink
> <https://agi.topicbox.com/groups/agi/T2d9ee7e1ee2cd20c-M09f455cfc27ac09b15f53164>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T2d9ee7e1ee2cd20c-Mb95d91fea3e70cad7baa60f5
Delivery options: https://agi.topicbox.com/groups/agi/subscription
