The top text compressors use simple models of semantics and grammar
that group words into categories as fuzzy equivalence relations. For
semantics, the rules are reflexive: A predicts a later A (but not too
close; the probability peaks 50-100 bytes away); symmetric: A..B
predicts both A..B and B..A; and transitive: A..B and B..C predict
A..C. For grammar, AB predicts AB (n-grams), and AB, CB, CD predict AD
(learning the rule {A,C}{B,D}). Even the simplest compressors, like
zip, model n-grams. The top compressors learn groupings. For example,
"white house", "white car", and "red house" predict the novel "red
car". For cmix variants, the dictionary would be ordered like "white
red...house car", taking whole groups as contexts. The dictionary can
be built automatically by clustering words in context space, as in the
sketch below.
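
As a rough illustration (this is not how cmix builds its dictionary,
just a sketch of the idea), the following Python groups words by the
similarity of the contexts they occur in:

from collections import Counter
from math import sqrt

corpus = "white house white car red house".split()

# Context vector for each word: counts of its immediate neighbors,
# tagged by side (L = left neighbor, R = right neighbor).
ctx = {}
for i, w in enumerate(corpus):
    c = ctx.setdefault(w, Counter())
    if i > 0:
        c["L:" + corpus[i - 1]] += 1
    if i + 1 < len(corpus):
        c["R:" + corpus[i + 1]] += 1

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "red" resembles "white" because both precede "house", so contexts
# seen with "white" (like "car") transfer to "red": the novel "red car".
print(cosine(ctx["red"], ctx["white"]))   # positive: same group
print(cosine(ctx["red"], ctx["house"]))   # zero: different group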

Compressors model semantics using sparse contexts. To get the reverse
prediction "A..B....B..A" and the transitive prediction
"A..B....B..C...A..C", you need a short-term memory, such as an LSTM,
both for learning associations and as context for prediction.
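
For illustration, here is a minimal Python sketch in which a sliding
window stands in for the short-term memory (a real system would use an
LSTM or similar). Linking every pair of words inside the window gives
the symmetric association directly:

from collections import Counter, deque

WINDOW = 20    # short-term memory span in words (an arbitrary choice)
assoc = Counter()          # symmetric counts: assoc[a, b] == assoc[b, a]
memory = deque(maxlen=WINDOW)

def train(words):
    for w in words:
        for m in memory:          # link w with every remembered word
            assoc[m, w] += 1
            assoc[w, m] += 1      # symmetry: B..A is learned from A..B
        memory.append(w)

def predict(recent, vocab):
    # Score candidates by their association with the remembered words.
    return max(vocab, key=lambda w: sum(assoc[m, w] for m in recent))

train("the cat chased the mouse and the mouse ran".split())
print(predict(["cat"], ["mouse", "ran", "banana"]))   # -> "mouse"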

Humans use lexical, semantic, and grammar induction to predict text.
For example, how do you predict what comes after "The flob ate the
glork. What do flobs eat?"

Your semantic model learned to associate "flob" with "glork", "eat"
with "glork", and "eat" with "ate". Your grammar model learned that
"the" is usually followed by a noun and that nouns are sometimes
followed by the plural "s". Your lexical model tells you that there is
no space before the "s". Thus, you and a good language model predict
the novel word "glorks".
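
As a toy sketch, the composition can be written out by hand. The
tables here are hard-coded stand-ins for associations a real model
would learn from the text:

semantic = {                   # learned from "The flob ate the glork."
    "flob": {"glork"},         # flob is associated with glork
    "eat":  {"glork", "ate"},  # eat with glork, and eat with ate
}

def answer(subject, verb):
    # Semantic model: the word associated with both subject and verb.
    noun, = semantic[subject] & semantic[verb]
    # Grammar model: the question uses the plural ("flobs"), so the
    # answer is plural too. Lexical model: no space before the "s".
    return noun + "s"

print(answer("flob", "eat"))   # -> "glorks"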

All of this has a straightforward implementation with neural networks.
It takes a lot of computation because you need on the order of as many
parameters as you have bits of training data, around 10^9 for human
level. Current LLMs are far beyond that, with 10^13 bits or so. The
basic operations are prediction, y = Wx, and training, W += yx^t,
where x is the input word vector, y is the output word probability
vector, W is the weight matrix, and ^t means transpose. Both
operations require similar computation (on the order of the number of
parameters, |W|), but training requires more hardware because you are
compressing a million years' worth of text in a few days. Prediction
for chatbots only has to run in real time, about 10 bits per second.
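
Here is a minimal numpy sketch of those two operations on one-hot word
vectors. The vocabulary size is a made-up toy value, and the Hebbian
outer-product update stands in for the gradient-descent training that
real LLMs actually use:

import numpy as np

VOCAB = 1000                      # toy vocabulary size (assumption)
W = np.zeros((VOCAB, VOCAB))      # |W| = VOCAB^2 parameters

def one_hot(i):
    x = np.zeros(VOCAB)
    x[i] = 1.0
    return x

def predict(x):
    # y = Wx, normalized by softmax to a next-word probability vector.
    y = W @ x
    e = np.exp(y - y.max())       # subtract max for numerical stability
    return e / e.sum()

def train(x, y):
    # W += y x^t: an outer-product update. Predict and train both touch
    # every parameter once, hence the similar cost of |W| operations.
    global W
    W += np.outer(y, x)

train(one_hot(3), one_hot(7))     # one step: word 3 was followed by 7
print(predict(one_hot(3)).argmax())   # -> 7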

And as I have been saying since 2006, text prediction (measured by
compression) is all you need to pass the Turing test, and therefore
all you need to appear conscious or sentient.

--
Matt Mahoney, mattmahone...@gmail.com
