The top text compressors use simple models of semantics and grammar that group words into categories, like fuzzy equivalence relations. For semantics, the rules are reflexive (A predicts A, but not too close: the probability of a repeat peaks 50-100 bytes away), symmetric (A..B predicts both A..B and B..A), and transitive (A..B and B..C predict A..C). For grammar, AB predicts AB (n-grams), and AB, CB, CD predict AD (learning the rule {A,C}{B,D}). Even the simplest compressors like zip model n-grams. The top compressors also learn groupings. For example, "white house", "white car", and "red house" predict the novel "red car". In cmix variants, the dictionary would be ordered "white red...house car", and whole groups are taken as contexts. The dictionary can be built automatically by clustering words in context space.
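The grouping rule {A,C}{B,D} can be sketched in a few lines of Python. This is a toy illustration, not how cmix actually builds its dictionary; the bigram data and function names are invented for the example:

```python
# Toy sketch of grammar induction by grouping: AB, CB, CD predicts AD.
# Words that share a context are treated as interchangeable, so the
# novel pair "red car" is predicted even though it never occurred.
from collections import defaultdict

bigrams = [("white", "house"), ("white", "car"), ("red", "house")]

# For each word, record the words it has preceded and followed
# (its "context space").
right_of = defaultdict(set)
left_of = defaultdict(set)
for a, b in bigrams:
    right_of[a].add(b)
    left_of[b].add(a)

def predicted(a, b):
    """Predict the pair AB either directly (n-gram) or via grouping:
    if A shares a right-context with some C, and CB was seen."""
    if b in right_of[a]:
        return True  # seen directly
    for c in left_of.keys() | right_of.keys():
        if right_of[a] & right_of[c] and b in right_of[c]:
            return True  # A and C grouped by a shared context
    return False

print(predicted("red", "car"))    # True: novel pair, found via grouping
print(predicted("red", "piano"))  # False: no shared context supports it
```

A real compressor would replace the set intersection with a similarity threshold over co-occurrence counts, but the clustering-in-context-space idea is the same.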
Compressors model semantics using sparse contexts. To get the reverse prediction (A..B predicts B..A) and the transitive prediction (A..B and B..C predict A..C), you use a short term memory like an LSTM, both for learning associations and as context for prediction.

Humans use lexical, semantic, and grammar induction to predict text. For example, how do you predict "The flob ate the glork. What do flobs eat?" Your semantic model learned to associate "flob" with "glork", "eat" with "glork", and "eat" with "ate". Your grammar model learned that "the" is usually followed by a noun and that nouns are sometimes followed by the plural "s". Your lexical model tells you that there is no space before the "s". Thus you, and a good language model, predict the novel word "glorks".

All of this has a straightforward implementation with neural networks. It takes a lot of computation because you need on the order of as many parameters as you have bits of training data, around 10^9 for human level. Current LLMs are far beyond that, with 10^13 bits or so. The basic operations are prediction, y = Wx, and training, W += yx^t, where x is the input word vector, y is the output word probability vector, W is the weight matrix, and ^t means transpose. Both operations require similar computation (on the order of the number of parameters, |W|), but training requires more hardware because you are compressing a million years worth of text in a few days. Prediction for chatbots only has to be real time, about 10 bits per second.

And as I have been saying since 2006, text prediction (measured by compression) is all you need to pass the Turing test, and therefore all you need to appear conscious or sentient.

-- Matt Mahoney, mattmahone...@gmail.com

Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T682a307a763c1ced-M0a4075c52c080ace6a702efa
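The prediction and training steps can be sketched with one-hot word vectors standing in for real embeddings (the five-word training text and all names here are invented; raw counts replace proper normalization):

```python
# Toy sketch of prediction y = Wx and Hebbian training W += y x^T.
# One-hot vectors and count accumulation stand in for the learned
# embeddings and gradient updates a real language model would use.
import numpy as np

vocab = ["the", "flob", "ate", "glork"]
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

def one_hot(w):
    x = np.zeros(n)
    x[idx[w]] = 1.0
    return x

W = np.zeros((n, n))  # |W| parameters; predict and train both cost O(|W|)

def train(context, target):
    """Hebbian outer-product update: W += y x^T."""
    global W
    W += np.outer(one_hot(target), one_hot(context))

def predict(context):
    """Prediction y = Wx, normalized into a word probability vector."""
    y = W @ one_hot(context)
    p = y / y.sum() if y.sum() else np.full(n, 1.0 / n)
    return dict(zip(vocab, p))

# Train on "the flob ate the glork" as (previous word -> next word) pairs.
words = ["the", "flob", "ate", "the", "glork"]
for a, b in zip(words, words[1:]):
    train(a, b)

print(predict("the"))  # probability split evenly between "flob" and "glork"
```

Note that both operations touch every entry of W, which is why their cost is the same; training only needs more hardware because of how much text must be pushed through it.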