On Tue, May 28, 2024 at 7:46 AM Rob Freeman <chaotic.langu...@gmail.com> wrote:

> Now, let's try to get some more detail. How do compressors handle the
> case where you get {A,C} on the basis of AB, CB, but you don't get,
> say AX, CX? Which is to say, the rules contradict.

Compressors handle contradictory predictions by averaging them,
weighted both by each model's implied confidence (predictions near 0
or 1) and by its historical success rate. Although transformer-based
LLMs predict a vector of word probabilities, PAQ-based compressors
like CMIX predict one bit at a time, which is equivalent but simpler
to implement. You could have hundreds of context models based on the
last n bytes or words (the lexical model), short-term memory or
sparse models (semantics), and learned word categories (grammar). The
context includes the already predicted bits of the current word, as
when you guess the next word one letter at a time.
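
Here is a minimal sketch of what one such context model might look
like (illustrative Python, not CMIX's actual code; the class and
method names are made up). It predicts the next bit from the last few
bytes plus the already seen bits of the current byte, using simple
0/1 counts.

from collections import defaultdict

class ContextModel:
    # One bit-level context model: the context is the last `order`
    # bytes plus the bits of the partially seen current byte.
    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(lambda: [1, 1])  # Laplace smoothing

    def _key(self, history, partial_bits):
        return (tuple(history[-self.order:]), tuple(partial_bits))

    def predict(self, history, partial_bits):
        c0, c1 = self.counts[self._key(history, partial_bits)]
        return c1 / (c0 + c1)  # estimated P(next bit = 1)

    def update(self, history, partial_bits, bit):
        self.counts[self._key(history, partial_bits)][bit] += 1

A real compressor would run hundreds of these with different context
functions and feed all their predictions into the mixer below.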

The context model predictions are mixed using a simple neural network
with no hidden layers:

p =  squash(w stretch(x))

where x is the vector of input predictions in (0,1), w is the weight
vector, w stretch(x) is the dot product of the two, stretch(x) =
ln(x/(1-x)) applied componentwise, squash(x) = 1/(1 + e^-x) is its
inverse, and p is the final bit prediction. The effect of stretch()
and squash() is to favor predictions near 0 or 1. For example, if one
model guesses 0.5 and another is 0.99, the average would be about 0.9.
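With equal weights of 0.5, for instance, w stretch(x) = 0.5*0 +
0.5*ln(99), or about 2.3, and squash(2.3) is about 0.91.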
The weights are then adjusted to favor whichever models were closest:

w := w + L stretch(x) (y - p)

where y is the actual bit (0 or 1), y - p is the prediction error,
and L is the learning rate, typically around 0.001. Each weight w_i
is updated using the corresponding stretch(x_i).
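
Here is a minimal sketch of that mixer in Python (a simplified
PAQ-style logistic mixer, not the actual CMIX code; the initial
weights and the tiny usage example are assumptions for illustration):

import math

def stretch(p):
    # log-odds transform: ln(p / (1 - p))
    return math.log(p / (1 - p))

def squash(x):
    # inverse of stretch: 1 / (1 + e^-x)
    return 1 / (1 + math.exp(-x))

class Mixer:
    def __init__(self, n_models, learning_rate=0.001):
        self.w = [0.5] * n_models   # initial weights (an assumption)
        self.L = learning_rate

    def mix(self, x):
        # x: each model's probability that the next bit is 1, in (0,1)
        self.st = [stretch(p) for p in x]
        self.p = squash(sum(wi * si for wi, si in zip(self.w, self.st)))
        return self.p

    def update(self, y):
        # y: the actual bit; favor the models that were closest
        err = y - self.p
        self.w = [wi + self.L * si * err
                  for wi, si in zip(self.w, self.st)]

m = Mixer(2)
print(round(m.mix([0.5, 0.99]), 2))  # ~0.91, the example above
m.update(1)  # the bit was 1, so the 0.99 model gains weight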

> "Halle (1959, 1962) and especially Chomsky (1964) subjected
> Bloomfieldian phonemics to a devastating critique."
>
> Generative Phonology
> Michael Kenstowicz
> http://lingphil.mit.edu/papers/kenstowicz/generative_phonology.pdf
>
> But really it's totally ignored. Machine learning does not address
> this to my knowledge. I'd welcome references to anyone talking about
> its relevance for machine learning.

Phonology is mostly irrelevant to text prediction. But an important
lesson is how infants learn to segment continuous speech at around
8-10 months, before they learn their first word at around 12 months.
This matters for learning languages written without spaces, like
Chinese (a word is 1 to 4 symbols, each representing a syllable). The
solution is simple: word boundaries occur where the next symbol is
less predictable, reading either forward or backward. I did this
research in 2000. https://cs.fit.edu/~mmahoney/dissertation/lex1.html
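
Here is a minimal sketch of that idea, assuming a plain bigram
character model; the threshold and the rule for combining the two
directions are assumptions, and the experiments at the link above
used an adaptive model, so treat this only as an illustration:

from collections import Counter

def boundaries(text):
    # Count character unigrams and bigrams over the text itself.
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    cuts = []
    for i in range(1, len(text)):
        pair = (text[i - 1], text[i])
        p_fwd = bigrams[pair] / unigrams[text[i - 1]]  # P(next | prev)
        p_bwd = bigrams[pair] / unigrams[text[i]]      # P(prev | next)
        # Propose a boundary where the transition is poorly predicted
        # in both directions (the 0.1 threshold is an assumption).
        if p_fwd < 0.1 and p_bwd < 0.1:
            cuts.append(i)
    return cuts

With enough text, transitions inside words score high in both
directions while transitions across word boundaries score low.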

Language evolved to be learnable by neural networks faster than our
brains evolved to learn language. So understanding the brain's
learning algorithm is important.

Hutter Prize entrants have to prebuild a lot of the model because
computation is severely constrained (50 hours on a single thread with
10 GB of memory). That includes a prebuilt dictionary. The human
brain takes 20 years to learn language on a 10 petaflop, 1 petabyte
neural network. So we are asking quite a bit.

-- 
-- Matt Mahoney, mattmahone...@gmail.com
