On Tue, May 28, 2024 at 7:46 AM Rob Freeman <chaotic.langu...@gmail.com> wrote:
> Now, let's try to get some more detail. How do compressors handle the
> case where you get {A,C} on the basis of AB, CB, but you don't get,
> say AX, CX? Which is to say, the rules contradict.

Compressors handle contradictory predictions by averaging them, weighted both by the implied confidence of predictions near 0 or 1 and by each model's historical success rate. Transformer-based LLMs predict a vector of word probabilities; PAQ-based compressors like CMIX instead predict one bit at a time, which is equivalent but simpler to implement. You could have hundreds of context models: models based on the last n bytes or words (the lexical model), short-term-memory or sparse models (semantics), and learned word categories (grammar). The context includes the already-predicted bits of the current word, as when you guess the next word one letter at a time.

The context model predictions are mixed by a simple neural network with no hidden layer:

  p = squash(w · stretch(x))

where x is the vector of input predictions in (0,1), w is the weight vector, stretch(x) = ln(x/(1-x)), squash(x) = 1/(1 + e^-x) is its inverse, and p is the final bit prediction. The effect of stretch() and squash() is to favor predictions near 0 or 1. For example, if one model guesses 0.5 and another 0.99, the mixed prediction is about 0.9, not the arithmetic mean of about 0.75. The weights are then adjusted to favor whichever models were closest:

  w := w + L · stretch(x) · (y - p)

where y is the actual bit (0 or 1), y - p is the prediction error, and L is the learning rate, typically around 0.001.

> "Halle (1959, 1962) and especially Chomsky (1964) subjected
> Bloomfieldian phonemics to a devastating critique."
>
> Generative Phonology
> Michael Kenstowicz
> http://lingphil.mit.edu/papers/kenstowicz/generative_phonology.pdf
>
> But really it's totally ignored. Machine learning does not address
> this to my knowledge. I'd welcome references to anyone talking about
> its relevance for machine learning.
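Returning to the mixing described above, here is a minimal Python sketch of logistic mixing with the weight update (the function names and the two-model example are mine, not CMIX internals):

```python
import math

def stretch(p):
    """Map a probability in (0,1) to the real line: ln(p/(1-p))."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Inverse of stretch: the logistic function 1/(1+e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def mix(probs, weights):
    """Combine bit predictions as a weighted sum in the stretched domain."""
    return squash(sum(w * stretch(x) for w, x in zip(weights, probs)))

def update(probs, weights, p, y, rate=0.001):
    """w := w + L * stretch(x) * (y - p), favoring models that were right."""
    return [w + rate * stretch(x) * (y - p) for w, x in zip(weights, probs)]

# Two models: one uncertain (0.5), one confident (0.99).
probs = [0.5, 0.99]
weights = [0.5, 0.5]
p = mix(probs, weights)          # about 0.91: the confident model dominates
weights = update(probs, weights, p, y=1)
```

Note that stretch(0.5) = 0, so an uncertain model contributes nothing to the sum and its weight is left unchanged by the update; only models that commit toward 0 or 1 move the mixture and get credit or blame.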
Phonology is mostly irrelevant to text prediction. But an important lesson is how infants learn to segment continuous speech at around 8-10 months, before they learn their first word at around 12 months. This matters for learning languages written without spaces, like Chinese (a word is 1 to 4 symbols, each representing a syllable). The solution is simple: word boundaries occur where the next symbol is less predictable, reading either forward or backward. I did this research in 2000.
https://cs.fit.edu/~mmahoney/dissertation/lex1.html

Language evolved to be learnable on neural networks faster than our brains evolved to learn language, so understanding our learning algorithm is important. Hutter Prize entrants have to prebuild much of the model, including a dictionary, because computation is severely constrained (50 hours on a single thread with 10 GB of memory). The human brain takes 20 years to learn language on a roughly 10 petaflop, 1 petabyte neural network. So we are asking quite a bit.

--
-- Matt Mahoney, mattmahone...@gmail.com

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T682a307a763c1ced-M1f60044363c6d90c81505bcc
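P.S. The segmentation rule above (boundaries where the next symbol is less predictable) can be illustrated with a toy Python sketch. This is my own construction, not the method from the 2000 experiments, and it uses forward prediction only, measuring predictability as the number of distinct symbols seen after each n-gram context:

```python
from collections import defaultdict

def successor_counts(text, n=2):
    """Count the distinct symbols that follow each n-gram context."""
    succ = defaultdict(set)
    for i in range(len(text) - n):
        succ[text[i:i+n]].add(text[i+n])
    return {ctx: len(s) for ctx, s in succ.items()}

def segment(text, n=2, threshold=2):
    """Insert a boundary where the branching factor after the preceding
    context reaches the threshold, i.e. where the next symbol is less
    predictable."""
    counts = successor_counts(text, n)
    out = [text[:n]]
    for i in range(n, len(text)):
        if counts.get(text[i-n:i], 0) >= threshold:
            out.append(' ')
        out.append(text[i])
    return ''.join(out)

print(segment("dogcatdogdogcatcat"))  # prints "dog cat dog dog cat cat"
```

In this toy corpus the contexts "og" and "at" are each followed by more than one symbol, so boundaries fall after "dog" and "cat"; within a word the next symbol is fully determined and no boundary is inserted.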