On Wed, May 29, 2024 at 9:37 AM Matt Mahoney <mattmahone...@gmail.com> wrote:
>
> On Tue, May 28, 2024 at 7:46 AM Rob Freeman <chaotic.langu...@gmail.com> wrote:
> >
> > Now, let's try to get some more detail. How do compressors handle the
> > case where you get {A,C} on the basis of AB, CB, but you don't get,
> > say, AX, CX? Which is to say, the rules contradict.
>
> Compressors handle contradictory predictions by averaging them.
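As a minimal sketch of what "averaging contradictory predictions" could look like (hypothetical code, not any particular compressor's implementation): two context models give incompatible probability estimates for the next symbol, and the mixer takes a weighted average in the probability domain.

```python
# Hypothetical sketch: two context models disagree about the next symbol
# (the {A,C} case above). A compressor mixes their probability estimates
# rather than picking one.

def mix(predictions, weights):
    """Weighted average of per-symbol probability estimates."""
    symbols = predictions[0].keys()
    total = sum(weights)
    return {s: sum(w * p[s] for w, p in zip(weights, predictions)) / total
            for s in symbols}

# Model 1 (trained on "AB", "CB") predicts B strongly; model 2 has seen
# contexts where X follows, and disagrees.
m1 = {"B": 0.9, "X": 0.1}
m2 = {"B": 0.2, "X": 0.8}

mixed = mix([m1, m2], weights=[1.0, 1.0])
print(mixed)  # roughly {'B': 0.55, 'X': 0.45}: the contradiction is averaged away
```

Real context-mixing compressors weight the models adaptively (e.g. by recent prediction accuracy) rather than equally, but the principle is the same: contradiction is resolved by blending, not by choosing.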
That's what I thought.

> > "Halle (1959, 1962) and especially Chomsky (1964) subjected
> > Bloomfieldian phonemics to a devastating critique."
> >
> > Generative Phonology
> > Michael Kenstowicz
> > http://lingphil.mit.edu/papers/kenstowicz/generative_phonology.pdf
> >
> > But really it's totally ignored. Machine learning does not address
> > this to my knowledge. I'd welcome references to anyone talking about
> > its relevance for machine learning.
>
> Phonology is mostly irrelevant to text prediction.

The point was that it invalidated the method of learning linguistic structure by distributional analysis at any level. If your rules for phonemes contradict, what doesn't contradict?

Which is a pity, because we still don't have a clue what governs language structure. The best we've been able to come up with is crude hacks: dragging a chunk of important context behind like a ball and chain in an LSTM, or multiplexing pre-guessed "tokens" together in a big matrix with "self-attention".

Anyway, your disinterest doesn't invalidate my claim that this result, pointing to contradictions produced by distributional-analysis learning procedures for natural language, is totally ignored by current machine learning, which, implicitly or otherwise, uses those same distributional-analysis learning procedures.

> Language evolved to be learnable on neural networks faster than our
> brains evolved to learn language. So understanding our algorithm is
> important.
>
> Hutter prize entrants have to prebuild a lot of the model because
> computation is severely constrained (50 hours in a single thread with
> 10 GB memory). That includes a prebuilt dictionary. The human brain
> takes 20 years to learn language on a 10 petaflop, 1 petabyte neural
> network. So we are asking quite a bit.

Neural networks may have finally come close to human performance at prediction. But prediction is a problem where you can cover a multitude of sins with raw memory.
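As a hypothetical illustration of what "covering sins with raw memory" amounts to: a predictor that never resolves anything, it just records every continuation seen after every context and predicts by lookup. The class name, corpus string, and context order below are all invented for the sketch.

```python
# Hypothetical sketch: prediction by raw memory. Every contradiction is
# simply stored alongside the context that selects it; no generalizing
# algorithm is involved.

from collections import Counter, defaultdict

class MemoryPredictor:
    def __init__(self, order):
        self.order = order                 # context length in symbols
        self.table = defaultdict(Counter)  # context -> continuation counts

    def train(self, text):
        for i in range(len(text) - self.order):
            ctx = text[i:i + self.order]
            self.table[ctx][text[i + self.order]] += 1

    def predict(self, ctx):
        counts = self.table[ctx[-self.order:]]
        return counts.most_common(1)[0][0] if counts else None

# "AB" and "CB" are both remembered; so is a context where X follows A.
p = MemoryPredictor(order=2)
p.train("xABxCBxyAXy")
print(p.predict("xA"))  # 'B': this two-symbol context saw B
print(p.predict("yA"))  # 'X': a different context selects the contradiction
```

Scaled up, this is pure memory: the table grows with the data and generalizes not at all, which is exactly the trade-off at issue in comparing 10 GB against a petabyte.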
Something at which computers trivially exceed humans, by as many orders of magnitude as you can stack server farms. You can just remember each contradiction, including the context which selects it. No superior algorithm required, and certainly none in evidence. (Chinese makes a similar trade-off, swapping internal mnemonic sound structure within tokens for prodigious memory requirements for the tokens themselves.)

Comparing 10 GB with 1 petabyte seems disingenuous. I strongly doubt any human can recall as much as 10 GB of text. (All of Wikipedia is currently ~22 GB compressed, without media. Even to read it all is estimated at 47 years, allowing 8 hours of sleep a night: https://www.reddit.com/r/theydidthemath/comments/80fi3w/self_how_long_would_it_take_to_read_all_of/. So forget 20 years to learn it; it would take more than 20 years just to read all the memory you give Prize entrants.)

But I would argue our prediction algorithms totally fail to do any sort of job with language structure. Whereas you say babies start to structure language before they can walk? (Walking being something else computers still have problems with.) And far from stopping at word segmentation, babies go on to build quite complex structures, including new ones, all the time.

Current models do nothing with structure: not at a human "data age" of 8-10 months, not at 77 years (680k hours of audio to train "Whisper", ~77 years: https://www.thealgorithmicbridge.com/p/8-features-make-openais-whisper-the. Perhaps some phoneme structure might help there...) The only structure is "tokens". I don't even think current algorithms do max entropy to find words. They just start out with "tokens", guessed before training. Here's Karpathy and LeCun talking about it:

Yann LeCun @ylecun · Feb 21
Text tokenization is almost as much of an abomination for text as it is for images. Not mentioning video.

...

Replying to @karpathy
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization.
We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
https://x.com/ylecun/status/1760315812345176343

By the way, talking about words: that's another thing which seems to have contradictory structure in humans. Native Chinese speakers, for example, agree on what constitutes a "word" less than 70% of the time:

"Sproat et. al. (1996) give empirical results showing that native speakers of Chinese frequently agree on the correct segmentation in less than 70% of the cases."
https://s3.amazonaws.com/tm-town-nlp-resources/ch2.pdf

I guess that will be:

Sproat, Richard W., Chilin Shih, William Gale, and Nancy Chang. 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377-404.

But the interesting one is relational encodings, of the {A,C} type you described earlier. Not least because that is the one which can be associated with meaning, including new meaning. And you say that, in your experience, it is dealt with by "averaging". I guessed as much.

I suspect LLMs do better by indexing contradictory groupings on context. Which blows out the memory. And is sterile, fixed at the time of training. And locked to the original choice of "tokens". And locked to the length of the original "context window", in terms of that original choice of "tokens". And it obscures any "meaning" interpretation. But surely indexing contradictory groupings on context does help to predict a little better.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T682a307a763c1ced-M1abdfefdeeba7ce2b5d8be93
Delivery options: https://agi.topicbox.com/groups/agi/subscription