Matt, Nice breakdown. You've actually worked with language models, which makes it easier to bring the discussion back to concrete examples.
On Tue, May 28, 2024 at 2:36 AM Matt Mahoney <mattmahone...@gmail.com> wrote:
>
> ...For grammar, AB predicts AB (n-grams),

Yes, this looks like what we call "words". Repeated structure. No novelty. And nothing internal we can equate to "meaning" either. Only meaning by association.

> and AB, CB, CD, predicts AD (learning the rule
> {A,C}{B,D}).

This is the interesting one. It actually creates new meaning, in a sense. You can think of "meaning" as a way of grouping things which makes good predictions. And indeed, those gap-filler sets {A,C} do pull together sets of words that we intuitively associate with similar meaning. These are also the sets that the HNet paper identifies as having "meaning" independent of any fixed pattern. A pattern can be new, and so long as it makes the same shared predictions {B,D}, for any such set {B,D...}, {X,Y...}, we can think of it as having "meaning", based on the fact that arranging the world that way makes those shared predictions.

(Even moving beyond language, you can say the atoms of a ball share the meaning of "ball", based on the fact that they fly through the air together and bounce off walls together. It's a way of defining what it "means" to be a "ball".)

Now let's try to get some more detail. How do compressors handle the case where you get {A,C} on the basis of AB, CB, but you don't get, say, AX, CX? Which is to say, the rules contradict. Sometimes A and C are interchangeable, but not other times. You want to trigger the "rule" so you can capture the symmetries. But you can't make a fixed rule saying {A,C}, because the symmetries only apply in particular sub-sets of contexts. (I'll try to make this concrete in code below.)

You get a lot of this in natural language. There are many such shared-context symmetries in language, but they contradict. Or they're "entangled". You get one by ordering contexts one way, and another by ordering contexts another way, but you can't get both at once, because you can't order contexts both ways at once.

I later learned these contradictions were observed even at the level of phonemes, and that this was crucial to Chomsky's argument, back in the '50s, that grammar could not be "learned". It essentially broke the consensus in linguistics, and the field remains split into squabbling sub-fields over this result to this day. That's why theoretical linguistics contributes essentially nothing to contemporary machine learning. Has anyone ever wondered why linguists don't tell us how to build language models? Even the Chomsky hierarchy cited by James' DeepMind paper, from the "learning" point of view, is essentially a misapprehension of what Chomsky concluded (that observable grammar contradicts, so formal grammar can't be learned).

A reference available on the Web I've been able to find is this one:

"Halle (1959, 1962) and especially Chomsky (1964) subjected Bloomfieldian phonemics to a devastating critique."

Generative Phonology
Michael Kenstowicz
http://lingphil.mit.edu/papers/kenstowicz/generative_phonology.pdf

But really it's totally ignored. Machine learning does not address this to my knowledge. I'd welcome references to anyone talking about its relevance for machine learning. I'm sure all the compression algorithms submitted to the Hutter Prize ignore this. Maybe I'm wrong. Have any addressed it? They probably just regress to some optimal compromise, and don't think about it too much.

If we choose not to ignore this, what do we do? Well, we might try to "learn" all these contradictions, indexed on context. I think this is what LLMs do. By accident.
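To make the {A,C} contradiction concrete, here is a toy sketch in Python. This is my own framing, not anything from the HNet paper or from any actual compressor, and the function names are made up for illustration. It learns classes from shared contexts, uses them to generalize, then shows the same mechanism over-committing:

    from collections import defaultdict

    def shared_context_classes(pairs):
        """Merge left-tokens that precede the same right-token,
        on the bet that a shared context implies shared 'meaning'."""
        by_right = defaultdict(set)
        for left, right in pairs:
            by_right[right].add(left)
        return by_right  # e.g. the context _B pulls together {A, C}

    def generalised_predictions(pairs):
        """Close the observed pairs under the learned classes: if A and C
        are interchangeable before B, A inherits C's other contexts."""
        rights_of = defaultdict(set)
        for left, right in pairs:
            rights_of[left].add(right)
        predicted = set(pairs)
        for cls in shared_context_classes(pairs).values():
            pooled = set().union(*(rights_of[t] for t in cls))
            predicted |= {(t, r) for t in cls for r in pooled}
        return predicted

    # Matt's example: AB, CB, CD observed. The shared context _B forms
    # the class {A,C}, and the class predicts the unseen AD.
    obs = {("A", "B"), ("C", "B"), ("C", "D")}
    print(generalised_predictions(obs) - obs)   # {('A', 'D')}

    # The contradiction: now also observe AX, where CX never occurs.
    # The same fixed class {A,C} manufactures CX anyway.
    aug = obs | {("A", "X")}
    print(generalised_predictions(aug) - aug)   # contains ('C', 'X')

A compressor that commits to the fixed class {A,C} wins on AD and loses on CX; one that refuses the class loses the AD generalization. That's the compromise I'm asking about.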
That was the big jump, right: "attention", to index context. Then they just enumerate vast numbers of (an essentially infinite number of?) predictive patterns in one enormous training run. That's why they get so large.

No one knows, or wonders, why neural nets work for this and symbols don't, viz. the topic post of this thread. But this will be the reason. In practice LLMs learn predictive patterns, and index them on context using "attention", and it turns out there are a lot of those different predictive "embeddings", indexed on context. There is no theory. Everything is a surprise. But if you go back in the literature, there are these results about contradictions to suggest why it might be so.

And the conclusion is still either Chomsky's one: that language can't be learned, that consistent rules exist but must be innate. Or, what Chomsky didn't consider: that the complexity of novel patterns defying abstraction might be part of the solution. Chomsky was looking at this before the discovery of chaos, so perhaps it's not fair to blame him for not considering it.

But then it becomes a complexity issue. Just how many unique orderings of contexts with useful predictive symmetries are there? Are you ever at an end of finding different orderings of contexts which specify some useful new predictive symmetry or other? The example of computational automata, and patterns which keep getting bigger and more complex eternally, is relevant (distinct from push-down automata, James?)

This is an explanation for the limits of LLMs. They are trying to index chaos. The result will be a tangle, with no observable structure. And it can never be big enough to capture an infinity of patterns.

If you think of it as creative complexity, though, it's a solution, not a problem. All these growing(?) numbers of predictive patterns, {A,C} and friends. More and more of them to find all the time. But contradicting at times: relying on an ordering in one context which might contradict an ordering in another context. All of that is just more "meaning" in the world, waiting to be found. The only thing is you can't "learn" it all at once in one big marathon, and leave it all in one big tangled mess together. You need to "collapse" the data, one way or another, at the time of observation.

Or, I don't know. Maybe I'm missing something. Did any of the compressor language models you examined ever look at anything like this? Or mention any kind of compromise, or optimization, in response to contradictions?

Or maybe I'm wrong. Maybe there is a single best set of abstraction groupings {A,C} somewhere, one which doesn't contradict, and we can label it, and finally present that perfect, symbolic solution for AI. Maybe it will be a nested stack.
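PS. On the "attention, to index context" point above, here is the smallest illustration I can manage of what I mean by indexing predictive patterns on context. It's a toy single head of self-attention in numpy, with random weights, and has nothing to do with any actual LLM: the point is just that the same token vector comes out different depending on its neighbours, which is exactly the degree of freedom a fixed symbolic class {A,C} doesn't have.

    import numpy as np

    def attend(X, Wq, Wk, Wv):
        """One head of scaled dot-product self-attention over rows of X."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[1])
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ V

    rng = np.random.default_rng(0)
    d = 4
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    A, B, D = (rng.normal(size=d) for _ in range(3))

    # The same token A, in two different contexts. Its contextualised
    # vector differs, so one symbol can carry different, even
    # contradictory, predictions depending on what surrounds it.
    in_context_B = attend(np.stack([A, B]), Wq, Wk, Wv)[0]
    in_context_D = attend(np.stack([A, D]), Wq, Wk, Wv)[0]
    print(np.allclose(in_context_B, in_context_D))   # False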