Matt,

Nice breakdown. You've actually worked with language models, which
makes it easier to bring this back to concrete examples.

On Tue, May 28, 2024 at 2:36 AM Matt Mahoney <mattmahone...@gmail.com> wrote:
>
> ...For grammar, AB predicts AB (n-grams),

Yes, this looks like what we call "words". Repeated structure. No
novelty. And nothing internal we can equate to "meaning" either. Only
meaning by association.
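
Just to put a concrete toy on the table (my own sketch in Python, not
Matt's code or any actual compressor), this is all a plain n-gram
model does: count what followed what, and predict the most frequent
continuation.

    from collections import Counter, defaultdict

    # Toy bigram model: count successors, predict the most frequent one.
    corpus = ["A", "B", "A", "B", "A", "B"]

    successors = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        successors[prev][nxt] += 1

    def predict(prev):
        # Repeated structure in, repeated structure out -- no novelty.
        return successors[prev].most_common(1)[0][0] if successors[prev] else None

    print(predict("A"))  # -> "B": AB predicts AB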

> and AB, CB, CD, predicts AD (learning the rule
> {A,C}{B,D}).

This is the interesting one. It actually kind of creates new meaning.
You can think of "meaning" as a way of grouping things that makes
good predictions. And, indeed, those gap-filler sets {A,C} do pull
together sets of words that we intuitively associate with similar
meaning. These are also the sets that the HNet paper identifies as
having "meaning" independent of any fixed pattern. A pattern can be
new, and so long as it makes similar predictions {B,D}, for any set
{B,D...}, {X,Y...}..., we can think of it as having "meaning", based
on the fact that arranging the world that way makes those shared
predictions. (Even moving beyond language, you can say the atoms of a
ball share the meaning of a "ball", based on the fact that they fly
through the air together and bounce off walls together. It's a way of
defining what it "means" to be a "ball".)
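
Here's roughly how I picture that grouping, as a toy sketch (my own
Python, not the HNet algorithm or anyone's compressor): put symbols in
the same class when they share a prediction, and let the class predict
for members that were never seen in that slot.

    from collections import defaultdict

    # Observed pairs: AB, CB, CD. AD is never seen directly.
    pairs = [("A", "B"), ("C", "B"), ("C", "D")]

    successors = defaultdict(set)
    for left, right in pairs:
        successors[left].add(right)

    # Everything that predicts a given right-hand symbol: {A,C} for B.
    classes = defaultdict(set)
    for left, rights in successors.items():
        for right in rights:
            classes[right].add(left)

    def class_predict(left):
        # Pool the predictions of every symbol that shares a context.
        pooled = set()
        for right in successors[left]:
            for other in classes[right]:
                pooled |= successors[other]
        return pooled

    print(class_predict("A"))  # {'B', 'D'}: AB, CB, CD predicts AD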

Now, let's try to get some more detail. How do compressors handle the
case where you get {A,C} on the basis of AB, CB, but the symmetry
fails elsewhere, say you see AX but never CX? Which is to say, the
rules contradict. Sometimes A and C are interchangeable, but not at
other times. You want to trigger the "rule" so
you can capture the symmetries. But you can't make a fixed "rule",
saying {A,C}, because the symmetries only apply to particular sub-sets
of contexts.
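
Continuing the same toy sketch (again my own example data, nothing
from any paper), the contradiction looks like this:

    from collections import defaultdict

    # A and C look interchangeable in front of B, but not in front of X:
    # AX occurs, CX never does.
    pairs = [("A", "B"), ("C", "B"), ("A", "X")]

    successors = defaultdict(set)
    for left, right in pairs:
        successors[left].add(right)

    # A fixed class {A,C}, justified by the shared B context, now
    # overgenerates: it predicts CX, which the data never supports.
    fixed_class = {"A", "C"}
    class_predictions = set().union(*(successors[m] for m in fixed_class))
    print(class_predictions)       # {'B', 'X'}
    print("X" in successors["C"])  # False: the rule and the data disagree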

You get a lot of this in natural language. There are many such shared
context symmetries in language, but they contradict. Or they're
"entangled". You get one by ordering contexts one way, and another by
ordering contexts another way, but you can't get both at once, because
you can't order contexts both ways at once.

I later learned these contradictions were observed even at the level
of phonemes, and this was crucial to Chomsky's argument, back in the
'50s, that grammar could not be "learned". That result essentially
broke the consensus in the field of linguistics, which remains split
into squabbling sub-fields over it to this day. It's why theoretical
linguistics contributes essentially nothing to contemporary machine
learning. Has anyone ever wondered why linguists don't tell us how to
build language models? Even the Chomsky hierarchy cited by James'
DeepMind paper from the "learning" point of view is essentially a
misapprehension of what Chomsky concluded (that observable grammar
contradicts, so formal grammar can't be learned).

One reference I've been able to find on the Web is this one:

"Halle (1959, 1962) and especially Chomsky (1964) subjected
Bloomfieldian phonemics to a devastating critique."

Generative Phonology
Michael Kenstowicz
http://lingphil.mit.edu/papers/kenstowicz/generative_phonology.pdf

But really it's totally ignored. Machine learning does not address
this to my knowledge. I'd welcome references to anyone talking about
its relevance for machine learning.

I'm sure all the compression algorithms submitted to the Hutter Prize
ignore this. Maybe I'm wrong. Have any addressed it? They probably
just regress to some optimal compromise, and don't think about it too
much.
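
For what it's worth, my (possibly wrong) picture of that compromise is
something like the logistic context mixing in the PAQ family: keep
both contradictory models and learn a weighted blend between them
online. A rough sketch of the idea, not anybody's actual entry:

    import math

    def squash(x):   # logistic
        return 1.0 / (1.0 + math.exp(-x))

    def stretch(p):  # inverse logistic
        return math.log(p / (1.0 - p))

    # Two context models that contradict: one says the next bit is very
    # likely 1, the other says it is very likely 0.
    p1, p2 = 0.9, 0.2
    w = [0.5, 0.5]   # mixing weights, learned online
    lr = 0.1

    def mix(p1, p2, w):
        # Weighted compromise in the logistic domain.
        return squash(w[0] * stretch(p1) + w[1] * stretch(p2))

    for actual_bit in [1, 1, 0, 1]:     # whatever the data turns out to be
        err = actual_bit - mix(p1, p2, w)
        w[0] += lr * err * stretch(p1)  # push weight toward the better model
        w[1] += lr * err * stretch(p2)

    print(mix(p1, p2, w))  # a single compromise probability, not a resolution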

If we choose not to ignore this, what do we do? Well, we might try to
"learn" all these contradictions, indexed on context. I think this is
what LLMs do. By accident. That was the big jump, right, "attention",
to index context. Then they just enumerate vast numbers of (an
essentially infinite number of?) predictive patterns in one enormous
training run. That's why they get so large.
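
For concreteness, this is all I mean by "indexing on context": the
standard scaled dot-product attention, sketched here in numpy (a bare
illustration, not any particular LLM). The output for a token is a
weighting over the whole context, so the same token gets a different
representation in every different context.

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention: each position re-weights the
        # whole context, so its output depends on everything around it.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    rng = np.random.default_rng(0)
    d = 8
    context_a = rng.normal(size=(5, d))  # one ordering of a context...
    context_b = rng.normal(size=(5, d))  # ...and a different one

    q = rng.normal(size=(1, d))          # the same token as a query
    print(attention(q, context_a, context_a)[0][:3])  # different outputs
    print(attention(q, context_b, context_b)[0][:3])  # for the same token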

No one knows, or even wonders, why neural nets work for this and
symbols don't (viz. the topic post of this thread). But this will be
the reason.

In practice LLMs learn predictive patterns and index them on context
using "attention", and it turns out there are a lot of those different
predictive "embeddings", indexed on context. There is no theory.
Everything is a surprise. But if you go back in the literature, these
results about contradictions suggest why it might be so. And the
conclusion is still either Chomsky's: language can't be learned, and
consistent rules exist but must be innate. Or, what Chomsky didn't
consider: the complexity of novel patterns defying abstraction might
be part of the solution. Chaos hadn't been discovered when Chomsky was
looking at this, so perhaps it's not fair to blame him for not
considering it.

But then it becomes a complexity issue. Just how many unique orderings
of contexts with useful predictive symmetries are there? Are you ever
at an end of finding different orderings of contexts that specify
some useful new predictive symmetry or other? The example of
computational automata, with patterns that keep getting bigger and
more complex, is relevant (distinct from pushdown automata, James?).
This is an explanation for the limits of LLMs. They
are trying to index chaos. The result will be a tangle, with no
observable structure. And it can never be big enough to capture an
infinity of patterns.
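
The kind of automaton I have in mind is something like the elementary
cellular automata, e.g. Rule 110 (my example, not one from James'
paper): a trivial fixed rule whose output keeps producing new
structure instead of settling into anything you could abstract once
and for all.

    # Elementary cellular automaton, Rule 110: a fixed local rule whose
    # pattern keeps growing in complexity rather than repeating.
    RULE = 110
    WIDTH, STEPS = 64, 16

    row = [0] * WIDTH
    row[WIDTH // 2] = 1

    for _ in range(STEPS):
        print("".join("#" if c else "." for c in row))
        row = [(RULE >> ((row[(i - 1) % WIDTH] << 2) |
                         (row[i] << 1) |
                         row[(i + 1) % WIDTH])) & 1
               for i in range(WIDTH)]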

If you think of it as creative complexity, though, it's a solution,
not a problem. All these growing(?) numbers of predictive patterns,
{A,C} and friends. More and more of them to find all the time. But
contradicting at times, relying on an ordering in one context that
might conflict with an ordering in another. All of it is just more
"meaning" in the world, waiting to be found. The only
thing is you can't "learn" it all at once in one big marathon, and
leave it all in one big tangled mess together. You need to "collapse"
the data, one way or another, at the time of observation.

Or, I don't know. Maybe I'm missing something. Did any of the
compressor language models you examined ever look at anything like
this? Or mention any kind of compromise, optimization, in response to
contradictions?

Or maybe I'm wrong. Maybe there is a single best set of abstraction
groupings {A,C} somewhere, one which doesn't contradict, and we can
label it, and finally present that perfect, symbolic, solution for AI.
Maybe it will be a nested stack.
