On Wed, May 29, 2024 at 9:37 AM Matt Mahoney <mattmahone...@gmail.com> wrote:
>
> On Tue, May 28, 2024 at 7:46 AM Rob Freeman <chaotic.langu...@gmail.com> 
> wrote:
>
> > Now, let's try to get some more detail. How do compressors handle the
> > case where you get {A,C} on the basis of AB, CB, but you don't get,
> > say AX, CX? Which is to say, the rules contradict.
>
> Compressors handle contradictory predictions by averaging them

That's what I thought.
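
For concreteness, here's what I take "averaging" to mean, as a toy sketch (my own construction, invented numbers, not any particular compressor's code): two context models give contradictory probabilities for the next symbol, and the mixer just averages them, weighted mixing being the PAQ-style refinement of the same idea.

# Toy illustration: two context models disagree about the next symbol
# and the "compressor" resolves the contradiction by averaging.
# Purely hypothetical numbers, for illustration only.

def mix(predictions, weights=None):
    """Average per-symbol probabilities from several models."""
    symbols = predictions[0].keys()
    n = len(predictions)
    if weights is None:
        weights = [1.0 / n] * n
    mixed = {s: sum(w * p[s] for w, p in zip(weights, predictions)) for s in symbols}
    total = sum(mixed.values())
    return {s: v / total for s, v in mixed.items()}

# Model 1 (learned from contexts like AB, CB): expects B next.
model_1 = {"B": 0.8, "X": 0.2}
# Model 2 (learned from contexts like AX): expects X, contradicting model 1.
model_2 = {"B": 0.3, "X": 0.7}

print(mix([model_1, model_2]))   # -> roughly {"B": 0.55, "X": 0.45}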

> > "Halle (1959, 1962) and especially Chomsky (1964) subjected
> > Bloomfieldian phonemics to a devastating critique."
> >
> > Generative Phonology
> > Michael Kenstowicz
> > http://lingphil.mit.edu/papers/kenstowicz/generative_phonology.pdf
> >
> > But really it's totally ignored. Machine learning does not address
> > this to my knowledge. I'd welcome references to anyone talking about
> > its relevance for machine learning.
>
> Phonology is mostly irrelevant to text prediction.

The point was that the critique invalidated the method of learning linguistic
structure by distributional analysis, at any level. If the rules it learns for
phonemes contradict, what doesn't contradict?
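
To make the contradiction concrete, here's a toy sketch of the distributional procedure on your own example, reading it as: A and C share context B, but substitution fails in context X (my construction, invented data):

# Toy distributional analysis: substitutability classes from shared contexts.
# Hypothetical corpus; the only point is that the learned classes contradict.
from collections import defaultdict

corpus = ["AB", "CB", "AX"]          # AB, CB observed; AX observed, CX not

contexts = defaultdict(set)          # symbol -> set of right-hand contexts
for pair in corpus:
    contexts[pair[0]].add(pair[1])

# The distributional rule groups A and C, because both occur before B...
print(contexts["A"] & contexts["C"])   # {'B'}  -> class {A, C}
# ...and the same rule splits them, because A occurs before X and C does not.
print(contexts["A"] - contexts["C"])   # {'X'}  -> contradicts the grouping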

Which is a pity, because we still don't have a clue what governs language
structure. The best we've been able to come up with is crude hacks: dragging a
chunk of important context behind like a ball and chain in an LSTM, or
multiplexing pre-guessed "tokens" together in a big matrix with
"self-attention".

Anyway, your lack of interest doesn't invalidate my claim: this result,
pointing to contradictions produced by distributional-analysis learning
procedures for natural language, is totally ignored by current machine
learning, which, implicitly or otherwise, uses those same distributional-analysis
learning procedures.

> Language evolved to be learnable on neural networks faster than our
> brains evolved to learn language. So understanding our algorithm is
> important.
>
> Hutter prize entrants have to prebuild a lot of the model because
> computation is severely constrained (50 hours in a single thread with
> 10 GB memory). That includes a prebuilt dictionary. The human brain
> takes 20 years to learn language on a 10 petaflop, 1 petabyte neural
> network. So we are asking quite a bit.

Neural networks may finally have come close to human performance at
prediction. But prediction is a problem where you can cover a multitude of
sins with raw memory, something at which computers trivially exceed humans by
as many orders of magnitude as you can stack server farms. You can just
remember each contradiction, together with the context which selects it. No
superior algorithm required, and certainly none in evidence. (Chinese makes
similar trade-offs, swapping internal mnemonic sound structure within tokens
for prodigious memory requirements for the tokens themselves.)

Comparing 10 GB with 1 petabyte seems disingenuous. I strongly doubt any human
can recall as much as 10 GB of text. (All of Wikipedia is currently ~22 GB
compressed, without media. Even reading it all is estimated at 47 years,
allowing 8 hours of sleep a night:
https://www.reddit.com/r/theydidthemath/comments/80fi3w/self_how_long_would_it_take_to_read_all_of/.
So forget 20 years to learn it; it would take about 20 years just to read the
10 GB of memory you give Prize entrants.)

But I would argue our prediction algorithms totally fail to do any sort of job
with language structure. Whereas you say babies start to structure language
before they can walk? (Walking being something else computers still have
problems with.) And far from stopping at word segmentation, babies go on to
build quite complex structures, including new ones, all the time.
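
To spell out that arithmetic, using only the figures above:

# Back-of-envelope, using the figures quoted above (rough, for scale only).
wikipedia_gb = 22          # compressed text, no media
years_to_read = 47         # the Reddit estimate, allowing 8 hours of sleep a night
hutter_memory_gb = 10      # memory budget given to Hutter Prize entrants

years_per_gb = years_to_read / wikipedia_gb
print(years_per_gb * hutter_memory_gb)   # ~21 years just to read 10 GB of text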

Current models do nothing with structure: not at a human baby's 8-10 months of
"data years", and not at 77 years (680k hours of audio to train "Whisper"
comes to roughly 77 years:
https://www.thealgorithmicbridge.com/p/8-features-make-openais-whisper-the.
Perhaps some phoneme structure might help there...) The only structure is
"tokens". I don't even think current algorithms do max entropy to find words.
They just start out with "tokens", guessed ahead of training. Here's Karpathy
and LeCun talking about it:

Yann LeCun (@ylecun), Feb 21, replying to Andrej Karpathy:

"Text tokenization is almost as much of an abomination for text as it is for
images. Not mentioning video."

And the Karpathy post he was replying to:

"We will see that a lot of weird behaviors and problems of LLMs actually trace
back to tokenization. We'll go through a number of these issues, discuss why
tokenization is at fault, and why someone out there ideally finds a way to
delete this stage entirely."

https://x.com/ylecun/status/1760315812345176343
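
For what it's worth, here's a toy sketch of the kind of guessing I mean: a byte-pair-style merge, done on raw counts before the model ever sees the data as a prediction problem (my own construction, hypothetical corpus, not any production tokenizer):

# Toy BPE-style merge: find the most frequent adjacent pair and fuse it into
# a new "token". Repeated ahead of training, on counts alone, with no notion
# of structure beyond frequency. Hypothetical corpus for illustration.
from collections import Counter

def most_frequent_pair(sequences):
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(sequences, pair):
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("the cat sat"), list("the mat")]
pair = most_frequent_pair(corpus)     # ('a', 't') on this toy corpus
print(merge_pair(corpus, pair))       # 'at' becomes a single pre-guessed "token"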

By the way, talking about words: that's another thing which seems to have
contradictory structure in humans. For example, native Chinese speakers agree
on what constitutes a "word" less than 70% of the time:

"Sproat et. al. (1996) give empirical results showing that native
speakers of Chinese frequently agree on the correct segmentation in
less than 70% of the cases."
https://s3.amazonaws.com/tm-town-nlp-resources/ch2.pdf

I guess that will be:

Sproat, Richard W., Chilin Shih, William Gale, and Nancy Chang. 1996.
A stochastic finite-state word-segmentation algorithm for Chinese.
Computational Linguistics, 22(3):377–404.
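
The measurement behind a figure like that is presumably just the fraction of cases two native speakers segment identically. A toy sketch, with English strings standing in and invented judgements (not Sproat et al.'s data):

# Fraction of strings two annotators segment the same way.
# Invented stand-in data, for illustration only.
annotator_1 = {"thedogran": ["the", "dog", "ran"],
               "icecreamcone": ["ice", "cream", "cone"]}
annotator_2 = {"thedogran": ["the", "dog", "ran"],
               "icecreamcone": ["icecream", "cone"]}

agree = sum(annotator_1[s] == annotator_2[s] for s in annotator_1)
print(agree / len(annotator_1))   # 0.5 agreement on this toy pair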

But the interesting one is relational encodings, of the {A,C} type you
described earlier. Not least because that is the one which can be
associated with meaning, including new meaning.

And you say that is dealt with in your experience by "averaging". I
guessed as much.

I suspect LLMs do better by indexing contradictory groupings on
context. Which blows out the memory. And is sterile, fixed at time of
training. And locked to the original choice of "tokens". And locked to
the length of the original "context window" in terms of that original
choice of "tokens". And obscures any "meaning" interpretation. But
surely indexing contradictory groupings on context helps to predict a
little better.
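
Here's a crude sketch of what I mean by "indexing contradictory groupings on context" (my own toy construction, invented numbers, not a claim about any particular architecture): keep both contradictory rules, and let a similarity weight over the current context decide which to trust.

# Toy "indexing on context": keep both contradictory rules and weight them
# by similarity to the current context, attention-style. Invented numbers.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

# Two stored (context, prediction) entries that contradict one another.
memory = [
    {"context": [1.0, 0.0], "prediction": {"B": 0.9, "X": 0.1}},   # "A before B" cases
    {"context": [0.0, 1.0], "prediction": {"B": 0.1, "X": 0.9}},   # "A before X" cases
]

def predict(query):
    # Dot-product similarity to each stored context, softmaxed into weights.
    weights = softmax([sum(q * c for q, c in zip(query, m["context"])) for m in memory])
    return {s: sum(w * m["prediction"][s] for w, m in zip(weights, memory))
            for s in memory[0]["prediction"]}

print(predict([1.0, 0.0]))   # context resembles the first case -> B favoured
print(predict([0.5, 0.5]))   # ambiguous context -> back to something like averaging

Note the cost: every contradiction has to be stored with the context which selects it, which is exactly the memory blow-out, and the fixity at training time, I mean.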
