*The number of "tokens" (words or parts of words) used to train LLMs is 100
times larger than it was in 2020, the largest are now using tens of
trillions.  if you only consider text then the entire Internet only
contains about 3,100 trillion tokens. The amount of text LLMs train on is
doubling every year but the amount of human generated text on the Internet
is only growing at about 10% a year, if that trend continues AIs will run
out of text somewhere around 2028.  Does that mean AI progress is about to
hit a wall? I don't think so for the following reasons:*
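
*(A rough back-of-the-envelope version of that extrapolation is sketched
below. The 3,100 trillion token total, the yearly doubling of training data,
and the ~10% growth in new text come from the paragraph above; the ~30
trillion token starting point, the 2024 starting year, and the guess that
only about a tenth of the Internet's text is actually usable for training
are illustrative assumptions, not established figures.)*

```python
# Back-of-the-envelope projection of when training-data demand could outgrow
# the stock of usable human-written text.  The 3,100 trillion token total,
# the yearly doubling, and the ~10% growth in new text are the figures
# quoted above; everything else is an illustrative guess.

training_tokens = 30e12        # assumed: largest models today train on ~30 trillion tokens
internet_tokens = 3100e12      # ~3,100 trillion tokens of text on the entire Internet
usable_fraction = 0.10         # assumed: only ~10% is high-quality, deduplicated, usable text
demand_growth = 2.0            # training data roughly doubles every year
supply_growth = 1.10           # human-generated text grows ~10% per year

supply = internet_tokens * usable_fraction
year = 2024                    # assumed starting year for the projection
while training_tokens < supply:
    training_tokens *= demand_growth
    supply *= supply_growth
    year += 1

print(f"Under these assumptions, demand overtakes supply around {year}")
```

*Nudging the usable fraction or the starting size shifts the crossover by a
few years in either direction, but not by decades.*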

*For one thing, because of improvements in algorithms, the computing power
needed for a Large Language Model to achieve a given level of performance
has halved about every 8 months.*

*ALGORITHMIC PROGRESS IN LANGUAGE MODELS* <https://arxiv.org/pdf/2403.05812>
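
*(To make that concrete, here is the simple arithmetic the 8-month halving
implies; the time spans chosen are just for illustration.)*

```python
# What "compute needed for a given performance halves every 8 months" implies.
# The 8-month halving period is the paper's headline figure; the time spans
# below are just illustrative.

halving_months = 8

def compute_fraction(months):
    """Fraction of today's compute needed for the same performance after `months`."""
    return 0.5 ** (months / halving_months)

for years in (1, 2, 4):
    frac = compute_fraction(12 * years)
    print(f"After {years} year(s): ~{1 / frac:.0f}x less compute for the same performance")
```

*So even with a fixed data budget, four years of algorithmic progress alone
buys roughly a 60-fold effective improvement.*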


*And computer chips specialized for AI rather than general computing, like
those made by Nvidia and other companies, are getting faster even more
rapidly than Moore's Law. Also, specialized data sets, such as astronomical
and biological data, are growing much more quickly than text; that's how
AIs got so good at predicting how proteins fold up.*

*And there is vastly more information available if AIs are trained on other
types of data besides text, and some AIs are already being trained on
unlabeled images and videos. Yann LeCun, chief AI scientist at Meta, said
that "although the 10^13 tokens used to train an LLM sounds like a lot (it
would take a human 170,000 years to read that much), a 4-year-old child
has absorbed a volume of data 50 times greater than that just by looking at
objects during his waking hours. We’re never going to get to human-level AI
by just training on language, that’s just not happening".*
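
*(A quick sanity check on those two numbers, using assumed rates: a reading
speed of 250 words a minute for 8 hours a day, about 16,000 waking hours by
age four, and roughly 2 x 10^7 bytes per second flowing through the two
optic nerves. None of those rates are LeCun's exact figures, but the
arithmetic lands in the same ballpark as the 170,000 years and the 50-fold
comparison.)*

```python
# Order-of-magnitude check on the two figures in the quote above.  Every rate
# here (reading speed, waking hours, optic-nerve bandwidth, bytes per token)
# is an illustrative assumption, not LeCun's exact number.

training_tokens = 1e13                    # ~10^13 tokens, as in the quote

# --- how long a human would need to read that much ---
tokens_per_word = 1.3                     # assumed rough tokens-per-word ratio
words_per_minute = 250                    # assumed adult reading speed
hours_per_day = 8                         # assumed hours of reading per day
minutes = training_tokens / tokens_per_word / words_per_minute
years_to_read = minutes / (60 * hours_per_day * 365)
print(f"Years to read ~10^13 tokens: ~{years_to_read:,.0f}")

# --- visual data a child absorbs by age four ---
waking_hours_by_age_4 = 16_000            # assumed waking hours in the first four years
visual_bytes_per_second = 2e7             # assumed total bandwidth of both optic nerves
visual_bytes = waking_hours_by_age_4 * 3600 * visual_bytes_per_second
text_bytes = training_tokens * 2          # assumed ~2 bytes per token
print(f"Visual data vs. training text: ~{visual_bytes / text_bytes:.0f}x more")
```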

*And then there's synthetic data. AlphaGeometry was trained to solve
geometry problems using 100 million computer-generated synthetic examples
with no human demonstrations, and it ended up solving difficult olympiad
geometry problems nearly as well as an average International Mathematical
Olympiad gold medallist.*

*Solving olympiad geometry without human demonstrations*
<https://www.nature.com/articles/s41586-023-06747-5>

*AI researchers are also starting to change their strategy and have their
models reread the training set multiple times. Because training is
statistical, repeated passes over the same data still improve performance;
the paper below finds that a few repetitions are worth nearly as much as
fresh data.*


*Scaling Data-Constrained Language Models*
<https://arxiv.org/pdf/2305.16264>
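
*(A minimal sketch of what "rereading" means in practice: instead of
streaming each token once, the same fixed corpus is looped over for several
epochs. The toy model and random data below are placeholders; the pattern is
the point.)*

```python
# Toy illustration of training for multiple epochs ("rereading") on a fixed
# dataset when no new data is available.  The model and data are placeholders.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(512, 16)                 # stand-in for a fixed, finite training set
y = torch.randn(512, 1)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

epochs = 4                               # a few repetitions, roughly what the paper finds still helps
for epoch in range(epochs):              # each epoch is one full "rereading" of the data
    perm = torch.randperm(len(X))        # reshuffle the same data on every pass
    for i in range(0, len(X), 64):
        idx = perm[i:i + 64]
        loss = loss_fn(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"epoch {epoch + 1}: loss {loss.item():.4f}")
```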


*Andy Zou at Carnegie Mellon University says "once an AI has got a
foundational knowledge base that’s probably greater than any single person
could have, it no longer needs more data to get smarter. It just needs to
sit and think. I think we’re probably pretty close to that point."*

*John K Clark    See what's on my new list at  Extropolis
<https://groups.google.com/g/extropolis>*
