Magic!

On Thursday, 12 December 2024 at 20:00:58 UTC+2 John Clark wrote:
> The number of "tokens" (words or parts of words) used to train LLMs is 100 times larger than it was in 2020; the largest models now use tens of trillions. If you only consider text, the entire Internet contains only about 3,100 trillion tokens. The amount of text LLMs train on is doubling every year, but the amount of human-generated text on the Internet is growing at only about 10% a year; if that trend continues, AIs will run out of text somewhere around 2028. Does that mean AI progress is about to hit a wall? I don't think so, for the following reasons:
>
> For one thing, because of improvements in algorithms, the computing power needed for a Large Language Model to achieve the same performance has halved about every 8 months.
>
> ALGORITHMIC PROGRESS IN LANGUAGE MODELS
> <https://arxiv.org/pdf/2403.05812>
>
> And computer chips specialized for AI rather than general computing, like those made by Nvidia and other companies, are getting faster even more rapidly than Moore's Law. Also, specialized data sets, such as astronomical and biological data, are growing much more quickly than text is; that's how AIs got so good at predicting how proteins fold up.
>
> And there is vastly more information available if AIs are trained on other types of data besides text, and some AIs are already being trained on unlabeled images and videos. Yann LeCun, chief AI scientist at Meta, said that "although the 10^13 tokens used to train an LLM sounds like a lot (it would take a human 170,000 years to read that much), a 4-year-old child has absorbed a volume of data 50 times greater than that just by looking at objects during his waking hours. We're never going to get to human-level AI by just training on language, that's just not happening".
>
> And then there's synthetic data. AlphaGeometry was trained to solve geometry problems using 100 million computer-generated synthetic examples with no human demonstrations, and it ended up being as good at solving difficult geometry problems as the very best high school students in the entire nation.
>
> Solving olympiad geometry without human demonstrations
> <https://www.nature.com/articles/s41586-023-06747-5>
>
> AI researchers are also starting to change their strategy and have their AIs reread their training set many times; because AIs operate in a statistical way, rereading improves performance.
>
> Scaling Data-Constrained Language Models
> <https://arxiv.org/pdf/2305.16264>
>
> Andy Zou at Carnegie Mellon University says "once an AI has got a foundational knowledge base that's probably greater than any single person could have, it no longer needs more data to get smarter. It just needs to sit and think. I think we're probably pretty close to that point."
>
> John K Clark    See what's on my new list at Extropolis
> <https://groups.google.com/g/extropolis>
> nps
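
Out of curiosity, here is a back-of-envelope check of the run-out-of-text arithmetic quoted above. The starting stock of roughly 50 trillion training tokens and the 2024 baseline year are my own guesses, not figures from John's post, and the answer is sensitive to them (and to how much of the 3,100 trillion tokens is actually usable); with these guesses the crossover lands in the early 2030s, the same ballpark as the ~2028 estimate:

# Back-of-envelope: when does training-data demand (doubling every year)
# overtake the supply of human-written text (~3,100 trillion tokens,
# growing ~10% per year)?  The starting demand and baseline year are guesses.
demand = 50e12       # assumed tokens used to train a frontier model today (a guess)
supply = 3100e12     # total text on the Internet, per the post
year = 2024          # assumed baseline year
while demand < supply:
    demand *= 2.0    # training data doubles every year (per the post)
    supply *= 1.10   # human-generated text grows ~10% per year (per the post)
    year += 1
print("demand overtakes supply around", year)   # 2031 with these guesses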
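
And LeCun's 170,000-year figure is easy to reproduce with ordinary assumptions about reading; the 250 words per minute, 8 hours a day, and ~0.75 words per token used below are my assumptions, not his:

# Rough check of "10^13 tokens would take a human 170,000 years to read".
tokens = 1e13
words = tokens * 0.75                  # assume ~0.75 English words per token
words_per_year = 250 * 60 * 8 * 365    # 250 wpm, 8 hours a day, every day
print(words / words_per_year)          # about 171,000 years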

