Magic!

On Thursday, 12 December 2024 at 20:00:58 UTC+2 John Clark wrote:
> The number of "tokens" (words or parts of words) used to train LLMs is 100 times larger than it was in 2020; the largest models now use tens of trillions. If you only consider text, the entire Internet contains only about 3,100 trillion tokens. The amount of text LLMs train on is doubling every year, but the amount of human-generated text on the Internet is growing at only about 10% a year; if that trend continues, AIs will run out of text somewhere around 2028. Does that mean AI progress is about to hit a wall? I don't think so, for the following reasons:
>
> For one thing, because of improvements in algorithms, the computing power needed for a Large Language Model to achieve the same performance has halved about every 8 months.
>
> ALGORITHMIC PROGRESS IN LANGUAGE MODELS
> <https://arxiv.org/pdf/2403.05812>
>
> And computer chips specialized for AI rather than general computing, like those made by Nvidia and other companies, are getting faster even more rapidly than Moore's Law. Also, specialized data sets, such as astronomical and biological data, are growing much more quickly than text is; that's how AIs got so good at predicting how proteins fold up.
>
> And there is vastly more information available if AIs are trained on other types of data besides text, and some AIs are already being trained on unlabeled images and videos. Yann LeCun, chief AI scientist at Meta, said that "although the 10^13 tokens used to train an LLM sounds like a lot (it would take a human 170,000 years to read that much), a 4-year-old child has absorbed a volume of data 50 times greater than that just by looking at objects during his waking hours. We're never going to get to human-level AI by just training on language, that's just not happening".
>
> And then there's synthetic data. AlphaGeometry was trained to solve geometry problems using 100 million computer-generated synthetic examples with no human demonstrations, and it ended up being as good at solving difficult geometry problems as the very best high school students in the entire nation.
>
> Solving olympiad geometry without human demonstrations
> <https://www.nature.com/articles/s41586-023-06747-5>
>
> AI researchers are also starting to change their strategy and have their AIs reread their training set many times; because AIs operate in a statistical way, rereading improves performance.
>
> Scaling Data-Constrained Language Models
> <https://arxiv.org/pdf/2305.16264>
>
> Andy Zou at Carnegie Mellon University says "once an AI has got a foundational knowledge base that's probably greater than any single person could have, it no longer needs more data to get smarter. It just needs to sit and think. I think we're probably pretty close to that point."
>
> John K Clark    See what's on my new list at Extropolis
> <https://groups.google.com/g/extropolis>
> nps
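
Out of curiosity, here is a back-of-envelope check of the run-out-of-text arithmetic quoted above. The starting stock of roughly 50 trillion training tokens and the 2024 baseline year are my own guesses, not figures from John's post, and the answer is sensitive to them (and to how much of the 3,100 trillion tokens is actually usable); with these guesses the crossover lands in the early 2030s, the same ballpark as the ~2028 estimate:

# Back-of-envelope: when does training-data demand (doubling every year)
# overtake the supply of human-written text (~3,100 trillion tokens,
# growing ~10% per year)?  The starting demand and baseline year are guesses.
demand = 50e12       # assumed tokens used to train a frontier model today (a guess)
supply = 3100e12     # total text on the Internet, per the post
year = 2024          # assumed baseline year
while demand < supply:
    demand *= 2.0    # training data doubles every year (per the post)
    supply *= 1.10   # human-generated text grows ~10% per year (per the post)
    year += 1
print("demand overtakes supply around", year)   # 2031 with these guesses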
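
And LeCun's 170,000-year figure is easy to reproduce with ordinary assumptions about reading; the 250 words per minute, 8 hours a day, and ~0.75 words per token used below are my assumptions, not his:

# Rough check of "10^13 tokens would take a human 170,000 years to read".
tokens = 1e13
words = tokens * 0.75                  # assume ~0.75 English words per token
words_per_year = 250 * 60 * 8 * 365    # 250 wpm, 8 hours a day, every day
print(words / words_per_year)          # about 171,000 years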

