Considering your reply, I would like to clarify a few points from my perspective. First, I'm not a skeptic who wants AI to fail; on the contrary, I see value in developing more rigorous benchmarks that push AI beyond narrow optimization. While it's true that scaling a model's compute often improves performance (e.g., o3 going from 75.7% to 87.5%), that alone doesn't prove that we've achieved "fundamental AGI." In the history of machine learning, many benchmarks have been surpassed by throwing more resources at them, yet the same models often fail when faced with genuinely novel tasks.
Regarding the ARC-AGI-1 Public Training set: using it is obviously not "cheating," and the effort is impressive, but when a system is trained on data very similar to the test set, there is a risk of overfitting rather than a demonstration of genuine adaptability. Real-world intelligence typically shows up when an agent can handle new, unseen challenges without relying on repeated exposure to similar ones. Human children, for example, often solve a large portion of the ARC puzzles without specialized training or hyperparameter tuning. I've personally seen kids under ten handle roughly 80 to upwards of 90% of the daily "play" tasks on the ARC site (besides acing the six problems on the landing page) once they grasp the basic rule of finding the rule, which suggests these particular puzzles might not be the best proxy for broad or "general" intelligence. They are quite fun, actually.

As for the claim that François Chollet "moved the goalpost" once AI systems approached the 75% mark: it's common in AI research for benchmarks to evolve, precisely because scoring high on an older test doesn't necessarily reflect deep, generalizable reasoning. The purpose of creating tougher challenges, such as the upcoming ARC-AGI-2, is not to deny progress but to ensure that models actually show robust capabilities rather than specialized or memorized skills. Humans may also find these new tasks difficult, but if the tests do a better job of measuring multi-domain adaptability, then both AI and human performance can be evaluated in a more meaningful way.

In short, I *do* share your excitement about recent strides in AI and welcome the idea that we should keep updating our tests as they become outmoded. I just want those tests to demand broader, more convincing reasoning and more novel problem-solving, rather than rewarding data a model has already seen or memorized, millions spent on compute, and on-the-fly "tuning" by developers.
We're all on the same page in wanting to drive AI forward, and robust, carefully designed benchmarks help us see how we're approaching the goal of increasingly "general" intelligence for practical purposes.

On Saturday, December 21, 2024 at 10:01:40 PM UTC+8 John Clark wrote:

> On Sat, Dec 21, 2024 at 5:14 AM PGC <[email protected]> wrote:
>
>> there is a statement that when the system is scaled up dramatically (172 times more compute resources), it manages to score 87.5%. The difference between the 75.7% result and 87.5% result is thus explained by a large disparity in the computational budget used for training or inference.
>
> Yes, and if O3 had been given even more time it would've scored even higher, to me that indicates that the fundamental problem of AGI has been solved, and now it's just a question of optimizing things to make them more efficient. And if history is any guide that won't take long, today much smaller more compute efficient models can equal the performance of huge compute hungry state of the art models of just a few months ago.
>
> It's bizarre to realize that just a month and a half ago the majority of people in the USA thought the major problems facing the country were the trivial issues of illegal immigration and transsexual bathrooms, and that's why Donald Trump will be the most powerful hominid on earth during the most critical period in the entire history of his Homo sapiens species.
>
>> the model was explicitly trained on the very same data (or a substantial subset of it) against which it was later tested. The text itself says: "trained on the ARC-AGI-1 Public Training set"
>
> I don't see how the fact that O3 was trained on the ARC-AGI-1 Public Training set could be considered cheating when the ARC people are the ones who released the ARC-AGI-1 Public Training set for the precise purpose, as its name indicates, of training AIs.
>
>> Beyond the bare mention of "trained on the ARC-AGI-1 Public Training set," there is an implied process of repeated tuning or hyperparameter searches.
>
> Yes, because that's what "training an AI" means!
>
>> children's ability to adapt to novel tasks and generalize without being artificially "trained" on the same data is a key part of the skepticism:
>
> Human children need to go to school, so do newly born childish AIs.
>
>> a quote from the blog: "Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet."
>
> The average human taking the ARC test will receive a score of about 50%, some very exceptionally talented humans can get a score of around 80%. About one year ago, back in the stone age when the best AIs only scored about 2% on the ARC test, Francois Chollet, the author of the above quote and the originator of the ARC test, said that if a computer got a score above 75% he would consider it an AGI. But now that O3 can get a score of 87.5% if it thinks for a long time and 75.7% if it is only allowed a short time to think, Chollet has done what all AI skeptics have done since the 1960s: he has moved the goal post.
>
>> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3,
>
> Yes, I'm certain computers will find it more difficult to get a high score on ARC-AGI-2, but human beings will find this new test to be even more difficult than computers do. Today's benchmarks are becoming obsolete because computers are rapidly maxing them out, that's why we need ARC-AGI-2, it will be very useful in comparing one AGI to another AGI.
>
> John K Clark    See what's on my new list at Extropolis <https://groups.google.com/g/extropolis>

--
You received this message because you are subscribed to the Google Groups "Everything List" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion visit https://groups.google.com/d/msgid/everything-list/fec94105-c164-408f-8cf8-f3c2c1972178n%40googlegroups.com.

