Considering your reply, I would like to clarify a few points from my 
perspective. First, I’m not a skeptic who wants AI to fail; on the 
contrary, I see value in developing more rigorous benchmarks that push AI 
beyond narrow optimization. While it’s true that scaling a model’s compute 
often improves performance—e.g., O3 going from 75.7% to 87.5%—that alone 
doesn’t prove that we’ve achieved “fundamental AGI.” In machine learning 
history, many benchmarks have been surpassed by throwing more resources at 
them, yet models often fail when faced with novel tasks.

Regarding the ARC-AGI-1 Public Training set, using it is obviously not 
entirely “cheating,” and the effort is impressive, but when a system is 
trained on data very similar to the test set, there is a risk of 
overfitting rather than demonstrating genuine adaptability. Real-world 
intelligence typically shows up when an agent can handle new, unseen 
challenges without relying on repeated exposure to similar ones. Human 
children, for example, often solve a large portion of the ARC puzzles 
without specialized training or hyperparameter tuning. I’ve personally seen 
kids under age ten handle roughly 80 to upwards of 90% of the daily “play” 
tasks on the ARC site (besides acing the 6 problems on the landing page) 
once they grasp the basic rule of finding the rule, which suggests these 
particular puzzles might not be the best proxy for broad or “general” 
intelligence. They are quite fun, actually. 

As for the claim that François Chollet “moved the goalpost” once AI systems 
approached the 75% mark, it’s common in AI research for benchmarks to 
evolve precisely because scoring high on an older test doesn’t necessarily 
reflect deep, generalizable reasoning. The purpose of creating tougher 
challenges, such as the upcoming ARC-AGI-2, is not to deny progress but to 
ensure that models actually show robust capabilities rather than 
specialized or memorized skills. Humans may also find these new tasks 
difficult, but if the tests do a better job of measuring multi-domain 
adaptability, then both AI and human performance can be evaluated in a more 
meaningful way.

In short, I *do* share your excitement about recent strides in AI and 
welcome the idea that we should keep updating our tests as they become 
outmoded. I just want those tests to demand broader, more convincing 
reasoning and more novel problem-solving, rather than rewarding data a 
model has already seen or memorized, millions spent on compute, or 
on-the-fly “tuning” by developers. We’re all on the same page in wanting to 
drive AI forward, and robust, carefully designed benchmarks help us see how 
close we are to the goal of increasingly “general” intelligence for 
practical purposes. 

On Saturday, December 21, 2024 at 10:01:40 PM UTC+8 John Clark wrote:

> On Sat, Dec 21, 2024 at 5:14 AM PGC <[email protected]> wrote:
>
>> * > there is a statement that when the system is scaled up dramatically 
>> (172 times more compute resources), it manages to score 87.5%. The 
>> difference between the 75.7% result and 87.5% result is thus explained by a 
>> large disparity in the computational budget used for training or inference.*
>>
> *Yes, and if O3 had been given even more time it would've scored even 
> higher, to me that indicates that the fundamental problem of AGI has been 
> solved, and now it's just a question of optimizing things to make them more 
> efficient. And if history is any guide that won't take long, today much 
> smaller more compute efficient models can equal the performance of huge 
> compute hungry state of the art models of just a few months ago. *
>
> *It's bizarre to realize that just a month and a half ago the majority of 
> people in the USA thought the major problems facing the country were the 
> trivial issues of illegal immigration and transsexual bathrooms, and that's 
> why Donald Trump will be the most powerful hominid on earth during the most 
> critical period in the entire history of his Homo sapiens species.  *
>  
>>
>> *> the model was explicitly trained on the very same data (or a 
>> substantial subset of it) against which it was later tested. The text 
>> itself says: “trained on the ARC-AGI-1 Public Training set”*
>>
>
> *I don't see how the fact that O3 was trained on the ARC-AGI-1 Public 
> Training set could be considered cheating when the ARC people are the ones 
> who released the ARC-AGI-1 Public Training set for the precise purpose, as 
> its name indicates, of training AIs.*
>
> *> Beyond the bare mention of “trained on the ARC-AGI-1 Public Training 
>> set,” there is an implied process of repeated tuning or hyperparameter 
>> searches.*
>>
> *Yes, because that's what "training an AI" means!  *
>
> *> children’s ability to adapt to novel tasks and generalize without being 
>> artificially “trained” on the same data is a key part of the skepticism:*
>>
>
> *Human children need to go to school, so do newly born childish AIs. *
>
>  
>
>>
>> *> a quote from the blog:"Passing ARC-AGI does not equate to achieving 
>> AGI, and, as a matter of fact, I don't think o3 is AGI yet." *
>
>
> *The average human taking the ARC test will receive a score of about 50%, 
> some very exceptionally talented humans can get a score of around 80%. 
> About one year ago, back in the stone age when the best AI's only scored 
> about 2% on the ARC test, Francois Chollet, the author of the above 
> quote and the originator of the ARC test, said that if a computer got a 
> score above 75% he would consider it an AGI. But now that O3 can get a 
> score of 87.5% if it thinks for a long time and 75.7% if it is only allowed 
> a short time to think, Chollet has done what all AI skeptics have done 
> since the 1960s, he has moved the goal post. * 
>  
>
>> *> Furthermore, early data points suggest that the upcoming ARC-AGI-2 
>> benchmark will still pose a significant challenge to o3,*
>>
>
> *Yes, I'm certain computers will find it more difficult to get a high 
> score on ARC-AGI-2, but human beings will find this new test to be even 
> more difficult than computers do. Today's benchmarks are becoming obsolete 
> because computers are rapidly maxing them out, that's why we need ARC-AGI-2, 
> it will be very useful in comparing one AGI to another AGI.*
>
> *John K Clark    See what's on my new list at  Extropolis 
> <https://groups.google.com/g/extropolis>*
>  
>
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Everything List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/everything-list/fec94105-c164-408f-8cf8-f3c2c1972178n%40googlegroups.com.
