Not enough detail for any conclusions. According to the blog post on 
arcprize.org (see here: https://arcprize.org/blog/oai-o3-pub-breakthrough), 

*“OpenAI’s new o3 system—**trained on the ARC-AGI-1 Public Training 
set**—has scored a breakthrough 75.7% on the Semi-Private Evaluation set 
at our stated public leaderboard $10k compute limit. A high-compute (172x) 
o3 configuration scored 87.5%.”*

This excerpt is central to the discussion. The blog is announcing what it 
calls a “breakthrough” result, attributing the model’s performance on an 
evaluation set to the new “o3 system.” The mention of the “$10k compute 
limit” likely refers to the compute budget the public leaderboard allows 
for running the model on the evaluation tasks. Additionally, there is a 
statement that when the system is scaled up dramatically (172 times more 
compute), it manages to score 87.5%. The gap between the 75.7% and 87.5% 
results is thus explained entirely by a large disparity in compute 
expenditure, not by any change of method.

More significantly, as I’ve highlighted in bold, the model was explicitly 
trained on the very same data (or a substantial subset of it) against which 
it was later tested. The text itself says: “trained on the ARC-AGI-1 Public 
Training set” and then, in the next phrase, reports scores on “the 
Semi-Private Evaluation set,” which is presumably meant to be the test 
portion of that same overall dataset (or at least closely related). While 
it is possible in machine learning to maintain a strictly separate portion 
of data for testing, *the mention of “Semi-Private” invites questions about 
how distinct or withheld that portion really is.* If the “Semi-Private 
Evaluation set” is derived from the same overall data distribution used in 
training, or if it contains overlapping examples, then the resulting 75.7% 
or 87.5% scores might reflect overfitting/memorization more than genuine 
progress toward robust, generalizable intelligence.

The separation of training data from test or evaluation data is critical to 
ensure that performance metrics capture generalization, rather than the 
model having “seen” or memorized the answers in training. When a blog post 
highlights a “breakthrough” but simultaneously acknowledges that the data 
used to measure said breakthrough was closely related to the training set, 
skeptics like yours truly naturally question whether this milestone is 
more about tuning to a known distribution than about a leap in fundamental 
capabilities. Memorization is not reasoning, as I’ve stated many times 
before. But this falls on deaf ears here all the time.
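Whether the evaluation set overlaps the training data is, at least for 
exact duplicates, mechanically checkable. A minimal sketch (the task 
dictionaries below are hypothetical stand-ins, not the real ARC-AGI-1 
data, which I don’t have access to here) fingerprints each task and 
intersects the sets:

```python
import hashlib
import json

def task_fingerprint(task):
    """Hash a task's grids in a canonical form so that identical tasks
    collide regardless of key ordering."""
    canonical = json.dumps(task, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical stand-ins; real ARC tasks are JSON grids of similar shape.
train_tasks = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]},
]
eval_tasks = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},  # duplicate
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
]

train_hashes = {task_fingerprint(t) for t in train_tasks}
leaked = [t for t in eval_tasks if task_fingerprint(t) in train_hashes]
print(f"{len(leaked)} of {len(eval_tasks)} eval tasks overlap training")
```

Note that this only catches verbatim duplicates. The deeper worry raised 
above—that the evaluation tasks come from the same distribution the model 
was tuned on—cannot be detected by any hash check.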

Beyond the bare mention of “trained on the ARC-AGI-1 Public Training set,” 
there is an implied process of repeated tuning or hyperparameter searches. 
If the developers iterated many times over that dataset—adjusting 
parameters, architecture decisions, or training strategies to maximize 
performance on that very same distribution—then the reported results are 
likely inflated. In other words, repeated attempts to boost the benchmark 
score can lead to an overly optimistic portrayal of performance. At worst 
this becomes “benchmark gaming”; in milder terms, it is accidental 
overfitting to validation/test data that was supposed to stay isolated.
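The inflation from repeated tries is easy to demonstrate with a small 
simulation (the numbers below are illustrative assumptions, not 
measurements of o3): even when every configuration has the same true 
solve rate, reporting only the best of many runs overstates it.

```python
import random

random.seed(0)

TRUE_ACCURACY = 0.5   # assumed "real" per-task solve rate
N_TASKS = 100         # size of the evaluation set
N_TRIALS = 50         # hyperparameter / seed configurations tried

def eval_once():
    # Each configuration is equally good; the score varies only by chance.
    return sum(random.random() < TRUE_ACCURACY for _ in range(N_TASKS)) / N_TASKS

scores = [eval_once() for _ in range(N_TRIALS)]
print(f"mean score over trials: {sum(scores) / len(scores):.2f}")
print(f"best reported score   : {max(scores):.2f}")
```

The mean over trials hovers near the true 50%, while the maximum—the 
number a leaderboard submission would report—comes out several points 
higher, purely from selection over noise.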

The blog post mentions two different performance figures: one achieved 
under the publicly stated $10k compute limit, and another, much higher 
score (87.5%) when the system was scaled up 172 times in terms of compute 
expenditure. Consider the cost of such experiments for so small a 
performance bump! That is not an encouraging sign for optimists on this 
question.
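Back-of-the-envelope, assuming the cost scales linearly with the stated 
172x factor (an assumption on my part; the blog does not publish the exact 
high-compute spend):

```python
BASE_BUDGET_USD = 10_000          # stated public leaderboard limit
SCALE_FACTOR = 172                # high-compute configuration
base_score, high_score = 75.7, 87.5

high_budget = BASE_BUDGET_USD * SCALE_FACTOR   # $1.72M, if cost is linear
gain = high_score - base_score                 # 11.8 percentage points
cost_per_point = (high_budget - BASE_BUDGET_USD) / gain

print(f"extra spend       : ${high_budget - BASE_BUDGET_USD:,.0f}")
print(f"points gained     : {gain:.1f}")
print(f"$ per extra point : ${cost_per_point:,.0f}")
```

Under that linearity assumption, each additional percentage point costs 
on the order of $145k—hardly the economics of a scalable reasoning 
breakthrough.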

AI models can indeed be improved—sometimes dramatically—by throwing more 
compute at them. However, from the perspective of practicality or genuine 
progress, it may be less impressive if the improvement depends purely on 
scaling up hardware resources by a large factor, rather than demonstrating 
a new or more efficient approach to learning and reasoning. Is such 
spending warranted when the tasks being solved do not require advanced 
reasoning skills at all, and a much smaller system (or a human child) can 
handle them in a simpler way?

If a large-scale AI system with a massive compute budget is merely matching 
or modestly exceeding the performance that a human child can achieve, it 
undercuts the notion of a major “breakthrough.” Additionally, children’s 
ability to adapt to novel tasks and generalize without being artificially 
“trained” on the same data is a key part of the skepticism: the kind of 
intelligence the AI system is demonstrating might be narrower or more 
brittle compared to natural, human-like intelligence.

All of this underscores how these “breakthrough” claims can be misleading 
if not accompanied by rigorous methodology (e.g., truly held-out data, 
minimal overlap, reproducible results under consistent compute budgets). 
While the raw numbers of 75.7% and 87.5% might look impressive at face 
value, the context provided—including the fact that the ARC-AGI-1 dataset 
was also used for training—casts doubt on the significance of those scores 
as an indicator of robust progress in AI or alignment research.

I leave you with a quote from the blog: 

*Passing ARC-AGI does not equate to achieving AGI, and, as a matter of 
fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, 
indicating fundamental differences with human intelligence.*

*Furthermore, early data points suggest that the upcoming ARC-AGI-2 
benchmark will still pose a significant challenge to o3, potentially 
reducing its score to under 30% even at high compute (while a smart human 
would still be able to score over 95% with no training). This demonstrates 
the continued possibility of creating challenging, unsaturated benchmarks 
without having to rely on expert domain knowledge. You'll know AGI is here 
when the exercise of creating tasks that are easy for regular humans but 
hard for AI becomes simply impossible.*

And even with that, the memory vs. reasoning problem doesn’t vanish. 
Either you believe your interlocutor is generally intelligent or you 
don’t. But I’ve repeated this so many times that it’s getting too time 
consuming to keep responding in detail. Thanks to John anyway for posting 
past all the narcissism-needing-therapy spam here. It’s getting tedious; I 
have to agree with Quentin. Fewer and fewer posts with the big picture and 
a good level of nuance.
