Not enough detail for any conclusions. According to the blog post on 
arcprize.org (see here: https://arcprize.org/blog/oai-o3-pub-breakthrough), 

*“OpenAI’s new o3 system—**trained on the ARC-AGI-1 Public Training 
set**—has scored a breakthrough 75.7% on the Semi-Private Evaluation set 
at our stated public leaderboard $10k compute limit. A high-compute (172x) 
o3 configuration scored 87.5%.”*

This excerpt is central to the discussion. The blog is announcing what it 
calls a “breakthrough” result, attributing the model’s performance on an 
evaluation set to the new “o3 system.” The mention of the “$10k compute 
limit” likely refers to the compute budget the public leaderboard allows 
for running the model on the evaluation tasks. Additionally, there is a 
statement that when the system is scaled up dramatically (172 times more 
compute), it manages to score 87.5%. The gap between the 75.7% and 87.5% 
results is thus explained entirely by a large disparity in compute 
expenditure, not by any change of method.

More significantly, as I’ve highlighted in bold, the model was explicitly 
trained on the very same data (or a substantial subset of it) against which 
it was later tested. The text itself says: “trained on the ARC-AGI-1 Public 
Training set” and then, in the next phrase, reports scores on “the 
Semi-Private Evaluation set,” which is presumably meant to be the test 
portion of that same overall dataset (or at least closely related). While 
it is possible in machine learning to maintain a strictly separate portion 
of data for testing, *the mention of “Semi-Private” invites questions about 
how distinct or withheld that portion really is.* If the “Semi-Private 
Evaluation set” is derived from the same overall data distribution used in 
training, or if it contains overlapping examples, then the resulting 75.7% 
or 87.5% scores might reflect overfitting/memorization more than genuine 
progress toward robust, generalizable intelligence.

The separation of training data from test or evaluation data is critical to 
ensure that performance metrics capture generalization, rather than the 
model having “seen” or memorized the answers in training. When a blog post 
highlights a “breakthrough” but simultaneously acknowledges that the data 
used to measure said breakthrough was closely related to the training set, 
skeptics like yours truly naturally question whether this milestone is 
more about tuning to a known distribution than about a leap in fundamental 
capabilities. Memorization is not reasoning, as I’ve stated many times 
before. But this falls on deaf ears here all the time.
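Whether the evaluation set overlaps the training data is, at least for 
exact duplicates, mechanically checkable. A minimal sketch (the task 
dictionaries below are hypothetical stand-ins, not the real ARC-AGI-1 
data, which I don’t have access to here) fingerprints each task and 
intersects the sets:

```python
import hashlib
import json

def task_fingerprint(task):
    """Hash a task's grids in a canonical form so that identical tasks
    collide regardless of key ordering."""
    canonical = json.dumps(task, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical stand-ins; real ARC tasks are JSON grids of similar shape.
train_tasks = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]},
]
eval_tasks = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},  # duplicate
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
]

train_hashes = {task_fingerprint(t) for t in train_tasks}
leaked = [t for t in eval_tasks if task_fingerprint(t) in train_hashes]
print(f"{len(leaked)} of {len(eval_tasks)} eval tasks overlap training")
```

Note that this only catches verbatim duplicates. The deeper worry raised 
above—that the evaluation tasks come from the same distribution the model 
was tuned on—cannot be detected by any hash check.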

Beyond the bare mention of “trained on the ARC-AGI-1 Public Training set,” 
there is an implied process of repeated tuning or hyperparameter searches. 
If the developers iterated many times over that dataset—adjusting 
parameters, architecture decisions, or training strategies to maximize 
performance on that very same distribution—then the reported results are 
likely inflated. In other words, repeated attempts to boost the benchmark 
score can lead to an overly optimistic portrayal of performance. At worst 
this becomes “benchmark gaming”; in milder terms, it is accidental 
overfitting to validation/test data that was supposed to stay isolated.
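The inflation from repeated tries is easy to demonstrate with a small 
simulation (the numbers below are illustrative assumptions, not 
measurements of o3): even when every configuration has the same true 
solve rate, reporting only the best of many runs overstates it.

```python
import random

random.seed(0)

TRUE_ACCURACY = 0.5   # assumed "real" per-task solve rate
N_TASKS = 100         # size of the evaluation set
N_TRIALS = 50         # hyperparameter / seed configurations tried

def eval_once():
    # Each configuration is equally good; the score varies only by chance.
    return sum(random.random() < TRUE_ACCURACY for _ in range(N_TASKS)) / N_TASKS

scores = [eval_once() for _ in range(N_TRIALS)]
print(f"mean score over trials: {sum(scores) / len(scores):.2f}")
print(f"best reported score   : {max(scores):.2f}")
```

The mean over trials hovers near the true 50%, while the maximum—the 
number a leaderboard submission would report—comes out several points 
higher, purely from selection over noise.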

The blog post mentions two different performance figures: one achieved 
under the publicly stated $10k compute limit, and another, much higher 
score (87.5%) when the system was scaled up 172 times in terms of compute 
expenditure. Consider the cost of such experiments for so small a 
performance bump! That is not an encouraging sign for optimists on this 
question.
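Back-of-the-envelope, assuming the cost scales linearly with the stated 
172x factor (an assumption on my part; the blog does not publish the exact 
high-compute spend):

```python
BASE_BUDGET_USD = 10_000          # stated public leaderboard limit
SCALE_FACTOR = 172                # high-compute configuration
base_score, high_score = 75.7, 87.5

high_budget = BASE_BUDGET_USD * SCALE_FACTOR   # $1.72M, if cost is linear
gain = high_score - base_score                 # 11.8 percentage points
cost_per_point = (high_budget - BASE_BUDGET_USD) / gain

print(f"extra spend       : ${high_budget - BASE_BUDGET_USD:,.0f}")
print(f"points gained     : {gain:.1f}")
print(f"$ per extra point : ${cost_per_point:,.0f}")
```

Under that linearity assumption, each additional percentage point costs 
on the order of $145k—hardly the economics of a scalable reasoning 
breakthrough.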

AI models can indeed be improved—sometimes dramatically—by throwing more 
compute at them. However, from the perspective of practicality or genuine 
progress, it may be less impressive if the improvement depends purely on 
scaling up hardware resources by a large factor, rather than demonstrating 
a new or more efficient approach to learning and reasoning. Is such 
spending warranted when the tasks being solved do not require advanced 
reasoning skills at all, and a much smaller system (or a human child) can 
handle them in a simpler way?

If a large-scale AI system with a massive compute budget is merely matching 
or modestly exceeding the performance that a human child can achieve, it 
undercuts the notion of a major “breakthrough.” Additionally, children’s 
ability to adapt to novel tasks and generalize without being artificially 
“trained” on the same data is a key part of the skepticism: the kind of 
intelligence the AI system is demonstrating might be narrower or more 
brittle compared to natural, human-like intelligence.

All of this underscores how these “breakthrough” claims can be misleading 
if not accompanied by rigorous methodology (e.g., truly held-out data, 
minimal overlap, reproducible results under consistent compute budgets). 
While the raw numbers of 75.7% and 87.5% might look impressive at face 
value, the context provided—including the fact that the ARC-AGI-1 dataset 
was also used for training—casts doubt on the significance of those scores 
as an indicator of robust progress in AI or alignment research.

I leave you with a quote from the blog: 

*Passing ARC-AGI does not equate to achieving AGI, and, as a matter of 
fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, 
indicating fundamental differences with human intelligence.*

*Furthermore, early data points suggest that the upcoming ARC-AGI-2 
benchmark will still pose a significant challenge to o3, potentially 
reducing its score to under 30% even at high compute (while a smart human 
would still be able to score over 95% with no training). This demonstrates 
the continued possibility of creating challenging, unsaturated benchmarks 
without having to rely on expert domain knowledge. You'll know AGI is here 
when the exercise of creating tasks that are easy for regular humans but 
hard for AI becomes simply impossible.*

And even with that, the memory vs. reasoning problem doesn’t vanish. 
Either you believe your interlocutor is generally intelligent or you 
don’t. But I’ve repeated this so many times that it’s getting too time 
consuming to keep responding in detail. Thanks to John anyway for posting 
past all the narcissism-needing-therapy spam here. It’s getting tedious; I 
have to agree with Quentin. Fewer and fewer posts with the big picture and 
a good level of nuance.
