"*indicating fundamental differences with human intelligence.*" Dude! Human intelligence is the property of consciousness of being able to bring new ideas into existence out of nothing. You cannot simulate such a thing. Omg... so many children! When will you people ever grow up ?
On Saturday, 21 December 2024 at 12:14:05 UTC+2 PGC wrote:

> Not enough detail for any conclusions. According to the blog post on arcprize.org (see here: https://arcprize.org/blog/oai-o3-pub-breakthrough):
>
> *“OpenAI’s new o3 system—**trained on the ARC-AGI-1 Public Training set**—has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.”*
>
> This excerpt is central to the discussion. The blog is announcing what it calls a “breakthrough” result, attributing the model’s performance on an evaluation set to the new “o3 system.” The mention of the “$10k compute limit” probably refers to a constraint or budget allocated for training and/or inference on the public leaderboard. Additionally, there is a statement that when the system is scaled up dramatically (172 times more compute resources), it manages to score 87.5%. The difference between the 75.7% result and the 87.5% result is thus explained by a large disparity in the computational budget used for training or inference.
>
> More significantly, as I’ve highlighted in bold, the model was explicitly trained on the very same data (or a substantial subset of it) against which it was later tested. The text itself says “trained on the ARC-AGI-1 Public Training set” and then, in the next phrase, reports scores on “the Semi-Private Evaluation set,” which is presumably meant to be the test portion of that same overall dataset (or at least closely related). While it is possible in machine learning to maintain a strictly separate portion of data for testing, *the mention of “Semi-Private” invites questions about how distinct or withheld that portion really is.* If the “Semi-Private Evaluation set” is derived from the same overall data distribution used in training, or if it contains overlapping examples, then the resulting 75.7% or 87.5% scores might reflect overfitting and memorization more than genuine progress toward robust, generalizable intelligence. (A minimal overlap check is sketched below.)
>
> The separation of training data from test or evaluation data is critical to ensure that performance metrics capture generalization, rather than the model having “seen” or memorized the answers in training. When a blog post highlights a “breakthrough” but simultaneously acknowledges that the data used to measure said breakthrough was closely related to the training set, skeptics like yours truly naturally question whether this milestone is more about tuning to a known distribution than about a leap in fundamental capabilities. Memorization is not reasoning, as I’ve stated many times before. But this falls on deaf ears here all the time.
>
> Beyond the bare mention of “trained on the ARC-AGI-1 Public Training set,” there is an implied process of repeated tuning or hyperparameter searches. If the developers iterated many times over that dataset (adjusting parameters, architecture decisions, or training strategies to maximize performance on that very same distribution), then the reported results are likely inflated. In other words, repeated attempts to boost the benchmark score can lead to an overly optimistic portrayal of performance. Everybody knows that this becomes “benchmark gaming,” or, in milder terms, accidental overfitting to the validation/test data that was supposed to be isolated. (A toy demonstration follows below.)
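> To make the overlap worry concrete: none of us outside the organizers can audit the Semi-Private set, but the minimal hygiene check is cheap. A toy sketch in Python (my own illustration, not anything from the blog; it assumes ARC’s published JSON task format, and train_tasks / eval_tasks are hypothetical lists holding the two sets of tasks):
>
> import hashlib
> import json
>
> def task_fingerprint(task):
>     # Canonical serialization -> stable hash. 'task' is assumed to be
>     # the dict structure ARC tasks use (train/test pairs of int grids).
>     blob = json.dumps(task, sort_keys=True).encode("utf-8")
>     return hashlib.sha256(blob).hexdigest()
>
> def exact_overlap(train_tasks, eval_tasks):
>     # Return the eval tasks that are byte-for-byte duplicates of training tasks.
>     train_fps = {task_fingerprint(t) for t in train_tasks}
>     return [t for t in eval_tasks if task_fingerprint(t) in train_fps]
>
> An empty result here would rule out only the crudest contamination; near-duplicates and same-distribution leakage would sail straight through, which is exactly what the word “Semi-Private” leaves open.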
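> And on the repeated-tuning point, the inflation effect is easy to demonstrate with made-up numbers. Another toy sketch (again Python, again mine, not the actual o3 setup): every "configuration" below guesses at pure chance, yet picking the best of many runs against one fixed evaluation set reports a score far above chance.
>
> import random
>
> random.seed(0)
> N_ITEMS = 100    # size of the fixed evaluation set
> N_CONFIGS = 500  # hyperparameter settings tried against it
> truth = [random.randint(0, 1) for _ in range(N_ITEMS)]
>
> def random_config_accuracy():
>     # A "model" with zero skill: coin-flip answers (true accuracy 50%).
>     guesses = [random.randint(0, 1) for _ in range(N_ITEMS)]
>     return sum(g == t for g, t in zip(guesses, truth)) / N_ITEMS
>
> best = max(random_config_accuracy() for _ in range(N_CONFIGS))
> print(f"true skill: 50%, best reported score: {best:.0%}")
>
> Typically that prints a “best” in the mid-60s: selection alone buys a double-digit bump with zero genuine capability.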
> The blog post mentions two different performance figures: one achieved under the publicly stated $10k compute limit, and another, much higher score (87.5%) when the system was scaled up 172 times in compute expenditure. Consider the cost of such experiments for so small a performance bump: taken at face value, 172 times a roughly $10k budget is on the order of $1.7 million of compute for about twelve additional percentage points. That is not a positive sign for optimists on this question.
>
> AI models can indeed be improved, sometimes dramatically, by throwing more compute at them. However, from the perspective of practicality or genuine progress, it is less impressive if the improvement depends purely on scaling up hardware resources by a large factor, rather than on demonstrating a new or more efficient approach to learning and reasoning. Is this warranted if the tasks the model is solving do not require reasoning beyond what a much smaller system (or a human child) can handle in a simpler way?
>
> If a large-scale AI system with a massive compute budget is merely matching or modestly exceeding the performance that a human child can achieve, it undercuts the notion of a major “breakthrough.” Additionally, children’s ability to adapt to novel tasks and generalize without being artificially “trained” on the same data is a key part of the skepticism: the kind of intelligence the AI system is demonstrating might be narrower or more brittle than natural, human-like intelligence.
>
> All of this underscores how “breakthrough” claims can be misleading if not accompanied by rigorous methodology (e.g., truly held-out data, minimal overlap, reproducible results under consistent compute budgets). While the raw numbers of 75.7% and 87.5% might look impressive at face value, the context provided, including the fact that the ARC-AGI-1 dataset was also used for training, casts doubt on the significance of those scores as an indicator of robust progress in AI or alignment research.
>
> I leave you with a quote from the blog:
>
> *Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.*
>
> *Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.*
>
> And even with that, the memorization vs. reasoning problem doesn’t vanish. Either you believe your interlocutor is generally intelligent or you don’t. But I’ve repeated this so many times that it’s getting too time-consuming to keep responding in detail. Thanks to John anyway for posting past all the narcissism-in-need-of-therapy spam here. It’s getting tedious; I have to agree with Quentin. Fewer and fewer posts with the big picture and a good level of nuance.
>
> On Saturday, December 21, 2024 at 5:17:52 PM UTC+8 Cosmin Visan wrote:
>
>> @Brent. Shut up you woke communist! In case you don't know, you are a straight white male. If that woke feminazi had won the election, you would have been the first to be exterminated. Be glad that Trump won!