I don't understand the need for multi-pass training in LLMs. The human
brain is single-pass, and so are all of the neural network compressors I
have written since 1999. If you want to predict the next bit in some
context, you just count the zeros and ones already seen in that context.
My early PAQ versions did this explicitly. Later versions did it
implicitly by decreasing the learning rate over time, by mixing fast and
slow learning models, and by using indirect context models that map a bit
history to a prediction. If LLMs aren't doing something like this, they
are wasting a lot of GPU cycles.
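
The count-based, single-pass scheme above can be sketched in a few lines
of Python (a hypothetical simplification in the spirit of early PAQ, not
the actual PAQ code; the class and variable names are mine). The comment
in update() also notes why counting is implicitly a decreasing learning
rate:

```python
from collections import defaultdict

class CountPredictor:
    """Single-pass bit predictor: per-context zero/one counts, no re-training."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # context -> [n0, n1]

    def predict(self, ctx):
        n0, n1 = self.counts[ctx]
        return (n1 + 1) / (n0 + n1 + 2)  # Laplace-smoothed P(next bit = 1)

    def update(self, ctx, bit):
        # Incrementing a count moves the smoothed estimate by
        # (bit - p) / (n0 + n1 + 3): an implicit learning rate that
        # shrinks as the context is seen more often.
        self.counts[ctx][bit] += 1

# One pass over a bit stream with a sliding order-3 context.
bits = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
model = CountPredictor()
ctx = ()
for b in bits:
    p = model.predict(ctx)   # predict first...
    model.update(ctx, b)     # ...then update: the model never sees data twice
    ctx = (ctx + (b,))[-3:]  # keep the last 3 bits as context
```

After this single pass over the alternating stream, the model already
assigns P(1) = 5/6 following context 010 and P(1) = 1/5 following 101,
with no second pass over the data.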

As for "toxicity" (racism, being rude to customers, whatever), I don't see
how multi-pass training makes a difference. LLMs are racist because most
people are racist, but you can still use RL to train one not to make racist
statements without changing its world model, just like you can with people.

-- Matt Mahoney, [email protected]

On Thu, May 15, 2025, 2:25 PM James Bowery <[email protected]> wrote:

>
>
> On Thu, May 15, 2025 at 12:39 PM Matt Mahoney <[email protected]>
> wrote:
>
>>
>> In a text compressor, the model is updated after each prediction.
>>
>
> Single-pass text compressors do that.  Existing "large models" are
> multi-pass.  As long as the existing scaling laws of machine learning
> remain undisturbed by fundamental advances (as might be discovered at a
> relatively low risk-to-payoff ratio by, for example, a $1e9 Hutter Prize
> purse), there is good reason to believe that all downstream stages from
> the "foundation model" aka "pre-training model" aka (?) not only benefit
> from a better approximation of the algorithmic information of the corpus,
> but that it may be futile to patch their inadequacies by any amount of
> downstream modification.
>
> For example, all attempts at suppressing "toxicity", such as "cleaning"
> the data and then using RL, turn out to have been, in hindsight, obviously
> wrong.  All you need to do is let the multi-pass compression of the
> unedited corpus during "pre-training" of the "foundation model" do the
> forensic epistemology.  This doesn't tell you what is "toxic" vs. what is
> not "toxic", but it _does_ enable the RL to figure out what you mean by
> "toxic" more efficiently when you whack it upside the head for being
> "toxic".  This is because forensic epistemology -- i.e., ruthless truth
> discovery -- provides a predictive ontology.
>
> You don't need ongoing updates from "news" to get such essential modeling
> work done, and there is reason to believe it may be a wasted effort to add
> new data to the mix unless you are willing to unlearn a great deal that
> existing scaling laws will force you to relearn at exponential cost.  You
> will want to do that only when there is a sufficient backlog of "news" to
> justify the cost.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc5c19d0f38aacd6-M134537b36974fb95b6d8522e
Delivery options: https://agi.topicbox.com/groups/agi/subscription
