On Mon, 28 Apr 2025 at 18:44, Russ Allbery <[email protected]> wrote:
> Aigars Mahinovs <[email protected]> writes:
>
> > If we take as a given that copyright does *not* survive the learning
> > process of a (sufficiently complex) AI system, then it is *not* necessary
> > that all training *data* for training a DFSG-free AI to also be
> > DFSG-free.
> > It is however necessary that:
> > * software needed for inference (usage) of the AI model to be DFSG-free
> > * software needed for the training process of the AI model to be
> > DFSG-free
> > * software needed to gather, assemble and process the training data to be
> > DFSG-free or the manual process for it to be documented
>
> Without necessarily disagreeing with this, I want to highlight that
> licensing is only *one* of the considerations behind the DFSG and we
> shouldn't fixate only on it. The other question is whether the training
> data constitutes source code in the sense of DFSG 2. I think there's at
> least a prima facie case that it is: The final training model is quite
> clearly not the preferred form of modification, and anyone who wanted to
> retrain the model would normally prefer to start with the existing
> training data set (and then possibly augment or filter it).
>
>
Yes, that is a very important problem for Debian, and the Desert Island
test, for example, would apply well here. If re-training the model would
require downloading half the Internet, then it is pretty obvious that
someone on a desert island without a network connection will not be able to
do this. *However*, models are again substantially different from regular
software (which gets modified in source form and then compiled to a binary),
because such a model can be *modified* and adapted to your needs directly
from its end state. In fact, when adjusting an LLM for use in a particular
domain or a particular company, it actually *is* the "binary" that is the
*preferred* form for modification - you take a model that "knows" a lot in
general and "knows" how your language works, and you train the model further
by doing specialisation training on your specific data set. As a result,
from one "generic" binary you get another, "specialized", binary.
So, very precisely speaking, modification of an LLM does *not* require the
original training data. Recreating an LLM does. Developing a new LLM
with different training methods or training conditions also needs some
training data (ideally the original training data, especially to compare
end performance). But all in all, a developer on a desert island would be
better off with a "binary" model to modify than without it.
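To make the point concrete, here is a deliberately toy sketch (a
one-parameter "model" and invented data, not how real LLM fine-tuning is
implemented): specialisation starts from the already-trained weights and a
small new data set, and never touches the original "generic" corpus.

```python
# Toy illustration: fine-tuning modifies the end-state weights directly;
# the original pre-training data is not needed for the specialisation step.

def train(w, data, lr=0.1, epochs=200):
    """Gradient descent on y = w * x with squared error."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

# "Generic" pre-training data (here y = 2x) - imagine this corpus is
# enormous and no longer available after pre-training.
generic = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_base = train(0.0, generic)      # the distributed "binary" model

# Domain specialisation (here y = 3x) needs only the small new data set
# plus the existing weights - the generic corpus is never consulted.
domain = [(1.0, 3.0), (2.0, 6.0)]
w_tuned = train(w_base, domain)

print(round(w_base, 2), round(w_tuned, 2))  # → 2.0 3.0
```

The same asymmetry holds at LLM scale: *modifying* needs the weights plus
new data; *recreating from scratch* needs the original corpus.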
Say, for example, that an IDE saves its configuration state not in a common
text file, but as a binary memory dump. Say the maintainer of such a
package uses their experience with the IDE and years of development to
go through the GUI of this software and assemble a great setup
configuration - one that is great for anyone starting to use the IDE and
also leaves clues about how to tailor it further for your needs. This
configuration (as a binary memory dump of the software state) is then
distributed to the users as the default configuration. What is "the source"
of it? Isn't this binary (which the GUI can both read and write) the
preferred form for modification? The maintainer can describe how they
created the GUI state (document the training process), but cannot really
include all of the relevant experience (training data) that led them to
believe that this state is the best for new users. So what is Llama if not
a **very** complex nvim config file focused on autocomplete? :D Quite a few
of these questions also apply to fonts (IMO).
We (as Debian) approach DFSG compliance in terms of source code more
strictly than many licenses do. We require the source code to be on Debian
servers in the Debian-preferred form, while the GPL, for example, is
content with a promise to send the end user the source code on request.
That *could* be the technical difference in definitions between what is
"DFSG-free AI" and what is "Debian-main-grade free AI", especially if
Debian decided it does not want to store literal terabytes of training
data for every LLM variation. This could be worked around at a more general
level with some kind of data-set preservation and indexing foundation, like
the Internet Archive. In that case the Debian package could reference the
particular assembled data set it used for training (for example, in the
form of a magnet link) and delegate storage and re-distribution of that
dataset to external trusted source organisations.
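As a purely hypothetical sketch of what such a reference might look like
(the Training-Data field, locator values, and URLs below are invented for
illustration - this is not an existing DEP or policy), it could live in the
packaging metadata next to the usual upstream information:

```yaml
# debian/upstream/metadata (hypothetical extension, illustrative only)
Name: example-llm
Training-Data:
  - Description: Assembled pre-training corpus, snapshot used for this build
    Locator: magnet:?xt=urn:btih:<hash of the frozen data set>
    Mirror: https://archive.org/details/example-llm-corpus
```

Debian would then only need to verify and index the reference, while the
bytes themselves are stored and served by the external archive.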
Or Debian could go the MS TTF route: have the software in the archive, but
no models at all. To get the software to work, users would get used to
running a script that always pulls a model from huggingface.co, either
manually or even during package installation - possibly with a barely
functional placeholder model in the package that 99% of users would replace
in real usage. That would keep the "evil" AI away from the archive, but
would that benefit our users? Would it benefit the development of a freer
and more accessible AI landscape? I would think rather the opposite.
--
Best regards,
Aigars Mahinovs