Stefano Zacchiroli <[email protected]> writes:

> FWIW, I looked specifically in the gnubg case a while ago, because it
> was an interesting test case for this discussion.
Oh, thank you! I very much appreciate you doing the work to uncover
actual facts as opposed to my mostly uninformed speculations.

> Here's what I found out:
> - The training program (using the language from the GR draft) is
>   allegedly available and licensed under GPL3.
> - The training data is allegedly available as well, but comes without
>   any declared license. I tend to concur with you, Russ, that it's
>   very likely non-copyrightable material. But that's only partly
>   reassuring to me, because I'm not sure how Debian would practically
>   go about ruling that certain stuff that comes without
>   copyright/license is fine for main, whereas other stuff in the same
>   situation is not.

Yes, this is the tricky part for any sort of general "AI" policy (I
agree with Holger that this term is annoying propaganda, but we're
probably stuck with it).

Right now, people are mostly thinking about LLMs, which are trained on
large amounts of writing, which is almost always copyrighted because
it's one of the core types of artistic creativity recognized by
copyright laws. (Likewise for image generators, which are trained on
art.)

There are a bunch of other things that fall into the AI bucket,
however, and many of them predate the invention of LLMs. Some of them
will have similar challenges with training data (translation software
is probably also trained on writing, for instance, and voice
recognition software is probably trained on voice samples that are
often copyrighted). Some of them, however, will be trained on things
that are widely recognized to be non-copyrightable facts, such as
records of backgammon, chess, or go games.

However, even that is tricky, because the *annotations* on chess games
can be copyrighted. What is the line beyond which the game annotations
are copyrighted material? I personally have no idea; I don't know if
tagging moves with !, !!, ?, and ?? but no other commentary would
constitute copyrightable material.
I also don't know if chess engines use such annotations in their
training.

The simplest and most ideologically consistent position that we could
take, at least from my perspective, would be to decide that any data
file in the form of distilled neural network weights or similar
encoded training data is the "binary" output of a "compilation"
process, and the training data is the source code for that binary,
which means that under the DFSG the source code not only has to be
free software but has to be included in the archive.

This is pleasingly ideologically coherent and mostly avoids weird and
uncomfortable ethical compromises. However, I'm not sure it's very
*practical* unless our position is that we're simply not going to
package software that uses machine learning models (a decision that we
could certainly make, but which seems a bit contrary to our normal
desire to be a universal operating system). Problems just off the top
of my head include:

1. This data is often huge and also of very little interest to anyone
   other than people attempting to confirm the free software status of
   the resulting model. Unlike the more typical forms of source code,
   I suspect it's rare to want to tweak the training data to fix some
   bug or add some feature and then "recompile." I certainly had never
   considered doing such a thing when maintaining gnubg, but I patched
   the more conventional source code quite frequently.

2. Using the data to reproduce the model often takes significant
   amounts of computing resources, quite possibly more than we would
   like to spend on such a task. But if we don't do that work, we
   don't really know if we have the real sources.

3. It's quite likely, as I understand it, that the training process is
   not going to be deterministic, so we may not easily be able to
   process the training data and get back the original weights. My
   understanding is that training tends to involve some randomization
   for technical reasons.
   Also, even if it's *possible* to design a reproducible training
   process, I suspect many upstreams will not have bothered.

4. As you discovered, finding the training data is not going to be
   easy since almost no one cares, even when upstream has retained it
   (which I suspect will not always be the case, since I expect in at
   least some cases upstream would just start over if they wanted to
   retrain the model, and therefore would view at least some of the
   training data as equivalent to ephemeral object files they would
   discard). This is of course not a new problem in free software, and
   we have long experience with telling upstreams that no, we really
   do care about all of the source code, but it is incrementally more
   work of a type that most Debian packagers truly dislike doing.

I'm a bit worried that people have the specific case of LLMs in mind,
which are almost always going to pose copyright problems and
derivative work problems. I'm sure I'm not the only one here who is a
general LLM skeptic who has been underwhelmed by the quality of the
output LLM advocates claim to find useful, and therefore would find it
quite easy to say no to LLMs in Debian without feeling like the
project was missing anything of significance. But machine learning is
a lot older than LLMs and has a lot of useful applications other than
mediocre text generation, and training data for at least some of those
models doesn't look anything like LLM training data and may have
entirely different licensing properties. It feels likely to me that
there are some babies in that bathwater.

Maybe we've been ethical hypocrites all along about machine learning
applications packaged in Debian, and the current LLM craze is a good
opportunity to clean house and reaffirm a strict free software policy
including training data. I'm rather sympathetic to that argument,
frankly, just because the simplicity of the "source code for
everything, no exceptions" position is comfortable in my brain.
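To make point 3 above concrete, here is a minimal toy sketch (nothing
to do with gnubg's actual training, just an illustration) of why
running the same training program over the same data twice need not
reproduce the same weights: the random initialization alone is enough
to change the result, and pinning the seed is what makes the "build"
reproducible.

```python
import random

def train(seed=None, steps=25, lr=0.05):
    """Fit y = w * x to two data points with plain gradient descent,
    starting from a randomly initialized weight, as real training
    frameworks typically do."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)           # random initialization
    data = [(1.0, 2.0), (2.0, 4.0)]      # toy "training data"; true w is 2
    for _ in range(steps):
        # gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Different seeds start from different weights, so a short training run
# ends at slightly different final weights: the "binary" is not
# bit-for-bit reproducible from the "source" alone.
print(train(seed=1), train(seed=2))

# Pinning the seed (and every other source of randomness) makes the
# process reproducible:
assert train(seed=42) == train(seed=42)
```

In real systems the randomness also comes from data shuffling,
dropout, and parallel floating-point reduction order, so reproducing a
published model exactly takes deliberate engineering effort, not just
a fixed seed.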
But we should be fairly sure about what we're agreeing to before
making that decision.

-- 
Russ Allbery ([email protected])           <https://www.eyrie.org/~eagle/>

