(While I find the tone of the email a bit exasperated, I will try to reply factually and I hope it will be received as such.)
On Wed, 7 May 2025 at 11:34, Simon Josefsson <[email protected]> wrote:
>
> Aigars Mahinovs <[email protected]> writes:
>
> > On Wed, 7 May 2025 at 02:56, Russ Allbery <[email protected]> wrote:
> >
> >>
> >> I think if any of the options in the current GR except Aigars's (and maybe
> >> Sam's?) passes, that would effectively be a change in our current policy,
> >> even if the current policy is not precisely intentional.
> >
> >
> > IMHO my option will also be a change in our current policy, but, instead of
> > requiring the training data itself, my option would just require adding a
> > documentation section describing how to create/gather and process data
> > required to train such models *if* someone would want to reproduce them.
>
> Would failure for anyone else to be able to reproduce them be a RC bug?

That depends on the clarity and explicitness of the instructions. OSI
uses the criterion that a skilled person should be able to build a
substantially equivalent system from the given instructions. So it
would be technically sufficient if *someone* is able to reproduce
sufficiently similar results. Others may fail because of a mistake
somewhere in the process, which often involves a value judgement (like
recognising when a model is overfitted).

> Do the tools required for reproducing the model have to be in Debian
> main, or are non-free or external proprietary tools okay?

They have to be in main. All software required for creating the
training data set, transforming the training data set, training the
model and using the model has to be DFSG-free software in Debian main.
That part was never in question in any definition being discussed,
AFAIK.

> Do the toolchain for LLM models support bit-by-bit reproducible outputs?

AFAIK, no. Bit-by-bit reproducibility is also not a DFSG criterion.

> Is a Build-Depends on such a LLM-model acceptable?
> Then we could
> eventually replace the source code for `sudo` in Debian with a LLM
> prompt like "write me a secure replacement for sudo and output a
> executable ELF binary for my host architecture".  In fact, with a bit of
> more irony, we could replace a lot of insecure source code this way.

That is a fun question, but you would get exactly the same answer
regardless of what training data was used to train such an LLM, even if
an LLM were created that was trained *only* on the contents of Debian
main. Replacing the source code of a package with a call to a generator
would be silly in many different ways. (And it would not really
generate a binary, that's not how LLMs work: they still output words.)

However, there is nothing problematic about a developer using an LLM to
generate source code that, *after* developer review, becomes part of a
wider code base implementing useful functionality. This could also be
used very productively to generate drafts of API documentation and unit
tests. It is no different from templating and scaffolding. The
developer executing those requests and reviewing the code owns the
copyright of the generated material.

> I'm not convinced this approach leads to something desirable.  I fear it
> means people will have yet another way to add proprietary content into
> Debian, and that Debian give up further on caring about user freedom.

Being able to reproduce Debian binaries bit-by-bit from
developer-readable source code is a good feature, but it does not
really appear among the user freedoms defined in the Debian Social
Contract. Even the ability to rebuild a particular Debian package fully
automatically can disappear over time as external dependencies and
environments change. Such changes then require adaptations by a skilled
person before a substantially equivalent system can be built again.

Pushing towards models that provide a way to modify the model behaviour
*after* base training *adds* to the user freedoms for modification and
derived works, compared to the fixed software binaries we have now.
Pushing towards descriptions of the training data and of the process
for recreating model weights *adds* to the users' freedom to access the
real source, compared to distributing only the already distilled expert
knowledge like we do right now.

-- 
Best regards,
    Aigars Mahinovs

