Aigars Mahinovs <[email protected]> writes:

> *However*, models again are substantially different from regular
> software (that gets modified in source and then compiled to a binary)
> because such a model can be *modified* and adapted to your needs
> directly from the end state.  In fact, for adjusting an LLM for use in
> a particular domain or a particular company, it actually *is* the
> "binary" that is the *preferred* form to be modified: you take a model
> that "knows" a lot in general and "knows" how your language works, and
> you train the model further by doing specialisation training on your
> specific data set.  As a result, from one "generic" binary you get
> another, "specialized" binary.
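The fine-tuning dynamic described above can be sketched with a toy model
(purely illustrative: a two-parameter linear model, nothing like a real
LLM training pipeline).  The point it illustrates is that specialisation
is just continued gradient descent from the trained parameters, and it
touches only those parameters, never the original training data.

```python
# Toy sketch, assuming a linear model y = w*x + b in place of an LLM.
# "Pretraining" fits generic data; "fine-tuning" continues gradient
# descent from the resulting (w, b) on domain data alone.

def train(w, b, data, lr=0.01, epochs=500):
    """Fit y = w*x + b to (x, y) pairs by per-sample gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on this sample
            w -= lr * err * x       # gradient step for the slope
            b -= lr * err           # gradient step for the intercept
    return w, b

# "Generic" pretraining on broad data drawn from y = 2x.
generic = [(x, 2.0 * x) for x in (-2, -1, 0, 1, 2)]
w, b = train(0.0, 0.0, generic)

# Domain specialisation: continue training from the pretrained (w, b)
# on a small domain data set drawn from y = 2x + 1.  The generic data
# set is never read again; the weights alone carry it forward.
domain = [(x, 2.0 * x + 1.0) for x in (0, 1, 2)]
w2, b2 = train(w, b, domain)
```

Running this, the specialised parameters (w2, b2) converge near the
domain's (2, 1) even though the fine-tuning step only ever saw the
pretrained weights and the small domain data set.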
I have to say that I'm not convinced by this argument that models are
any different from other types of software.  To me, this type of
"modification" is akin to using code as a library without modifying it.
Yes, that is something people often want to do.  It is by far the most
common way to use a library, because that is the whole point of a
library.  But we still hold libraries to the DFSG.  It's very, very
rare for me to want to modify libc, or for that to be a good idea, but
we wouldn't ship libc without source code, because sometimes we really
do want to modify the library itself.

One of the reasons I'm so leery of a theoretical argument that a
machine learning model isn't really software in the sense that we
usually mean is that this conclusion is appealingly convenient.  It's
very impractical and difficult to treat the training data as source
code, so I have a subconscious temptation to find some reason to
justify why it isn't source code, which can lead to magnifying
differences to a point that I worry isn't justified.  I think I'd
personally be more comfortable tackling the real problem head on: we're
probably not capable, in general, of treating the training data like
source code, so now what?  But I am one of those people who prefers a
system of broad and conflicting rights that require thoughtful
balancing, rather than a system of narrow and absolute rights.

> So, very precisely speaking, modification of an LLM does *not* require
> the original training data.  Recreating an LLM does.  Also, developing
> a new LLM with different training methods or training conditions does
> need some training data (ideally the original training data,
> especially to compare end performance).  But all in all, a developer
> on a Desert Island would be better off with a "binary" model to be
> modified than without it.

This last argument is true of all proprietary software, though.
One is always better off, at least in some immediate practical sense,
having something with severe usage restrictions than not having
anything at all.  This isn't the test we use for the DFSG, though.
Debian's position is that if we can't offer you all of the DFSG
freedoms, we don't put the software in main, even if it would still be
very useful within those restrictions.

> Say, for example, that an IDE saves its configuration state not in a
> common text file but as a binary memory dump.  Say the maintainer of
> such a package uses their experience of the IDE and years of
> development to go through the GUI of this software and assemble a
> setup configuration that is great for anyone starting to use the IDE
> and also has clues left in it about how to tailor it further for your
> needs.  This configuration (as a binary memory dump of the software
> state) is then distributed to the users as the default configuration.
> What is "the source" of it?

I agree that in this case there is no separate source code and this
binary data structure is the preferred form of modification.  But
that's because the data structure was created by a human directly, not
by an automated process.  It is a configuration file that a user wrote
via an editor (the IDE).

> Isn't this binary (which the GUI can both read and write) the
> preferred form for modification?  The maintainer can describe how he
> created the GUI state (document the training process), but cannot
> really include all his relevant experience (training data) that led
> him to believe that this state is the best for new users.

I guess all I can say is that I disagree with this way of analyzing the
situation on a whole lot of levels: philosophical, practical, and
legal.
To me, this makes the unwarranted leap of assuming that machine
learning models are like Commander Data from Star Trek: independent
life forms that are morally equivalent to a human being and therefore
should receive the same special treatment in free software ethics as
human beings.  To me, this is just obviously not the case, and I have
absolutely no qualms about treating human activity as fundamentally and
completely different from computer activity in our ethics and in our
free software guidelines.

> Or Debian could go the MS TTF route: have the software in the archive,
> but no models at all.  To get the software to work, users would get
> used to running a script that always pulls a model from
> huggingface.co, either manually or even during package installation,
> possibly with a barely functional placeholder model in the package
> that 99% of users would replace in real usage.  That would keep the
> "evil" AI away from the archive, but will that benefit our users?

I would echo the pleas elsewhere to avoid loaded terms like "evil" or
"toxic", because we don't have to agree on a morality in order to agree
on an ethical structure for deciding what is and isn't free software.
I personally do not believe proprietary software is evil in some
greater moral sense.  I know there are people in the free software
community who believe this, but I do not, and I am not required to
believe it in order to participate in Debian.  All that I'm required to
do is agree that Debian is for a specific type of software that meets a
set of ethical requirements, and that software that does not meet those
requirements, whether good or bad, useful or not, should not be part of
Debian.  If I want to work on such software, I am free to do that, just
not here.  Debian provides a general-purpose computing platform that I
can (and do) use to do all sorts of things that fall outside the scope
of the Debian Project.
We don't need to, and should not, decide that everything that falls
outside of Debian's DFSG is evil.  That's not the purpose of our
guidelines.  The purpose is to set the boundaries of what the project
is for.  Different people in the project will agree to those boundaries
for different reasons and with entirely different personal perspectives
on the morality of them.  We don't have, or need, conformity here.

My goal in this discussion is to advocate for clearly defining the
boundaries of Debian so that people can rely on those definitions when
deciding whether to do their work inside Debian or elsewhere.  It's
perfectly fine for us to ask people to do some kinds of work elsewhere.
Debian is quite far from the only worthwhile software organization in
the world.  It's fine for us to limit our scope for many different
reasons, including to avoid disruptive internal conflict, and that does
not carry any project-wide judgment on the things we have decided not
to actively support.

> Will that benefit the development of a freer and more accessible AI
> landscape?

This is not a goal of the Debian Project at present.  It could of
course become one if we decided to adopt it, but it's not at all clear
to me that we would choose to do so.  (It may, of course, be a goal of
some individuals within the Debian Project, and that's fine, but that
doesn't carry as much weight in our project-wide decision-making
process.)

-- 
Russ Allbery ([email protected])  <https://www.eyrie.org/~eagle/>

