On Fri, 7 Feb 2025 at 16:04, Stefano Zacchiroli <[email protected]> wrote:

> I don't think we should focus our conversation on LLMs much, if at all.
While I agree LLMs tend to be the tail wagging the dog in AI/ML discussions, the thread focuses on LLMs and the resulting policy will apply to them.

> The reason is that, even if a completely free-as-in-freedom (including
> in its training dataset), high quality LLM were to materialize in the
> future, its preferred form of modification (which includes the dataset)
> will be practically impossible to distribute by Debian due to its size.

There are several candidates already, including Ai2's OLMo 2[1] and Pleias[2]:

  "They Said It Couldn't Be Done[3]

  Training large language models required copyrighted data until it did
  not. [...] These represent the first ever models trained exclusively on
  open data, meaning data that are either non-copyrighted or are published
  under a permissible license. These are the first fully EU AI Act
  compliant models. In fact, Pleias sets a new standard for safety and
  openness."

Given these provide a foundation on which future developers can build, as well as an example others can follow, there will be many more. Conversely, if we propagate the myth that these are too big/hard/costly to create with today's tools, let alone tomorrow's, then we run the risk that people will believe us. Not long ago, even obtaining a computer that could download and compile software was out of the reach of most!

On the "preferred form" (wording from the OSD rather than the DFSG): this is subjective and will differ from one person to another. While Sam may possess the tools and techniques to assess and address bias to some extent with weights only, if I as a security researcher or data protection officer need to detect and entirely eliminate problematic content (e.g., backdoors or "right to be forgotten" requests), then the *only* form I can accept is the training data, making it my "preferred form". I can't just tell a privacy commissioner or judge "there was only a 0.7% chance patients' medical records would be revealed, your honour".
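To make the leakage risk concrete: this is not DLG/iDLG itself, just a minimal stdlib-Python sketch of the underlying observation those attacks exploit. For a linear layer trained on a single example, the weight gradient is a rank-1 outer product whose rows are scalar multiples of the input, so an observer with only the gradient can reconstruct the "private" input up to scale. All values here are made up for illustration.

```python
# Toy gradient-leakage demo (illustrative, not DLG/iDLG): for a linear
# model y = Wx with squared-error loss on ONE example, dL/dW = (Wx - t) x^T,
# a rank-1 matrix whose every nonzero row is proportional to the input x.

def forward(W, x):
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def weight_gradient(W, x, target):
    # Gradient of 0.5 * ||Wx - t||^2 with respect to W: (Wx - t) x^T.
    err = [y_i - t_i for y_i, t_i in zip(forward(W, x), target)]
    return [[e * x_j for x_j in x] for e in err]

def recover_input(grad):
    # Pick the largest-norm row of the rank-1 gradient and normalise it;
    # it is the input x up to sign and scale.
    row = max(grad, key=lambda r: sum(v * v for v in r))
    norm = sum(v * v for v in row) ** 0.5
    return [v / norm for v in row]

# Hypothetical "private" training example the attacker never sees directly.
x = [3.0, 1.0, 4.0, 1.0, 5.0]
W = [[0.2, -0.1, 0.05, 0.3, -0.2],
     [0.1, 0.4, -0.3, 0.0, 0.25]]
t = [1.0, -1.0]

g = weight_gradient(W, x, t)
x_hat = recover_input(g)  # unit-scale reconstruction of x (up to sign)
```

Real attacks like DLG/iDLG generalise this idea to deep networks by optimising a dummy input until its gradients match the observed ones; the point stands that weights and gradients are not a privacy firewall in front of the training data.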
While Sam's tools are improving, so are tools that can reverse the training process (e.g., DLG/iDLG for model inversion, which "stands out due to its ability to extract sensitive information from the training dataset and compromise user privacy"[4]). Just as a software vendor doesn't get to tell users what constitutes an improvement for the purposes of the free software definition, we don't get to tell practitioners what the subjective "preferred form" means. That's why I prefer the objective "actual form" Sam referred to in suggesting "We look at what the software authors *actually do* to modify models they incorporate to determine the preferred form of modification". I guarantee some will reach for the data, so it must be included for that freedom to be fully protected.

> So when we think of concrete examples, let's focus on what could be
> reasonably distributed by Debian. This includes small(er) generative AI
> language models, but also all sorts of *non-generative* AI models, e.g.,
> classification models. The latter do not generate copyrightable content,
> so most of the issues you pointed out do not apply to them.

We can't make a valid decision or draft a policy by focusing on models which have no ability to create output that violates copyrights, only to then put the project, its derivatives, and users in legal hot water with others that do. You do raise a good point about what we can reasonably distribute with Debian, and many models would already exceed our current capacity (even without the dependencies required for reproducibility). This is a solvable problem though, and it's better to deliver utility to our users by solving it than to compromise on our principles or give up altogether. Common Crawl don't host their own dumps, for example.

> Other issues
> still apply to them, including biases analyses (at a scale which *is*
> manageable, addressing some of the issues pointed out by hartmans), and
> ethical data sourcing.
I'm not sure I accept that relying on fair use for training, only to then incite direct infringement by users through deliberate or inadvertent reproduction per proposed policies, can be considered "ethical data sourcing". Even if fair use did extend to cover infringing model outputs, it would clearly be against the wishes of the authors. This much is clear from the various generative AI lawsuits already underway[5], including a class action against Bloomberg[6], who joins Software Heritage in the small and shrinking group of OSAID endorsers[7].

- samj

1. https://allenai.org/blog/olmo2
2. https://simonwillison.net/2024/Dec/5/pleias-llms/
3. https://huggingface.co/blog/Pclanglais/common-models
4. https://arxiv.org/abs/2501.18934v1
5. https://generative-ai-newsroom.com/the-current-state-of-genai-copyright-lawsuits-203a1bd0f616
6. https://admin.bakerlaw.com/wp-content/uploads/2024/01/ECF-74-Amended-Complaint.pdf
7. https://opensource.org/ai/endorsements

