Hi Aigars,

On Sun, May 04, 2025 at 02:27:46PM +0200, Aigars Mahinovs wrote:
> On Sun, 4 May 2025 at 13:12, Wouter Verhelst <[1][email protected]> wrote:
> > On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:
> > > However, here we have a clear and fundamental change happening in the
> > > copyright law level - there is a legal break/firewall that is
> > > happening during training. The model *is* a derivative work of the
> > > source code of the training software, but is *not* a derivative work
> > > of the training data.
> >
> > I would disagree with this statement. How is a model not a derivative
> > work of the training data? Wikipedia defines it as
>
> The simple fact that none of the LLMs have been sued out of existence
> by any copyright owner is de facto proof that it does not work that
> way in the eyes of the judicial system.
This statement is inaccurate, incorrect, and irrelevant.

It is inaccurate, because the legal system does not work that way: the
legality of an action is not defined by the presence or absence of a
lawsuit pertaining to that action. If it were, then any cold case in the
history of mankind must by definition have been legal. More to the
point, in this particular case the lack of lawsuits could be explained
by a variety of factors, including but not limited to the indifference
of the aggrieved party; the inability to finance a lawsuit against "big
tech" companies such as Microsoft or Facebook; or the belief on the side
of the aggrieved party that they may not have a case in the first place,
even though they might have won had they filed suit.

It is incorrect, because the New York Times did in fact file suit
against Microsoft, OpenAI, and other parties over copyright infringement
of their large library of news articles in the creation of ChatGPT[1].
The case is still in court.

It is irrelevant, because in a Debian context, the law is relevant only
to the point that we must obey it in relevant jurisdictions[2]. It does
not have any say over how we define our own rules and ethics. If we
decide as Debian that we believe the training data is in fact part of
the source of a model, then we can set such a rule. We do not just
follow the law in deciding what to distribute and how to do it; if we
did, then there would never have been any need for a non-US, non-free,
or non-free-firmware section of our archive, and the DFSG could have
been a much shorter document.

[1] https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/
[2] where I define "relevant" as "any jurisdiction where not obeying
    the law could result in significant problems for Debian", which in
    practice probably means the US and most of Europe.

> Wikipedia definition is a layman's simplification.
It may be a simplification, but that in and of itself does not make it
incorrect.

I do think that a model is in fact a derivative work of the training
data, because you use the training data to build the model, and without
that training data the model would be different and would not act the
same. Is that a legal definition? No. Is it a basis on which we could
define our own rules and ethics? Sure is.

Thanks,

-- 
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.

