On Sat, 10 May 2025 at 01:17, Thorsten Glaser <[email protected]> wrote:
> >I realized that I have one additional generic concern: You claim that
> >models are a derivate work of their training input.
>
> Yes. This is easily shown, for example by looking at how they work,
> https://explainextended.com/2023/12/31/happy-new-year-15/ explained
> this well, and in papers like “Extracting Training Data from ChatGPT”.
> It is a sort of lossy compression that has shown to be sufficiently
> un-lossy enough (urgs, forgive my lack of English) that recognisable
> “training data” can be recalled, and the operators’ “fix” was to add
> filters to the prompts, not to make it impossible, because they cannot.
That is both false and misleading. Compression, even lossy compression, preserves a correlation between the content of the output and the content of *one* particular input; that is what makes it compression. An algorithm that only stores and reproduces an *average* across a wide set of inputs cannot be any kind of compression. It is data mining.

In all the provided examples, the claim that the model reproduces a particular input is simply false. All it does is continue an already started text in the way that is most probable across *all* inputs. That probability only shows similarity to a specific input in two cases: either the prompt is specially constructed to be so unique that it matches exactly one training document (which means the copyright violation has already happened in the *question* that you are entering into the LLM), or the same text or expression is spread across many input documents and is in fact a common representation of a fact (like "The capital of Spain is ... Madrid").

The chance of an LLM reproducing one specific input document decreases as you increase the training base, and in the end it is the same as the chance of a human accidentally writing or composing something that is in fact a copy of some other work (whether they have seen it before or not, whether they remember seeing it or not). That is treated as a copyright violation for the human as well; you can see this all the time in lawsuits over similarities in music. And in the same way humans are encouraged not to do this, and in the same way there is no guarantee that something written by a human is not an illegal reproduction of copyrighted material.

As soon as you transform an input document into statistical probabilities (a transformation that is not reversible and whose output bears no resemblance to the input document), there is no copyrightable content left and no derived work.
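To make the "average across all inputs" point concrete, here is a toy sketch in Python. The corpus and function names are made up for illustration; real models are vastly larger, but the principle is the same: only aggregate statistics are stored, and a widely repeated fact dominates them while no single document is recoverable.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus (invented for this example).
corpus = [
    "the capital of spain is madrid",
    "the capital of spain is madrid and it is large",
    "we visited madrid the capital of spain",
    "the capital of france is paris",
]

# A toy bigram "model": for each word, count which words follow it
# across ALL documents at once. Individual documents are not stored.
follows = defaultdict(Counter)
for doc in corpus:
    words = doc.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def most_probable_next(word):
    # Continue the text in the way most probable across all inputs.
    return follows[word].most_common(1)[0][0]

# The fact shared by several documents wins the statistics:
print(most_probable_next("is"))  # "madrid" (shared across documents)
```

Note that nothing here correlates the output with *one* particular input; the counts are a blend of the whole corpus, which is exactly the difference between data mining and compression.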
As soon as one step produces a non-copyrightable intermediate product, the chain of derivative works is broken and the copyright of the training data no longer applies. This is well established in the context of data mining. More than that, there is a *lot* of fair-use freedom in data mining; see Fox News v. TVEyes, Inc., 43 F. Supp. 3d 379 (S.D.N.Y. 2014) for example.

And yes, a simple one-way data transformation by software does destroy copyright. It is trivial to see this from the simplest examples and work upwards. If I run "wc" on a copyrighted work, the number of words in the document is *not* a work derived from the original document; there is simply not enough creative expression for copyright law to even apply to that single integer. The same holds if I compute a sha256 checksum of the document: the checksum is not an object of copyright and not a derived work. The same if I count the occurrences of individual words (words and word lists are non-copyrightable as well; we already ship wordlists inside Debian). The same if I calculate the probabilities of one word following another. And that is basically what an LLM is: a list of probabilities.

If your entire proposal is based on this assumption about how copyright and copyright law work, I would expect something more substantial, like court decisions supporting this radical new interpretation. That would mean overturning things like Article 4 of EU Directive 2019/790, which grants a near-complete copyright exception for text and data mining and which the EU AI Act explicitly refers to in the context of training data, and overturning a *ton* of already decided "fair use" cases in the USA.
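The chain of transformations above can be sketched in a few lines of Python. The input text is a stand-in for a copyrighted work (here a public-domain phrase); each step discards more expression until only bare statistics remain:

```python
import hashlib
from collections import Counter

text = "to be or not to be that is the question"  # stand-in for a work

# 1. Word count ("wc"): a single integer, no creative expression at all.
n_words = len(text.split())  # 10

# 2. A checksum: a one-way fingerprint, not a reproduction of the work.
digest = hashlib.sha256(text.encode()).hexdigest()

# 3. Word frequencies: a bare word list with counts.
freq = Counter(text.split())  # {'to': 2, 'be': 2, ...}

# 4. Probabilities of one word following another; essentially what an
#    LLM stores, at a vastly larger scale.
pairs = Counter(zip(text.split(), text.split()[1:]))
p_be_after_to = pairs[("to", "be")] / freq["to"]  # 1.0

print(n_words, digest[:8], p_be_after_to)
```

Each of these outputs is a one-way derivation from which the original text cannot be reconstructed, which is the sense in which the copyright chain is argued to break.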

