On Sun, 4 May 2025 at 17:30, Wouter Verhelst <[email protected]> wrote:

> It is incorrect, because the New York Times did in fact file suit
> against Microsoft, OpenAI, and other parties related to copyright
> infringement of their large library of news articles in creating
> ChatGPT[1]. The case is still in court.
>
> [1]
> https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/


Thanks for this link, it has been a very interesting read. I have not read
all documents, but the ones I have read paint this picture:

NYT claims copyright infringement (document 1)
* NYT notes that the defendants claim that "their conduct is protected as
“fair use” because their unlicensed use of copyrighted content to train
GenAI models serves a new “transformative” purpose", a position NYT
disputes
* NYT sues them for copyright infringement on the *outputs*, with the
specific note: "Because the outputs of Defendants’ GenAI models compete
with and closely mimic the inputs used to train them, copying Times works
for that purpose is not fair use."
* NYT also explicitly notes that Microsoft (via Bing) already (legally)
provides users with snippets of New York Times content, but in smaller
amounts than can be extracted from the AI models
* NYT also claims that "These systems were used to create multiple
reproductions of The Times’s intellectual property for the purpose of
creating the GPT models that exploit and, in many cases, retain large
portions of the copyrightable expression contained in those works."
* NYT claims that "Unauthorized Reproduction of Times Works During GPT
Model Training" happened
* NYT claims that "Embodiment of Unauthorized Reproductions and Derivatives
of Times Works in GPT Models" happened, and as evidence produces outputs
from the models that reproduce several paragraphs from NYT articles nearly
perfectly
* NYT claims that "Unauthorized Public Display of Times Works in GPT
Product Outputs" happened, as shown in the previous point
* When NYT includes queries with their exhibits, they are *very* specific -
not asking a generic question, but specifically asking what NYT says about
something, what the content of a specific NYT article is, or what a
specific NYT author wrote about a particular place
* It is not clear from the text of the claim whether the actual article
text is indeed stored inside the model, or whether it is being retrieved
and mixed into the response context based on the very specific query
* NYT claims that "Unauthorized Retrieval and Dissemination of Current
News" happened - similar to the Bing news case, but quoting more of the
content
* NYT also claims that model hallucinations assert that NYT published
things that NYT did not in fact publish
* notably, the provided example does *not* include NYT in the text of the
query, so this would not trigger retrieval of specific articles for
reference
* NYT claims: "In the alternative, to the extent an end-user may be liable
as a direct infringer based on output of the GPT-based products, Defendants
materially contributed to and directly assisted with the direct
infringement perpetrated by end-users of the GPT-based products" -
specifically, they claim that (in case the court decides that the actual
infringers are the end users) Microsoft is still liable for allowing such
requests and responses

OpenAI responds (document 52)
* OpenAI claims that their models will not produce verbatim copies of the
NYT articles in normal use, and alleges manipulation of the interface,
including queries that contained the text of the articles in the context of
the question (either directly, via upload, or via a gatherable link)
* OpenAI claims that it is "fair use under copyright law to use publicly
accessible content to train generative AI models to learn about language,
grammar, and syntax, and to understand the facts that constitute humans’
collective knowledge", as neither facts nor the rules of language are
copyrightable
* "The general rule of law is, that the noblest of human
productions—knowledge, truths ascertained, conceptions, and ideas—become,
after voluntary communication to others, free as the air to common use." is
quoted as a foundational part of copyright law
* OpenAI claims that some actions (like gathering the data sets) happened
more than three years ago and thus fall outside the statute of limitations
* OpenAI claims that contributing to copyright infringement by end users
requires actual knowledge of specific infringement - a generic possibility
is not sufficient
* While explaining the LLM training process, OpenAI also describes how data
sets like WebText, WebText2 and Common Crawl were used for training - such
data sets (held and distributed by third parties, not Debian) could be used
for reproduction of (otherwise) free models
* OpenAI specifically calls out that their early models were surprisingly
able to translate from French to English, despite the training data having
been specifically cleaned of non-English sources
* OpenAI claims that "Indeed, it has long been clear that the
non-consumptive use of copyrighted material (like large language model
training) is protected by fair use" ... "Since Congress codified that
doctrine in 1976 (courts should “adapt” defense to “rapid technological
change”), courts have used it to protect useful innovations like home video
recording, internet search, book search tools, reuse of software APIs, and
many others."
* "These precedents reflect the foundational principle that copyright law
exists to control the dissemination of works in the marketplace—not to
grant authors “absolute control” over all uses of their works."
* "Copyright is not a veto right over transformative technologies that
leverage existing works internally—i.e., without disseminating them—to new
and useful ends, thereby furthering copyright’s basic purpose without
undercutting authors’ ability to sell their works in the marketplace"
* OpenAI claims that model regurgitation and hallucination are uncommon and
undesirable properties of the models
* Regurgitation can happen when some text appears many times in the
training data in the same form, because it has already been copied to many
diverse sources
* OpenAI explains that hallucinations show the actual, statistical basis of
the responses
* OpenAI claims that the NYT claims are misleading: even when asked for
specific quotes from specific articles, the model would actually output
random parts of those articles, which the NYT complaint cut down to give
the impression of precise recall
* OpenAI notes that NYT chose to only query articles that are between 2.5
and 20 years old (and that have, presumably, been quoted around the web)

Microsoft responds (document 65)
* also detailing that the NYT prompts that caused quoting of the NYT
articles involved *very* specific and unrealistic queries that often
included whole paragraphs of the specific articles that the prompt was
fishing for

The documents after that are mostly fighting about rules of discovery and
trying to dismiss some charges in advance of actual arguments.

Comments from the judge seem to indicate a focus on two questions:
* whether the use of copyrighted material in the training process is fair
use on the basis of the use being sufficiently transformative
* under what specific conditions it is or is not possible to get the model
to reproduce parts of copyrighted materials

My opinion:

The question of the models themselves being derived works of the training
data comes up only in the context of verbatim copies of inputs appearing in
outputs. The provided examples of such cases look extensively doctored to
me (and to OpenAI's experts), to the point of the user providing half of a
widely quoted article and then seeing the model statistically continue the
article in some semi-random parts. At that point it is not storage, but
just abuse of statistics. And modern models are protected against such
prompting. The models are literally too small to contain all the articles
of the training data, even with the best compression. I believe that OpenAI
and Microsoft will be able to show that the queries NYT provided as
examples are themselves introducing the copyrighted material.
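
To illustrate that size argument with rough arithmetic (using publicly
reported GPT-3 figures - 175 billion parameters, roughly 570 GB of
filtered training text, about 45 TB of raw Common Crawl - which I am
assuming here purely for the sake of a back-of-envelope estimate):

```python
# Back-of-envelope: can the model weights even hold the training text?
# All figures are publicly reported GPT-3 numbers, assumed for illustration.
params = 175e9                 # reported parameter count
bytes_per_param = 2            # 16-bit weights
weights_bytes = params * bytes_per_param   # total size of the weights

filtered_text_bytes = 570e9    # reported filtered training corpus
raw_crawl_bytes = 45e12        # reported raw Common Crawl before filtering

print(f"weights: {weights_bytes / 1e9:.0f} GB")
print(f"weights / filtered text: {weights_bytes / filtered_text_bytes:.2f}")
print(f"weights / raw crawl: {weights_bytes / raw_crawl_bytes:.4f}")
```

On those assumed numbers, the weights come to about 350 GB - well under
1% of the raw crawl and smaller than even the filtered text, and those
same weights must also encode grammar, facts and every other corpus, so a
verbatim copy of the whole training set simply does not fit. Text
compression muddies the estimate somewhat, so this is only a rough
plausibility check, not a proof.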

As for fair use in the training process, there is quite a lot of precedent
for this; there are many different examples at
https://fairuse.stanford.edu/overview/fair-use/cases/

I personally see the thumbnail case, the Richard Prince collage case and
the Google Books case as most relevant.

The thumbnail case shows that even a technically trivial and deterministic
pure-software transformation of an image into a smaller image can be fair
use in the right context.
The Richard Prince case shows that even fully reproducing complete copies
of multiple copyrightable works in a new work, and distributing that, can
be transformative fair use.
Google Books indexed millions of books submitted by libraries for full-text
search across the books (a database of the actual texts of the books) and
would provide users with excerpts from the copyrighted books as part of
responses to queries.

All those cases were found to be fair use and thus not infringing on the
copyright of the original works.

-- 
Best regards,
    Aigars Mahinovs
