On 12/5/26 13:17, Nguyễn Gia Phong via Development of GNU Guix and the
GNU System distribution. wrote:
> On 2026-05-06 at 07:26-04:00, Greg Hogan wrote:
>> On Tue, May 5, 2026 at 5:45 PM pinoaffe <[email protected]> wrote:
>>> And even if llm output is generally thought to be licensable, this
>>> clearly cannot apply to any near-perfect copies of some part of its
>>> training data that it may randomly emit, so incorporating llm output
>>> into a GPL project would likely still be a legal risk
>> This is not happening in 2026. With old models and non-random
>> extraction, perhaps it can be done, but no one is demonstrating a
>> modern LLM returning "near-perfect copies of some part of its training
>> data" for any copyrightable unit of work.
> I would like to see studies backing this claim. Oracle published
> the interim policy for OpenJDK just last month, not last year.
> As for a demonstration try this classic prompt:
>     Complete the following: float Q_rsqrt
Here the user already starts with something taken directly from the
'problematic' material. There is no essential difference between asking
an LLM and asking any other tool, such as Google.
Codex even tells me what it is giving:
> I’m checking the directory contents before deciding whether this is
> just asking for the classic function body.
Codex realizes the request is for a "classic function body" and returns
that.
To accidentally end up with problematic code, all of the following need
to happen:
- The programmer unknowingly makes a reference to problematic code, e.g.
the programmer coincidentally selects the same variable names as John
Carmack.
- There is no context for the LLM to figure out what the user means, so
it has to guess that the user wants something classic.
- The affected code is famous enough that the LLM decides that that is
what the user wants.
- The code is indeed problematic, e.g. proprietary.
- The safeguards from the LLM do not flag this prompt as someone trying
to deliberately/accidentally get copyrighted code.
- The programmer ignores the agent's statement that this is
'classic code'.
- The regurgitated code indeed does what the programmer wants, which is
again totally coincidental, because we assume no intent.
That just does not all happen by accident.
Someone simply copying copyrighted code and lying about it is more
likely than someone accidentally getting it from an LLM.
And note that this example would be fine for us, because the Quake 3
code is GPL (assuming we attribute it). At least unlike Oracle, we
could probably incorporate this code in our software.
Every example of LLMs regurgitating copyrighted text starts with a
prompt that is derived from that copyrighted text.
Hugo
Oh, a follow-up. I asked Codex:
> Me: Why do you label this "the classic function body"?
> Codex: [...] So with only float Q_rsqrt as the prompt and no files in
> the workspace, I inferred you wanted the canonical implementation
> associated with that name.
What else could it mean? It is doing exactly what it is asked.
We need to get beyond such trivial examples if we want to resolve this.