Hi,
On 2026-02-21 20:14, Craig Brozefsky wrote:
Ekaitz Zarraga <[email protected]> writes:
The process is different from the perspective of the person who writes
the code: they may not be aware of the copyright violations, the
hallucinations, etc., that the LLM will produce. From our side it doesn't
really matter much.
I largely agree with your line of argument regarding detection of LLM
usage, quality, and fit.
However, I would suggest that there are at least two categorical legal
differences from the project's perspective, which are independent of
observability:
- LLM-generated code is not copyrightable in some jurisdictions
(e.g. the US). This is a risk to the project's ability to enforce its
copyleft license terms on the aggregate work.
- There is in effect dual authorship. One of the authors is
categorically unaware of its infringement -- and is literally a lossy
recall mechanism trained on copyrighted code. This brings a
considerably higher risk of infringement, which is *independent* of
our ability to detect it, and is the *real* measure of the risk of
infringement by the project.
Here I'm not sure how the thing works, but I believe that when people
send us code they sign it as theirs, so if there is any infringement,
it is theirs.
If they took it from the internet (say, Stack Overflow) the issue is
similar: they would probably forget where they took it from, and that
piece of code could also be under someone else's copyright. The fact
that there's an opaque machine in the middle makes the issue worse, but
it's a problem we already had and didn't pay much attention to.
We assumed good faith; maybe we shouldn't have.
A different story is whether we, as a community, actually want to share
our *opinion* about the usage of LLMs, or whether we want to ask
contributors to say if their contribution was made using an LLM, just
to adjust the review criteria for those contributions (they could
still lie, though).
I think it's reasonable, and prudent, to ask contributors to correctly
present the authorship of the code and its copyrightability. If an LLM
is involved, then I think we may either need to ask the co-author to
perform a review for infringement, or reject it outright. Such a
policy could mitigate the increased risk of infringement and protect
our copyrightability, which is the foundation of our ability to assert
copyleft license terms. [1]
We already ask for the authorship and the copyrightability, don't we?
Isn't that what we, the committers, sign for? (I'm talking about
signing off the commits, which is not required for our own commits.)
Maybe I misunderstood my mission! But I'd say we sign to confirm that
we were given permission to add that piece of code to the project, thus
accepting the terms and implying that the person who sent the patch is
legally able to make that decision over the proposed changes.
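For illustration, and assuming what we mean here is the standard Git
sign-off mechanism (the contributor name below is made up), the trailer
is added with "git commit -s" and ends up as the last line of the
commit message:

    $ git commit -s -m "Fix the foo builder."

    resulting commit message:

        Fix the foo builder.

        Signed-off-by: Jane Hacker <[email protected]>

In projects that adopt the Developer Certificate of Origin, that
trailer is precisely the assertion that the signer has the right to
submit the change under the project's license, which is the kind of
statement we are discussing.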
Maybe we can just state it more clearly in the docs, explicitly
mentioning that one can only contribute code they own or have the
rights over.
This would in practice ban LLM-generated code that is just copy-pasted,
but it would also help with other problematic cases we were overlooking
(people sharing code they wrote on company time, or code written by
their employees, or copied from somewhere else...).
Again, I think we are already doing this, so it wouldn't change much.
I'm not sure, though, so please correct me if I'm wrong here.
I appreciate the collegial and measured tone of the discussion on this
topic.
I agree.