Ekaitz Zarraga writes:

> The process is different from the perspective of the person who writes
> the code: they may not be aware of the copyright violations, the
> hallucinations, etc the LLM will produce. From our side it doesn't
> really matter much.

I largely agree with your line of argument regarding detection of LLM
usage, quality, and fit.

However, I would suggest that there are at least two categorical legal
differences from the project's perspective, which are independent of
observability:

- LLM-generated code is not copyrightable in some jurisdictions (e.g. the
  US).  This is a risk to the project's ability to enforce its copyleft
  license terms on the aggregate work.

- There is in effect dual authorship.  One of the authors is
  categorically unaware of its infringement -- and is literally a lossy
  recall mechanism trained on copyrighted code.  This brings a
  considerably higher risk of infringement, which is *independent* of
  our ability to detect it, and is the *real* measure of risk of
  infringement by the project.

> A different story is if we actually want as a community to share our
> *opinion* about the usage of LLMs, or if we want to ask contributors
> to say if their contribution was made using LLMs just for adjusting
> the review criteria in those contributions (they could still lie,
> though).

I think it's reasonable, and prudent, to ask contributors to correctly
present the authorship of the code, and its copyrightability.  If
there is an LLM involved, then I think we may either need to ask the
coauthor to perform a review for infringement, or reject the
contribution outright.  Such a policy could mitigate the increased risk
of infringement and protect our copyrightability, which is the
foundation of our ability to assert copyleft license terms. [1]

I appreciate the collegial and measured tone of the discussion on this
topic.  In the interest of transparency, my personal opinion at this
time is leaning towards prohibition of LLM-generated code beyond a
specified threshold (auto-complete of a line vs. wholesale generation of
a test or function).  I believe we could find relevant language in the
copyleft community to define that threshold if needed.

[1]: Contrast that with, say, SBCL, which is a mix of BSD and public
domain, and thus their risk calculus for accepting non-copyrightable
code is different from ours under the GPLv3.
