Ekaitz Zarraga <[email protected]> writes:
> The process is different from the perspective of the person who writes
> the code: they may not be aware of the copyright violations, the
> hallucinations, etc the LLM will produce. From our side it doesn't
> really matter much.

I largely agree with your line of argument regarding detection of LLM
usage, quality, and fit.

However, I would suggest that there are at least two categorical legal
differences from the project's perspective, which are independent of
observability:

- LLM-generated code is not copyrightable in some jurisdictions (e.g.
  the US). This is a risk to the project's ability to enforce its
  copyleft license terms on the aggregate work.

- There is in effect dual authorship. One of the authors is
  categorically unaware of its infringement -- and is literally a lossy
  recall mechanism trained on copyrighted code. This brings a
  considerably higher risk of infringement, which is *independent* of
  our ability to detect it and is the *real* measure of the project's
  risk of infringement.

> A different story is if we actually want as a community to share our
> *opinion* about the usage of LLMs, or if we want to ask contributors
> to say if their contribution was made using LLMs just for adjusting
> the review criteria in those contributions (they could still lie,
> though).

I think it's reasonable, and prudent, to ask contributors to correctly
present the authorship of the code and its copyrightability. If an LLM
is involved, then I think we may either need to ask the coauthor to
perform a review for infringement, or reject the contribution outright.

Such a policy could mitigate the increased risk of infringement and
protect the copyrightability of our code, which is the foundation of
our ability to assert copyleft license terms. [1]

I appreciate the collegial and measured tone of the discussion on this
topic.

In the interest of transparency, my personal opinion at this time leans
towards prohibiting LLM-generated code beyond a specified threshold
(auto-complete of a line vs. wholesale generation of a test or
function). I believe we could find relevant language in the copyleft
community to define that threshold if needed.

[1]: Contrast that with, say, SBCL, which is a mix of BSD and public
domain code; their risk calculus for accepting non-copyrightable code
is thus different from ours under the GPLv3.
