On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett <matthew.br...@gmail.com> wrote:
> Hi,
>
> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers <ralf.gomm...@gmail.com> wrote:
> >
> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett <matthew.br...@gmail.com> wrote:
> >>
> >> Sorry - reposting from my subscribed address:
> >>
> >> Hi,
> >>
> >> Sorry to top-post! But - I wanted to bring the discussion back to
> >> licensing. I have great sympathy for the ecological and code-quality
> >> concerns, but licensing is a separate question, and, it seems to me,
> >> an urgent question.
> >>
> >> Imagine I asked some AI to give me code to replicate a particular
> >> algorithm A.
> >>
> >> It is perfectly possible that the AI will largely or completely
> >> reproduce some existing GPL code for A, from its training data. There
> >> is no way that I could know that the AI has done that without some
> >> substantial research. Surely, this is a license violation of the GPL
> >> code? Let's say we accept that code. Others pick up the code and
> >> modify it for other algorithms. The code-base gets infected with GPL
> >> code, in a way that will make it very difficult to disentangle.
> >
> > This is a question that's topical for all of open source, and usages
> > of CoPilot & co. We're not going to come to any insightful answer
> > here that is specific to NumPy. There's a ton of discussion in a lot
> > of places; someone needs to research/summarize that to move this
> > forward. Debating it from scratch here is unlikely to yield new
> > arguments imho.
>
> Right - I wasn't expecting a detailed discussion on the merits - only
> some thoughts on policy for now.
>
> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI
> > generated content". There are good ways to use AI code assistant
> > tools and bad ones; we in general cannot know whether AI tools were
> > used at all by a contributor (just like we can't know whether
> > something was copied from Stack Overflow), nor, when they were used,
> > whether the content is derived enough to fall under some other
> > license.
> > The best we can do here is add a warning to the contributing docs
> > and PR template about this, saying the contributor needs to be the
> > author, so copied or AI-generated content must not contain things
> > that are complex enough to be copyrightable (none of the linked PRs
> > come close to this threshold).
>
> Yes, these PRs are not the concern - but I believe we do need to plan
> now for the future.
>
> I agree it is hard to enforce, but it seems to me it would be a
> reasonable defensive move to say - for now - that authors will need to
> take full responsibility for copyright, and that, as of now,
> AI-generated code cannot meet that standard, so we require authors to
> turn off AI-generation when writing code for NumPy.

I don't think that that is any more reasonable than asking contributors
to not look at Stack Overflow at all, or to not look at any other code
base for any reason. I bet many contributors may not even know whether
the auto-complete functionality in their IDE comes from a regular
language server (see https://langserver.org/) or an AI-enhanced one.

I think the two options are:

(A) do nothing yet, and wait until the tools mature to the point where
they can actually do what you're worrying about here (at which point
there may be more insight/experience in the open source community about
how to deal with the problem);

(B) add a note along the lines I suggested as an option above ("... not
contain things that are complex enough to be copyrightable ...").

Cheers,
Ralf
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com