On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett <matthew.br...@gmail.com> wrote:
> Hi,
>
> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers <ralf.gomm...@gmail.com> wrote:
> >
> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett <matthew.br...@gmail.com> wrote:
> >>
> >> Sorry - reposting from my subscribed address:
> >>
> >> Hi,
> >>
> >> Sorry to top-post! But - I wanted to bring the discussion back to
> >> licensing. I have great sympathy for the ecological and code-quality
> >> concerns, but licensing is a separate question, and, it seems to me,
> >> an urgent question.
> >>
> >> Imagine I asked some AI to give me code to replicate a particular
> >> algorithm A.
> >>
> >> It is perfectly possible that the AI will largely or completely
> >> reproduce some existing GPL code for A, from its training data. There
> >> is no way that I could know that the AI has done that without some
> >> substantial research. Surely, this is a license violation of the GPL
> >> code? Let's say we accept that code. Others pick up the code and
> >> modify it for other algorithms. The code-base gets infected with GPL
> >> code, in a way that will make it very difficult to disentangle.
> >
> > This is a question that's topical for all of open source, and usages
> > of CoPilot & co. We're not going to come to any insightful answer
> > here that is specific to NumPy. There's a ton of discussion in a lot
> > of places; someone needs to research/summarize that to move this
> > forward. Debating it from scratch here is unlikely to yield new
> > arguments imho.
>
> Right - I wasn't expecting a detailed discussion on the merits - only
> some thoughts on policy for now.
>
> > I agree with Rohit's: "it is probably hopeless to enforce a ban on AI
> > generated content". There are good ways to use AI code assistant
> > tools and bad ones; we in general cannot know whether AI tools were
> > used at all by a contributor (just like we can't know whether
> > something was copied from Stack Overflow), nor, when they were used,
> > whether the content is derived enough to fall under some other
> > license.
> > The best we can do here is add a warning to the contributing docs
> > and PR template about this, saying the contributor needs to be the
> > author, so copied or AI-generated content must not contain things
> > that are complex enough to be copyrightable (none of the linked PRs
> > come close to this threshold).
>
> Yes, these PRs are not the concern - but I believe we do need to plan
> now for the future.
>
> I agree it is hard to enforce, but it seems to me it would be a
> reasonable defensive move to say - for now - that authors will need to
> take full responsibility for copyright, and that, as of now,
> AI-generated code cannot meet that standard, so we require authors to
> turn off AI-generation when writing code for NumPy.

I don't think that that is any more reasonable than asking contributors
to not look at Stack Overflow at all, or to not look at any other code
base for any reason. I bet many contributors may not even know whether
the auto-complete functionality in their IDE comes from a regular
language server (see https://langserver.org/) or an AI-enhanced one.

I think the two options are:

(A) do nothing yet, and wait until the tools mature to the point where
they can actually do what you're worrying about here (at which point
there may be more insight/experience in the open source community about
how to deal with the problem);

(B) add a note along the lines I suggested as an option above ("... not
contain things that are complex enough to be copyrightable ...").

Cheers,
Ralf
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com