On Thu, Jul 4, 2024 at 8:42 PM Matthew Brett <matthew.br...@gmail.com>
wrote:

> Hi,
>
> On Thu, Jul 4, 2024 at 6:44 PM Ralf Gommers <ralf.gomm...@gmail.com>
> wrote:
> >
> >
> >
> > On Thu, Jul 4, 2024 at 5:08 PM Matthew Brett <matthew.br...@lis.ac.uk>
> >> wrote:
> >>
> >> Hi,
> >>
> >> On Thu, Jul 4, 2024 at 3:41 PM Ralf Gommers <ralf.gomm...@gmail.com>
> >> > wrote:
> >> >
> >> >
> >> >
> >> > On Thu, Jul 4, 2024 at 1:34 PM Matthew Brett <matthew.br...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> On Thu, Jul 4, 2024 at 12:20 PM Ralf Gommers <ralf.gomm...@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Thu, Jul 4, 2024 at 12:55 PM Matthew Brett <
> >> >> > matthew.br...@gmail.com> wrote:
> >> >> >>
> >> >> >> Sorry - reposting from my subscribed address:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> Sorry to top-post!  But - I wanted to bring the discussion back to
> >> >> >> licensing.  I have great sympathy for the ecological and
> >> >> >> code-quality concerns, but licensing is a separate question,
> >> >> >> and, it seems to me, an urgent question.
> >> >> >>
> >> >> >> Imagine I asked some AI to give me code to replicate a particular
> >> >> >> algorithm A.
> >> >> >>
> >> >> >> It is perfectly possible that the AI will largely or completely
> >> >> >> reproduce some existing GPL code for A, from its training data.
> >> >> >> There is no way that I could know that the AI has done that
> >> >> >> without some substantial research.  Surely, this is a license
> >> >> >> violation of the GPL code?   Let's say we accept that code.
> >> >> >> Others pick up the code and modify it for other algorithms.  The
> >> >> >> code-base gets infected with GPL code, in a way that will make it
> >> >> >> very difficult to disentangle.
> >> >> >
> >> >> >
> >> >> > This is a question that's topical for all of open source, and
> >> >> > for usages of Copilot & co. We're not going to come to any
> >> >> > insightful answer here that is specific to NumPy. There's a ton
> >> >> > of discussion in a lot of places; someone needs to
> >> >> > research/summarize that to move this forward. Debating it from
> >> >> > scratch here is unlikely to yield new arguments imho.
> >> >>
> >> >> Right - I wasn't expecting a detailed discussion on the merits - only
> >> >> some thoughts on policy for now.
> >> >>
> >> >> > I agree with Rohit's "it is probably hopeless to enforce a ban
> >> >> > on AI generated content". There are good ways to use AI code
> >> >> > assistant tools and bad ones; we in general cannot know whether
> >> >> > AI tools were used at all by a contributor (just like we can't
> >> >> > know whether something was copied from Stack Overflow), nor,
> >> >> > when they were used, whether the content is derived enough to
> >> >> > fall under some other license. The best we can do here is add a
> >> >> > warning to the contributing docs and PR template about this,
> >> >> > saying the contributor needs to be the author, so copied or
> >> >> > AI-generated content must not contain anything complex enough to
> >> >> > be copyrightable (none of the linked PRs come close to this
> >> >> > threshold).
> >> >>
> >> >> Yes, these PRs are not the concern - but I believe we do need to plan
> >> >> now for the future.
> >> >>
> >> >> I agree it is hard to enforce, but it seems to me it would be a
> >> >> reasonable defensive move to say - for now - that authors will
> >> >> need to take full responsibility for copyright, and that, as of
> >> >> now, AI-generated code cannot meet that standard, so we require
> >> >> authors to turn off AI generation when writing code for NumPy.
> >> >
> >> >
> >> > I don't think that that is any more reasonable than asking
> >> > contributors to not look at Stack Overflow at all, or to not look
> >> > at any other code base for any reason. I bet many contributors may
> >> > not even know whether the auto-complete functionality in their IDE
> >> > comes from a regular language server (see https://langserver.org/)
> >> > or an AI-enhanced one.
> >> >
> >> > I think the two options are:
> >> > (A) do nothing yet; wait until the tools mature to the point where
> >> > they can actually do what you're worrying about here (at which
> >> > point there may be more insight/experience in the open source
> >> > community about how to deal with the problem).
> >>
> >> Have we any reason to think that the tools are not doing this now?
> >
> >
> > Yes, namely that tools aren't capable yet of generating the type of
> > code that would land in NumPy. And if it's literal code from some
> > other project for the few things that are standard (e.g., C/C++ code
> > for a sorting algorithm), we'd judge anyway whether it was authored
> > by the PR submitter or not (I've caught many issues like that in
> > large PRs from new contributors, e.g. code translated directly from
> > Matlab).
> >
> >>
> >> I ran one of my exercises through AI many months ago, and it found
> >> and reproduced the publicly available solution, including the
> >> comments, verbatim.
> >
> >
> > Not close to the same, not really a relevant data point.
>
> The question I was trying to address was: do we have any reason to
> think that current AI will not reproduce publicly-available code
> verbatim?   I don't think we do, and my exercise was an example of AI
> doing just that.
>

> >> We do agree, enforcement is difficult - but I do not think AI
> >> autogenerated code and looking at StackOverflow are equivalent.  There
> >> is no reasonable mechanism by which looking at StackOverflow could
> >> result in copy-paste of a substantial block of GPL'ed (or other
> >> unsuitably licensed) code.    I do not think we have to be pure here,
> >> just reassert - you have to own the copyright to the code, or point
> >> to the license of the place you got it.  You can't do that if
> >> you've used AI.   Don't use AI (to the extent you can prevent it).
> >
> >
> > "don't use AI" is meaninglessly broad. You ignored my example about
> > auto-completion.
>
> It doesn't seem that hard to make it more explicit and useful for our
> case - which is avoiding copyright violations.   And we're not writing
> a legal document here, we are asking contributors to help us out by
> avoiding AI that might have picked up copyrighted code.    So, if we
> wanted to unpack that in more detail, we could say something like:
>
> """ Our concern is that the AI will pick up copyrighted code, for
> which we will not be able to give proper attribution, or which we
> should not use according to its license.   In practice, this means we
> ask you not to use substantial AI-generated code, by which we mean
> more than one or two lines derived from AI.   Thus, editor
> autocompletion of code within a single line is clearly fine, but
> generation of whole functions or classes is not.
> """
>

There are tons of non-AI-powered tools that can generate whole functions,
classes, or even whole projects, both via auto-completion in IDEs and via
custom tools/scripts (e.g., cookiecutter templates, websites you build in
15 minutes with something like Ruby on Rails, etc.). And they've existed
for a very long time.

I'm not myself a user of the types of tools we're discussing here, but
if I were, and I needed, for example, code to create a figure with
Matplotlib - with a 2x3 subplot layout, a legend, and triangular data
points, plus some other bells and whistles - then I'd be perfectly
happy with 20 lines of code from a tool that does that. It saves a
bunch of time, and I'm perfectly capable of judging whether there's any
risk attached to the generated code.
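
To make that concrete, here is a rough, hand-written sketch of the kind
of 20-line snippet I have in mind (purely illustrative - the variable
names and data are made up, not output from any actual tool):

    import numpy as np
    import matplotlib.pyplot as plt

    # A 2x3 subplot grid with triangular markers and a legend, as
    # described above.
    rng = np.random.default_rng(seed=0)
    x = np.linspace(0.0, 10.0, 50)

    fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
    for i, ax in enumerate(axes.flat):
        # Fabricated example data: one noisy sinusoid per panel.
        y = np.sin(x + i) + 0.1 * rng.standard_normal(x.size)
        ax.plot(x, y, marker="^", linestyle="-", label=f"series {i}")
        ax.set_title(f"panel {i}")
        ax.legend(loc="upper right")

    fig.tight_layout()
    plt.show()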

What matters in the end is educating contributors and ensuring they
understand how the copyright of their code works. And that they take
responsibility for it.


> > I'll give you another: just a few days ago I was listening to a
> > podcast about a developer who was fluent in Python and less fluent
> > in Rust, asking an LLM-powered tool to literally translate code that
> > the author themselves had written from Python to Rust. The tool
> > didn't generate working code, but it was 90% correct. Which allowed
> > the author to finish up the task in half an hour, rather than
> > spending half a day or more trying to write the Rust code from
> > scratch.
> >
> > That type of usage is very much valid, and can be a significant
> > productivity enhancer. Such tool-assisted developer workflows are
> > only going to increase over time. We shouldn't try to forbid that
> > outright; it makes little sense to do so, nor is it enforceable.
>
> I really don't think we need to go into the question of whether AI is
> good or bad, valid or invalid.   It's clearly a risk, in the sense
> that it may include copyrighted code.   That risk is obvious and
> current.


So is translating Matlab or R code - which is way more common and
current today. So, to some people, is not having a CLA. Etc.

> It may be a substantial benefit in due course, over what a skilled
> programmer can do, for tasks other than generating stuff that is more
> or less boilerplate, or language translation.   But for now, I think
> it's fair to say that there is nothing for which we _need_ AI,


What "we" need is not the point. You are trying to prescribe workflow to
other contributors. Which is a big no-no.


> and therefore, on balance, it seems reasonable to me to have a careful
> policy, at least for now.
>

If you want to write an actual guidance text on how to be careful about
the copyright of submitted code, covering AI-generated code as well as
other common problems we've seen (Matlab/R/SO/etc.), we can review it.
But "no AI tools" or "no more than 1-2 lines" definitely will not work.

Cheers,
Ralf