I just did some review of the case law around copyright and AI-generated code. So
far, every claim has been dismissed. There are some other cases, like the
NYTimes suit, which have more merit and are proceeding.

Which leads me to the opinion that this feels like a premature
optimization. Somebody creating a PR should not also have to submit an SBOM,
which is essentially what we're asking. It adds undue burden and friction to
the process when we should be looking for ways to reduce friction.

My proposal is no disclosures required.

On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:

> According to the thread, the disclosure is for legal purposes. For
> example, confirming that the patch was not produced by OpenAI's service. I
> think having a discussion to clarify AI usage in the project is meaningful.
> I guess many are hesitating because of the lack of clarity in this area.
>
> > I don’t believe or agree with us assuming we should do this for every PR
>
> I am with you, David. Updating the mailing list for PRs is overwhelming for
> both the author and the community.
>
> I also do not feel the co-author field is the best place.
>
> - Yifan
>
> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
>> This is starting to get ridiculous. Disclosure statements on exactly how
>> a problem was solved? What’s next? Time cards?
>>
>> It’s time to accept the world as it is. AI is in the coding toolbox now
>> just like IDEs, linters and code formatters. Some may not like using them,
>> some may love using them. What matters is that the problem was solved and
>> that the code meets whatever quality standard the project upholds, which
>> should be enforced by testing and code reviews.
>>
>> Patrick
>>
>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com>
>> wrote:
>>
>>> David is disclosing it on the mailing list and the GH page. Should the
>>> disclosure be persisted in the commit?
>>>
>>>
>>> Someone asked me to update the ML, but I don’t believe or agree with us
>>> assuming we should do this for every PR; personally, storing this in the PR
>>> description is fine to me, as you are telling the reviewers (who are the
>>> people you need to communicate this to).
>>>
>>>
>>> I’d say we can use the co-authored part of our commit messages to
>>> disclose the actual AI that was used?
>>>
>>>
>>> Heh... I kinda feel dirty doing that… No one does that when they take
>>> something from a blog or Stack Overflow, but when you do that you should
>>> still attribute by linking… which I guess is what Co-Authored does?
>>>
>>> I don’t know… feels dirty...
>>>
>>>
>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <
>>> conta...@bernardobotella.com> wrote:
>>>
>>> That’s a great point. I’d say we can use the co-authored part of our
>>> commit messages to disclose the actual AI that was used?
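>>>
>>> For illustration only (the ticket number, subject line, and address below
>>> are made up, not from any real commit), a trailer along the lines of
>>> GitHub's standard Co-authored-by convention might look like:
>>>
>>>     CASSANDRA-XXXXX: fix flaky repair coordinator test
>>>
>>>     Co-authored-by: Claude (Anthropic) <noreply@anthropic.com>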
>>>
>>>
>>>
>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>
>>> Curious, what are good ways to disclose the information?
>>>
>>> > All of which comes back to: if people disclose whether they used AI, what
>>> > models, and whether they used the code or text the model wrote verbatim or
>>> > used it as scaffolding and then heavily modified everything, I think we'll
>>> > be in a pretty good spot.
>>>
>>> David is disclosing it on the mailing list and the GH page. Should the
>>> disclosure be persisted in the commit?
>>>
>>> - Yifan
>>>
>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com>
>>> wrote:
>>>
>>>> Sent out this patch that was written 100% by Claude:
>>>> https://github.com/apache/cassandra/pull/4266
>>>>
>>>> Claude's license doesn’t have issues with the current ASF policy as far
>>>> as I can tell. If you look at the patch, it’s very clear there isn’t any
>>>> copyrighted material (it’s gluing together C* classes).
>>>>
>>>> I could have written this myself, but I had to focus on code reviews and
>>>> also needed this patch out, so I asked Claude to write it for me. I have
>>>> reviewed it myself and it’s basically the same code I would have written
>>>> (notice how small and focused the patch is; larger stuff doesn’t normally
>>>> pass my peer review).
>>>>
>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>>>>
>>>> +1 to what Josh said
>>>> Sent from my iPhone
>>>>
>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org>
>>>> wrote:
>>>>
>>>>
>>>> Did some more digging. Apparently the way a lot of headline-grabbers
>>>> have been making models reproduce code verbatim is to prompt them with
>>>> dozens of verbatim tokens of copyrighted code as input, so that the
>>>> completion is then very heavily weighted to regurgitate the initial
>>>> implementation. Which makes sense; if you copy/paste 100 lines of
>>>> copyrighted code, the statistically likely completion for that will be
>>>> the initial implementation.
>>>>
>>>> For local LLMs, verbatim reproduction is unlikely for a *different*
>>>> reason but to an apparently comparable degree: they have far fewer
>>>> parameters (32B vs. 671B for DeepSeek, for instance) relative to a
>>>> pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B,
>>>> for instance), so individual tokens from the copyrighted material are
>>>> highly unlikely to actually be *stored* in the model to be reproduced,
>>>> and certainly not in sequence. They don't have the post-generation checks
>>>> claimed by the SOTA models, but are apparently considered to be in the
>>>> "< 1 in 10,000 completions will generate copyrighted code" territory.
>>>>
>>>> When given a human-language prompt, or a multi-agent pipelined "still
>>>> human language but from your architect agent" prompt, the likelihood of
>>>> producing a string of copyrighted code in that manner is statistically
>>>> very, very low. I think we're at far more risk of contributors copy/pasting
>>>> from Stack Overflow or from other projects than we are of modern genAI
>>>> models producing blocks of copyrighted code.
>>>>
>>>> All of which comes back to: if people disclose whether they used AI, what
>>>> models, and whether they used the code or text the model wrote verbatim or
>>>> used it as scaffolding and then heavily modified everything, I think we'll
>>>> be in a pretty good spot.
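>>>>
>>>> As a purely illustrative sketch (wording and proportions invented, not a
>>>> proposed template), such a disclosure in a PR description could read:
>>>>
>>>>     AI usage: initial patch generated by Claude from a written prompt;
>>>>     I reviewed every line and rewrote roughly half of it before
>>>>     submitting. No model output was kept verbatim without review.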
>>>>
>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>
>>>>
>>>> 2. Models that do not do output filtering to restrict the reproduction
>>>> of training data unless the tool can ensure the output is license
>>>> compatible?
>>>>
>>>> 2 would basically prohibit locally run models.
>>>>
>>>>
>>>> I am not for this, for the reasons listed above. There isn’t a
>>>> difference between this and a contributor copying code and sending it our
>>>> way. We still need to validate that the code can be accepted.
>>>>
>>>> We also have the issue of this being a broad stroke. If the user asked a
>>>> model to write a test for code the human wrote, do we reject the
>>>> contribution because they used a local model? That poses very little
>>>> copyright risk, yet our policy would now reject it.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>
>>>> 2. Models that do not do output filtering to restrict the reproduction
>>>> of training data unless the tool can ensure the output is license
>>>> compatible?
>>>>
>>>> 2 would basically prohibit locally run models.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
