Re: Accepting AI generated contributions

David Capwell Wed, 23 Jul 2025 11:32:09 -0700

> David is disclosing it in the maillist and the GH page. Should the disclosure 
> be persisted in the commit?


Someone asked me to update the ML, but I don’t agree that we should do this 
for every PR. Personally, I think putting this in the PR description is fine, 
since that tells the reviewers, who are the people you need to communicate 
this to.


> I’d say we can use the co-authored part of our commit messages to disclose 
> the actual AI that was used? 

Heh... I kinda feel dirty doing that… No one does that when they take something 
from a blog or Stack Overflow, but when you do that you should still attribute 
by linking… which I guess is what Co-authored-by does?

I don’t know… feels dirty...
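
For illustration only, a minimal sketch of what disclosing the tool through a 
commit trailer could look like. The helper name, the tool name, and the 
address below are hypothetical, and nothing here is an agreed project 
convention; it only assumes the usual git layout where trailers form the last 
paragraph of the commit message.

```python
import re

# A trailer line looks like "Key: value" (e.g. "Signed-off-by: Name <addr>").
TRAILER_RE = re.compile(r"^[A-Za-z0-9-]+: \S.*$")


def add_disclosure_trailer(message: str, tool: str, email: str) -> str:
    """Append a Co-authored-by trailer naming the tool, at most once.

    `tool` and `email` are placeholders chosen by the contributor; this
    sketch does not assume any particular identity for a given AI tool.
    """
    trailer = f"Co-authored-by: {tool} <{email}>"
    body = message.rstrip("\n")
    if trailer in body.split("\n"):
        return body + "\n"  # already disclosed, leave the message alone

    # If the last paragraph is already a trailer block, extend it;
    # otherwise start a new paragraph for the trailer.
    paragraphs = body.split("\n\n")
    last = paragraphs[-1].split("\n")
    in_trailer_block = len(paragraphs) > 1 and all(
        TRAILER_RE.match(line) for line in last
    )
    separator = "\n" if in_trailer_block else "\n\n"
    return body + separator + trailer + "\n"


if __name__ == "__main__":
    msg = "Add small helper\n\nGlue existing classes together.\n"
    print(add_disclosure_trailer(msg, "Example AI Assistant",
                                 "assistant@example.invalid"))
```

Whether the trailer, the PR description, or both is the right place is the 
open question in this thread; the sketch only shows the mechanics.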


> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <[email protected]> 
> wrote:
> 
> That’s a great point. I’d say we can use the co-authored part of our commit 
> messages to disclose the actual AI that was used? 
> 
> 
> 
>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <[email protected]> wrote:
>> 
>> Curious, what are the good ways to disclose the information? 
>> 
>> > All of which comes back to: if people disclose whether they used AI, what 
>> > models, and whether they used the code or text the model wrote verbatim or 
>> > used it as scaffolding and then heavily modified everything, I think 
>> > we'll be in a pretty good spot.
>> 
>> David is disclosing it in the maillist and the GH page. Should the 
>> disclosure be persisted in the commit? 
>> 
>> - Yifan
>> 
>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <[email protected] 
>> <mailto:[email protected]>> wrote:
>>> Sent out this patch that was written 100% by Claude: 
>>> https://github.com/apache/cassandra/pull/4266
>>> 
>>> Claude’s license doesn’t have issues with the current ASF policy as far as I 
>>> can tell.  If you look at the patch it’s very clear there isn’t any 
>>> copyrighted material (it’s gluing together C* classes).
>>> 
>>> I could have written this myself, but I had to focus on code reviews and 
>>> also needed this patch out, so I asked Claude to write it for me so I could 
>>> focus on reviews.  I have reviewed it myself and it’s basically the same 
>>> code I would have written (notice how small and focused the patch is; 
>>> larger stuff doesn’t normally pass my peer review).
>>> 
>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> +1 to what Josh said
>>>> Sent from my iPhone
>>>> 
>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> 
>>>>> Did some more digging. Apparently the way a lot of headline-grabbers have 
>>>>> been making models reproduce code verbatim is to prompt them with dozens 
>>>>> of verbatim tokens of copyrighted code as input where completion is then 
>>>>> very heavily weighted to regurgitate the initial implementation. Which 
>>>>> makes sense; if you copy/paste 100 lines of copyrighted code, the 
>>>>> statistically likely completion for that will be that initial 
>>>>> implementation.
>>>>> 
>>>>> For local LLMs, verbatim reproduction is unlikely for a different 
>>>>> reason, but apparently comparably so: they have far fewer parameters 
>>>>> (32B vs. 671B for Deepseek for instance) relative to a pre-training 
>>>>> corpus of trillions of tokens (30T in the case of Qwen3-32B for 
>>>>> instance), so the individual tokens from the copyrighted material are 
>>>>> highly unlikely to be actually stored in the model to be reproduced, and 
>>>>> certainly not in sequence. They don't have the post-generation checks 
>>>>> claimed by the SOTA models, but are apparently considered in the "< 1 in 
>>>>> 10,000 completions will generate copyrighted code" territory.
>>>>> 
>>>>> When given a human-language prompt, or a multi-agent pipelined "still 
>>>>> human language but from your architect agent" prompt, the likelihood of 
>>>>> producing a string of copyrighted code in that manner is statistically 
>>>>> very, very low. I think we're at far more risk of contributors 
>>>>> copy/pasting from Stack Overflow or from other projects than we are from 
>>>>> modern genAI models producing blocks of copyrighted code.
>>>>> 
>>>>> All of which comes back to: if people disclose whether they used AI, what 
>>>>> models, and whether they used the code or text the model wrote verbatim 
>>>>> or used it as scaffolding and then heavily modified everything, I think 
>>>>> we'll be in a pretty good spot.
>>>>> 
>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>> 
>>>>>>> 2. Models that do not do output filtering to restrict the reproduction 
>>>>>>> of training data unless the tool can ensure the output is license 
>>>>>>> compatible?
>>>>>>> 
>>>>>>> 2 would basically prohibit locally run models.
>>>>>> 
>>>>>> 
>>>>>> I am not for this for the reasons listed above. There isn’t a difference 
>>>>>> between this and a contributor copying code and sending it our way. We 
>>>>>> still need to validate that the code can be accepted.
>>>>>> 
>>>>>> We also have the issue of this being a broad stroke. If the user 
>>>>>> asked a model to write a test for code the human wrote, do we reject 
>>>>>> the contribution because they used a local model? This poses very little 
>>>>>> copyright risk, yet our policy would now reject it.
>>>>>> 
>>>>>> Sent from my iPhone
>>>>>> 
>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 2. Models that do not do output filtering to restrict the reproduction 
>>>>>>> of training data unless the tool can ensure the output is license 
>>>>>>> compatible?
>>>>>>> 
>>>>>>> 2 would basically prohibit locally run models.
>>>>> 
>>> 
>
