Sent out this patch that was written 100% by Claude: https://github.com/apache/cassandra/pull/4266
Claude's license doesn't have issues with the current ASF policy as far as I can tell. If you look at the patch, it's very clear there isn't any copyrighted material (it's gluing together C* classes). I could have written this myself, but I had to focus on code reviews and also needed this patch out, so I asked Claude to write it for me so I could focus on reviews. I have reviewed it myself, and it's basically the same code I would have written (notice how small and focused the patch is; larger stuff doesn't normally pass my peer review).

> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>
> +1 to what Josh said
> Sent from my iPhone
>
>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>
>> Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input, where completion is then very heavily weighted to regurgitate the initial implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion for that will be that initial implementation.
>>
>> For local LLMs, the likelihood of verbatim reproduction arises differently but is apparently comparably unlikely: they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to their pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to actually be stored in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered in the "< 1 in 10,000 completions will generate copyrighted code" territory.
>>
>> When given a human-language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk of contributors copy/pasting Stack Overflow or code from other projects than we are of modern genAI models producing blocks of copyrighted code.
>>
>> All of which comes back to: if people disclose whether they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.
>>
>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>
>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>
>>>> 2 would basically prohibit locally run models.
>>>
>>> I am not for this for the reasons listed above. There isn't a difference between this and a contributor copying code and sending it our way. We still need to validate that the code can be accepted.
>>>
>>> We also have the issue of this being a broad stroke. If the user asked a model to write a test for code the human wrote, do we reject the contribution because they used a local model? This poses very little copyright risk, yet our policy would now reject it.
>>>
>>> Sent from my iPhone
>>>
>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>
>>>> 2 would basically prohibit locally run models.