Re: Accepting AI generated contributions

Patrick McFadin Wed, 23 Jul 2025 11:53:39 -0700

This is starting to get ridiculous. Disclosure statements on exactly how a
problem was solved? What’s next? Time cards?


It’s time to accept the world as it is. AI is in the coding toolbox now
just like IDEs, linters and code formatters. Some may not like using them,
some may love using them. What matters is that a problem was solved and that
the code meets whatever quality standard the project upholds, which should be
enforced by testing and code reviews.

Patrick

On Wed, Jul 23, 2025 at 11:31 AM David Capwell <[email protected]> wrote:

> David is disclosing it in the maillist and the GH page. Should the
> disclosure be persisted in the commit?
>
>
> Someone asked me to update the ML, but I don’t agree that we should do
> this for every PR; personally, storing this in the PR description is fine
> with me, as that tells the reviewers (the people you need to communicate
> this to).
>
>
> I’d say we can use the co-authored part of our commit messages to disclose
> the actual AI that was used?
>
>
> Heh... I kinda feel dirty doing that… No one does that when they take
> something from a blog or Stack Overflow, but when you do that you should
> still attribute by linking… which I guess is what Co-Authored does?
>
> I don’t know… feels dirty...
>
>
> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <
> [email protected]> wrote:
>
> That’s a great point. I’d say we can use the co-authored part of our
> commit messages to disclose the actual AI that was used?
>
>
>
> On Jul 23, 2025, at 10:57 AM, Yifan Cai <[email protected]> wrote:
>
> Curious, what are the good ways to disclose the information?
>
> > All of which comes back to: if people disclose if they used AI, what
> models, and whether they used the code or text the model wrote verbatim or
> used it as a scaffolding and then heavily modified everything I think we'll
> be in a pretty good spot.
>
> David is disclosing it in the maillist and the GH page. Should the
> disclosure be persisted in the commit?
>
> - Yifan
>
> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <[email protected]> wrote:
>
>> Sent out this patch that was written 100% by Claude:
>> https://github.com/apache/cassandra/pull/4266
>>
>> Claude’s license doesn’t have issues with the current ASF policy as far as
>> I can tell.  If you look at the patch it’s very clear there isn’t any
>> copyrighted material (it’s gluing together C* classes).
>>
>> I could have written this myself but I had to focus on code reviews and
>> also needed this patch out, so asked Claude to write it for me so I could
>> focus on reviews.  I have reviewed it myself and it’s basically the same
>> code I would have written (notice how small and focused the patch is;
>> larger stuff doesn’t normally pass my peer review).
>>
>> On Jun 25, 2025, at 2:37 PM, David Capwell <[email protected]> wrote:
>>
>> +1 to what Josh said
>> Sent from my iPhone
>>
>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <[email protected]> wrote:
>>
>> 
>> Did some more digging. Apparently the way a lot of headline-grabbers have
>> been making models reproduce code verbatim is to prompt them with dozens of
>> verbatim tokens of copyrighted code as input where completion is then very
>> heavily weighted to regurgitate the initial implementation. Which makes
>> sense; if you copy/paste 100 lines of copyrighted code, the statistically
>> likely completion for that will be that initial implementation.
>>
>> For local LLMs, verbatim reproduction is apparently comparably unlikely,
>> but for a *different* reason: they have far fewer parameters (32B vs. 671B
>> for DeepSeek, for instance) relative to a pre-training corpus of trillions
>> of tokens (30T in the case of Qwen3-32B, for instance), so the individual
>> tokens from the copyrighted material are highly unlikely to be actually
>> *stored* in the model to be reproduced, and certainly not in sequence.
>> They don't have the post-generation checks claimed by the SOTA models, but
>> are apparently considered in the "< 1 in 10,000 completions will generate
>> copyrighted code" territory.
>>
>> When given a human-language prompt, or a multi-agent pipelined "still
>> human language but from your architect agent" prompt, the likelihood of
>> producing a string of copyrighted code in that manner is statistically
>> very, very low. I think we're at far more risk of contributors copy/pasting
>> Stack Overflow or code from other projects than we are from modern genAI
>> models producing blocks of copyrighted code.
>>
>> All of which comes back to: if people disclose if they used AI, what
>> models, and whether they used the code or text the model wrote verbatim or
>> used it as a scaffolding and then heavily modified everything I think we'll
>> be in a pretty good spot.
>>
>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>
>>
>> 2. Models that do not do output filtering to restrict the reproduction of
>> training data unless the tool can ensure the output is license compatible?
>>
>> 2 would basically prohibit locally run models.
>>
>>
>> I am not for this for the reasons listed above. There isn’t a difference
>> between this and a contributor copying code and sending it our way. We still
>> need to validate that the code can be accepted.
>>
>> We also have the issue of this being a broad stroke. If the user
>> asked a model to write a test for the code the human wrote, would we reject
>> the contribution because they used a local model? This poses very little
>> copyright risk, yet our policy would now reject it.
>>
>> Sent from my iPhone
>>
>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <[email protected]> wrote:
>>
>> 2. Models that do not do output filtering to restrict the reproduction of
>> training data unless the tool can ensure the output is license compatible?
>>
>> 2 would basically prohibit locally run models.
>>
>>
>>
>>
>
>
