+1 to Patrick's proposal.

On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
> I just did some review of all the case law around copyright and AI
> code. So far, every claim has been dismissed. There are some other
> cases, like NYTimes, which have more merit and are proceeding.
>
> Which leads me to the opinion that this is feeling like a premature
> optimization. Somebody creating a PR should not have to also submit an
> SBOM, which is essentially what we’re asking. It’s undue burden and
> friction on the process when we should be looking for ways to reduce
> friction.
>
> My proposal is no disclosures required.
>
> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>
>> According to the thread, the disclosure is for legal purposes. For
>> example, the patch is not produced by OpenAI’s service. I think having
>> the discussion to clarify AI usage in the projects is meaningful. I
>> guess many are hesitating because of the lack of clarity in the area.
>>
>> > I don’t believe or agree with us assuming we should do this for
>> every PR
>>
>> I am with you, David. Updating the mailing list for PRs is
>> overwhelming for both the author and the community.
>>
>> I also do not feel co-author is the best place.
>>
>> - Yifan
>>
>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com>
>> wrote:
>>
>>> This is starting to get ridiculous. Disclosure statements on exactly
>>> how a problem was solved? What’s next? Time cards?
>>>
>>> It’s time to accept the world as it is. AI is in the coding toolbox
>>> now, just like IDEs, linters, and code formatters. Some may not like
>>> using them; some may love using them. What matters is that a problem
>>> was solved and that the code matches whatever quality standard the
>>> project upholds, which should be enforced by testing and code
>>> reviews.
>>>
>>> Patrick
>>>
>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com>
>>> wrote:
>>>
>>>> David is disclosing it in the mailing list and the GH page. Should
>>>> the disclosure be persisted in the commit?
>>>>
>>>> Someone asked me to update the ML, but I don’t believe or agree
>>>> with us assuming we should do this for every PR; personally,
>>>> storing this in the PR description is fine to me, as you are
>>>> telling the reviewers (who you need to communicate this to).
>>>>
>>>> I’d say we can use the co-authored part of our commit messages to
>>>> disclose the actual AI that was used?
>>>>
>>>> Heh... I kinda feel dirty doing that… No one does that when they
>>>> take something from a blog or Stack Overflow, but when you do that
>>>> you should still attribute by linking… which I guess is what
>>>> Co-Authored does?
>>>>
>>>> I don’t know… feels dirty...
>>>>
>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <
>>>> conta...@bernardobotella.com> wrote:
>>>>
>>>> That’s a great point. I’d say we can use the co-authored part of
>>>> our commit messages to disclose the actual AI that was used?
>>>>
>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>>
>>>> Curious, what are the good ways to disclose the information?
>>>>
>>>> > All of which comes back to: if people disclose if they used AI,
>>>> what models, and whether they used the code or text the model wrote
>>>> verbatim or used it as scaffolding and then heavily modified
>>>> everything, I think we’ll be in a pretty good spot.
>>>>
>>>> David is disclosing it in the mailing list and the GH page. Should
>>>> the disclosure be persisted in the commit?
>>>>
>>>> - Yifan
>>>>
>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com>
>>>> wrote:
>>>>
>>>>> Sent out this patch that was written 100% by Claude:
>>>>> https://github.com/apache/cassandra/pull/4266
>>>>>
>>>>> Claude’s license doesn’t have issues with the current ASF policy
>>>>> as far as I can tell. If you look at the patch, it’s very clear
>>>>> there isn’t any copyrighted material (it’s gluing together C*
>>>>> classes).
>>>>>
>>>>> I could have written this myself, but I had to focus on code
>>>>> reviews and also needed this patch out, so I asked Claude to write
>>>>> it for me so I could focus on reviews. I have reviewed it myself,
>>>>> and it’s basically the same code I would have written (notice how
>>>>> small and focused the patch is; larger stuff doesn’t normally pass
>>>>> my peer review).
>>>>>
>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com>
>>>>> wrote:
>>>>>
>>>>> +1 to what Josh said
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org>
>>>>> wrote:
>>>>>
>>>>> Did some more digging. Apparently the way a lot of
>>>>> headline-grabbers have been making models reproduce code verbatim
>>>>> is to prompt them with dozens of verbatim tokens of copyrighted
>>>>> code as input, where completion is then very heavily weighted to
>>>>> regurgitate the initial implementation. Which makes sense; if you
>>>>> copy/paste 100 lines of copyrighted code, the statistically likely
>>>>> completion for that will be the initial implementation.
>>>>>
>>>>> For local LLMs, verbatim reproduction is *differently* but
>>>>> apparently comparably unlikely, because they have far fewer
>>>>> parameters (32B vs. 671B for Deepseek, for instance) relative to a
>>>>> pre-training corpus of trillions of tokens (30T in the case of
>>>>> Qwen3-32B, for instance), so the individual tokens from the
>>>>> copyrighted material are highly unlikely to actually be *stored*
>>>>> in the model to be reproduced, and certainly not in sequence. They
>>>>> don’t have the post-generation checks claimed by the SOTA models,
>>>>> but are apparently considered in the "< 1 in 10,000 completions
>>>>> will generate copyrighted code" territory.
>>>>>
>>>>> When asked a human language prompt, or a multi-agent pipelined
>>>>> "still human language but from your architect agent" prompt, the
>>>>> likelihood of producing a string of copyrighted code in that
>>>>> manner is statistically very, very low. I think we’re at far more
>>>>> risk of contributors copy/pasting Stack Overflow or code from
>>>>> other projects than we are of modern genAI models producing blocks
>>>>> of copyrighted code.
>>>>>
>>>>> All of which comes back to: if people disclose if they used AI,
>>>>> what models, and whether they used the code or text the model
>>>>> wrote verbatim or used it as scaffolding and then heavily modified
>>>>> everything, I think we’ll be in a pretty good spot.
>>>>>
>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>
>>>>> 2. Models that do not do output filtering to restrict the
>>>>> reproduction of training data unless the tool can ensure the
>>>>> output is license compatible?
>>>>>
>>>>> 2 would basically prohibit locally run models.
>>>>>
>>>>> I am not for this, for the reasons listed above. There isn’t a
>>>>> difference between this and a contributor copying code and sending
>>>>> it our way. We still need to validate that the code can be
>>>>> accepted.
>>>>>
>>>>> We also have the issue of this being a broad stroke. If the user
>>>>> asked a model to write a test for code the human wrote, do we
>>>>> reject the contribution because they used a local model? This
>>>>> poses very little copyright risk, yet our policy would now reject
>>>>> it.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws>
>>>>> wrote:
>>>>>
>>>>> 2. Models that do not do output filtering to restrict the
>>>>> reproduction of training data unless the tool can ensure the
>>>>> output is license compatible?
>>>>>
>>>>> 2 would basically prohibit locally run models.
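
For reference, the "co-authored part" discussed upthread is Git's
Co-authored-by commit trailer. A minimal sketch of what such a
disclosure could look like follows; the subject line, body text, model
name, and email address here are illustrative only, not an agreed
project convention:

    Add snapshot cleanup on node decommission

    Initial implementation generated with Claude; reviewed and verified
    by the committer before submission.

    Co-authored-by: Claude <noreply@anthropic.com>

Git treats "Key: value" lines in the final paragraph of a commit message
as trailers, and GitHub attributes the commit to each Co-authored-by
entry, so a disclosure recorded this way travels with the commit without
any extra tooling.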