According to the thread, the disclosure is for legal purposes; for example, it establishes that the patch was not produced by OpenAI's service. I think having this discussion to clarify AI usage in the project is meaningful. I suspect many contributors are hesitating because of the lack of clarity in this area.
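One concrete option, if the co-authored-by idea discussed below is adopted, would be to persist the disclosure as a git commit trailer. A minimal sketch follows; the ticket number, summary, and attribution address are made up for illustration, while the "Co-authored-by: Name <email>" trailer format itself is the convention git and GitHub already recognize:

    CASSANDRA-12345: Example patch summary

    Initial implementation drafted by an LLM (Claude); reviewed, tested,
    and modified by the committer before submission.

    Co-authored-by: Claude <claude@example.com>

GitHub parses that trailer and lists the co-author on the commit, so the disclosure would live in history and be greppable, without requiring a mailing-list post per PR.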
> I don’t believe or agree with us assuming we should do this for every PR

I am with you, David. Updating the mailing list for PRs is overwhelming for
both the author and the community. I also do not feel co-author is the best
place.

- Yifan

On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com> wrote:

> This is starting to get ridiculous. Disclosure statements on exactly how a
> problem was solved? What’s next? Time cards?
>
> It’s time to accept the world as it is. AI is in the coding toolbox now,
> just like IDEs, linters and code formatters. Some may not like using them,
> some may love using them. What matters is that a problem was solved and
> the code matches whatever quality standard the project upholds, which
> should be enforced by testing and code reviews.
>
> Patrick
>
> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> wrote:
>
>> David is disclosing it in the mailing list and the GH page. Should the
>> disclosure be persisted in the commit?
>>
>> Someone asked me to update the ML, but I don’t believe or agree with us
>> assuming we should do this for every PR; personally, storing this in the
>> PR description is fine to me, as you are telling the reviewers (who you
>> need to communicate this to).
>>
>> I’d say we can use the co-authored part of our commit messages to
>> disclose the actual AI that was used?
>>
>> Heh... I kinda feel dirty doing that… No one does that when they take
>> something from a blog or Stack Overflow, but when you do that you should
>> still attribute by linking… which I guess is what Co-Authored does?
>>
>> I don’t know… feels dirty...
>>
>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella
>> <conta...@bernardobotella.com> wrote:
>>
>> That’s a great point. I’d say we can use the co-authored part of our
>> commit messages to disclose the actual AI that was used?
>>
>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>
>> Curious, what are the good ways to disclose the information?
>>
>> > All of which comes back to: if people disclose whether they used AI,
>> > what models, and whether they used the code or text the model wrote
>> > verbatim or used it as scaffolding and then heavily modified
>> > everything, I think we'll be in a pretty good spot.
>>
>> David is disclosing it in the mailing list and the GH page. Should the
>> disclosure be persisted in the commit?
>>
>> - Yifan
>>
>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:
>>
>>> Sent out this patch that was written 100% by Claude:
>>> https://github.com/apache/cassandra/pull/4266
>>>
>>> Claude’s license doesn’t have issues with the current ASF policy as far
>>> as I can tell. If you look at the patch, it’s very clear there isn’t
>>> any copyrighted material (it’s gluing together C* classes).
>>>
>>> I could have written this myself, but I had to focus on code reviews
>>> and also needed this patch out, so I asked Claude to write it for me so
>>> I could focus on reviews. I have reviewed it myself and it’s basically
>>> the same code I would have written (notice how small and focused the
>>> patch is; larger stuff doesn’t normally pass my peer review).
>>>
>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>>>
>>> +1 to what Josh said
>>> Sent from my iPhone
>>>
>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>
>>> Did some more digging. Apparently the way a lot of headline-grabbers
>>> have been making models reproduce code verbatim is to prompt them with
>>> dozens of verbatim tokens of copyrighted code as input, where the
>>> completion is then very heavily weighted to regurgitate the initial
>>> implementation. Which makes sense; if you copy/paste 100 lines of
>>> copyrighted code, the statistically likely completion for that input is
>>> that initial implementation.
>>>
>>> For local LLMs, the risk of verbatim reproduction arises *differently*
>>> but is apparently comparably unlikely: they have far fewer parameters
>>> (32B vs. 671B for DeepSeek, for instance) relative to a pre-training
>>> corpus of trillions of tokens (30T in the case of Qwen3-32B), so the
>>> individual tokens from the copyrighted material are highly unlikely to
>>> actually be *stored* in the model to be reproduced, and certainly not
>>> in sequence. They don’t have the post-generation checks claimed by the
>>> SOTA models, but are apparently considered in the "< 1 in 10,000
>>> completions will generate copyrighted code" territory.
>>>
>>> When given a human-language prompt, or a multi-agent pipelined "still
>>> human language but from your architect agent" prompt, the likelihood of
>>> producing a string of copyrighted code in that manner is statistically
>>> very, very low. I think we’re at far more risk of contributors
>>> copy/pasting Stack Overflow or code from other projects than we are of
>>> modern genAI models producing blocks of copyrighted code.
>>>
>>> All of which comes back to: if people disclose whether they used AI,
>>> what models, and whether they used the code or text the model wrote
>>> verbatim or used it as scaffolding and then heavily modified
>>> everything, I think we’ll be in a pretty good spot.
>>>
>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>
>>> 2. Models that do not do output filtering to restrict the reproduction
>>> of training data unless the tool can ensure the output is license
>>> compatible?
>>>
>>> 2 would basically prohibit locally run models.
>>>
>>> I am not for this, for the reasons listed above. There isn’t a
>>> difference between this and a contributor copying code and sending it
>>> our way. We still need to validate that the code can be accepted.
>>>
>>> We also have the issue of this being a broad stroke. If a user asked a
>>> model to write a test for code the human wrote, do we reject the
>>> contribution because they used a local model? That poses very little
>>> copyright risk, yet our policy would now reject it.
>>>
>>> Sent from my iPhone
>>>
>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> 2. Models that do not do output filtering to restrict the reproduction
>>> of training data unless the tool can ensure the output is license
>>> compatible?
>>>
>>> 2 would basically prohibit locally run models.
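Coming back to Josh's criteria above (whether AI was used, which models, and verbatim output versus heavily modified scaffolding), those three facts could be captured by a short template in the PR description, which is where David suggested keeping the disclosure anyway. This is only a sketch of a possible format, not anything agreed upon:

    AI disclosure
      Model(s): e.g. Claude, or a locally run Qwen3-32B
      Extent:   verbatim output | scaffolding, heavily modified | tests only
      Review:   reviewed and tested by the author before submission

Keeping it this small avoids the "time cards" overhead Patrick objected to while still giving reviewers the provenance information they need.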