+1 to Patrick's proposal.

On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
> I just did some review of all the case law around copyright and AI
> code. So far, every claim has been dismissed. There are some other
> cases, like NYTimes, which have more merit and are proceeding.
>
> Which leads me to the opinion that this is feeling like a premature
> optimization. Somebody creating a PR should not have to also submit an
> SBOM, which is essentially what we’re asking. It’s undue burden and
> friction on the process when we should be looking for ways to reduce
> friction.
>
> My proposal is no disclosures required.
>
> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>
>> According to the thread, the disclosure is for legal purposes. For
>> example, the patch is not produced by OpenAI’s service. I think having
>> the discussion to clarify AI usage in the projects is meaningful. I
>> guess many are hesitating because of the lack of clarity in the area.
>>
>> > I don’t believe or agree with us assuming we should do this for
>> every PR
>>
>> I am with you, David. Updating the mailing list for PRs is
>> overwhelming for both the author and the community.
>>
>> I also do not feel co-author is the best place.
>>
>> - Yifan
>>
>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com>
>> wrote:
>>
>>> This is starting to get ridiculous. Disclosure statements on exactly
>>> how a problem was solved? What’s next? Time cards?
>>>
>>> It’s time to accept the world as it is. AI is in the coding toolbox
>>> now, just like IDEs, linters, and code formatters. Some may not like
>>> using them; some may love using them. What matters is that a problem
>>> was solved and that the code matches whatever quality standard the
>>> project upholds, which should be enforced by testing and code
>>> reviews.
>>>
>>> Patrick
>>>
>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com>
>>> wrote:
>>>
>>>> David is disclosing it in the mailing list and the GH page. Should
>>>> the disclosure be persisted in the commit?
>>>>
>>>> Someone asked me to update the ML, but I don’t believe or agree
>>>> with us assuming we should do this for every PR; personally,
>>>> storing this in the PR description is fine to me, as you are
>>>> telling the reviewers (who you need to communicate this to).
>>>>
>>>> I’d say we can use the co-authored part of our commit messages to
>>>> disclose the actual AI that was used?
>>>>
>>>> Heh... I kinda feel dirty doing that… No one does that when they
>>>> take something from a blog or Stack Overflow, but when you do that
>>>> you should still attribute by linking… which I guess is what
>>>> Co-Authored does?
>>>>
>>>> I don’t know… feels dirty...
>>>>
>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <
>>>> conta...@bernardobotella.com> wrote:
>>>>
>>>> That’s a great point. I’d say we can use the co-authored part of
>>>> our commit messages to disclose the actual AI that was used?
>>>>
>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>>
>>>> Curious, what are the good ways to disclose the information?
>>>>
>>>> > All of which comes back to: if people disclose if they used AI,
>>>> what models, and whether they used the code or text the model wrote
>>>> verbatim or used it as scaffolding and then heavily modified
>>>> everything, I think we’ll be in a pretty good spot.
>>>>
>>>> David is disclosing it in the mailing list and the GH page. Should
>>>> the disclosure be persisted in the commit?
>>>>
>>>> - Yifan
>>>>
>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com>
>>>> wrote:
>>>>
>>>>> Sent out this patch that was written 100% by Claude:
>>>>> https://github.com/apache/cassandra/pull/4266
>>>>>
>>>>> Claude’s license doesn’t have issues with the current ASF policy
>>>>> as far as I can tell. If you look at the patch, it’s very clear
>>>>> there isn’t any copyrighted material (it’s gluing together C*
>>>>> classes).
>>>>>
>>>>> I could have written this myself, but I had to focus on code
>>>>> reviews and also needed this patch out, so I asked Claude to write
>>>>> it for me so I could focus on reviews. I have reviewed it myself,
>>>>> and it’s basically the same code I would have written (notice how
>>>>> small and focused the patch is; larger stuff doesn’t normally pass
>>>>> my peer review).
>>>>>
>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com>
>>>>> wrote:
>>>>>
>>>>> +1 to what Josh said
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org>
>>>>> wrote:
>>>>>
>>>>> Did some more digging. Apparently the way a lot of
>>>>> headline-grabbers have been making models reproduce code verbatim
>>>>> is to prompt them with dozens of verbatim tokens of copyrighted
>>>>> code as input, where completion is then very heavily weighted to
>>>>> regurgitate the initial implementation. Which makes sense; if you
>>>>> copy/paste 100 lines of copyrighted code, the statistically likely
>>>>> completion for that will be the initial implementation.
>>>>>
>>>>> For local LLMs, verbatim reproduction is *differently* but
>>>>> apparently comparably unlikely, because they have far fewer
>>>>> parameters (32B vs. 671B for Deepseek, for instance) relative to a
>>>>> pre-training corpus of trillions of tokens (30T in the case of
>>>>> Qwen3-32B, for instance), so the individual tokens from the
>>>>> copyrighted material are highly unlikely to actually be *stored*
>>>>> in the model to be reproduced, and certainly not in sequence. They
>>>>> don’t have the post-generation checks claimed by the SOTA models,
>>>>> but are apparently considered in the "< 1 in 10,000 completions
>>>>> will generate copyrighted code" territory.
>>>>>
>>>>> When asked a human language prompt, or a multi-agent pipelined
>>>>> "still human language but from your architect agent" prompt, the
>>>>> likelihood of producing a string of copyrighted code in that
>>>>> manner is statistically very, very low. I think we’re at far more
>>>>> risk of contributors copy/pasting Stack Overflow or code from
>>>>> other projects than we are of modern genAI models producing blocks
>>>>> of copyrighted code.
>>>>>
>>>>> All of which comes back to: if people disclose if they used AI,
>>>>> what models, and whether they used the code or text the model
>>>>> wrote verbatim or used it as scaffolding and then heavily modified
>>>>> everything, I think we’ll be in a pretty good spot.
>>>>>
>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>
>>>>> 2. Models that do not do output filtering to restrict the
>>>>> reproduction of training data unless the tool can ensure the
>>>>> output is license compatible?
>>>>>
>>>>> 2 would basically prohibit locally run models.
>>>>>
>>>>> I am not for this, for the reasons listed above. There isn’t a
>>>>> difference between this and a contributor copying code and sending
>>>>> it our way. We still need to validate that the code can be
>>>>> accepted.
>>>>>
>>>>> We also have the issue of this being a broad stroke. If the user
>>>>> asked a model to write a test for code the human wrote, do we
>>>>> reject the contribution because they used a local model? This
>>>>> poses very little copyright risk, yet our policy would now reject
>>>>> it.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws>
>>>>> wrote:
>>>>>
>>>>> 2. Models that do not do output filtering to restrict the
>>>>> reproduction of training data unless the tool can ensure the
>>>>> output is license compatible?
>>>>>
>>>>> 2 would basically prohibit locally run models.
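
For reference, the "co-authored part" discussed upthread is Git's
Co-authored-by commit trailer. A minimal sketch of what such a
disclosure could look like follows; the subject line, body text, model
name, and email address here are illustrative only, not an agreed
project convention:

    Add snapshot cleanup on node decommission

    Initial implementation generated with Claude; reviewed and verified
    by the committer before submission.

    Co-authored-by: Claude <noreply@anthropic.com>

Git treats "Key: value" lines in the final paragraph of a commit message
as trailers, and GitHub attributes the commit to each Co-authored-by
entry, so a disclosure recorded this way travels with the commit without
any extra tooling.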