Sent out this patch that was written 100% by Claude: https://github.com/apache/cassandra/pull/4266
Claude's license doesn't have issues with the current ASF policy as far as I can tell. If you look at the patch, it's very clear there isn't any copyrighted material (it's gluing together C* classes). I could have written this myself, but I had to focus on code reviews and also needed this patch out, so I asked Claude to write it for me so I could focus on reviews. I have reviewed it myself, and it's basically the same code I would have written (notice how small and focused the patch is; larger stuff doesn't normally pass my peer review).

> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>
> +1 to what Josh said
> Sent from my iPhone
>
>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>
>> Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input, where completion is then very heavily weighted to regurgitate the initial implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion for that will be that initial implementation.
>>
>> For local LLMs, the likelihood of verbatim reproduction arises differently but is apparently comparably unlikely: they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to their pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to actually be stored in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered in the "< 1 in 10,000 completions will generate copyrighted code" territory.
>>
>> When given a human-language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk of contributors copy/pasting Stack Overflow or code from other projects than we are of modern genAI models producing blocks of copyrighted code.
>>
>> All of which comes back to: if people disclose whether they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.
>>
>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>
>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>
>>>> 2 would basically prohibit locally run models.
>>>
>>> I am not for this for the reasons listed above. There isn't a difference between this and a contributor copying code and sending it our way. We still need to validate that the code can be accepted.
>>>
>>> We also have the issue of this being a broad stroke. If the user asked a model to write a test for code the human wrote, do we reject the contribution because they used a local model? This poses very little copyright risk, yet our policy would now reject it.
>>>
>>> Sent from my iPhone
>>>
>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>
>>>> 2 would basically prohibit locally run models.