Hello,
The world has moved forward, that is a fact. At the same time, most
people pushing their code to GitHub, or other repository hosting
solutions, rarely include license information and do not grant
explicit patent rights.
I agree that forbidding specific tools sounds ridiculous; however,
none of the tools you mentioned becomes "creative" and does the
"authoring" part for you, nor do they imply extra licensing terms on
what you authored.
It is more of a legal concern which impacts the project and the ASF
as a whole. There are two main aspects - copyright and patent
licenses, which are covered by the ICLA in points 2-4 and 5.
If work submitted by a contributor does not satisfy these for any
reason and gets accepted, it poses a risk to the project and the ASF.
Maybe to end users too, depending on the legal system they reside in
(IANAL).
PS. I'm not on either side of this discussion.
Cheers,
Łukasz
On 7/23/25 22:51, Patrick McFadin wrote:
This is starting to get ridiculous. Disclosure statements on exactly how
a problem was solved? What’s next? Time cards?
It’s time to accept the world as it is. AI is in the coding toolbox now
just like IDEs, linters and code formatters. Some may not like using
them, some may love using them. What matters is that the problem was
solved and that the code matches whatever quality standard the
project upholds, which should be enforced by testing and code reviews.
Patrick
On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com
<mailto:dcapw...@apple.com>> wrote:
> David is disclosing it in the maillist and the GH page. Should the
> disclosure be persisted in the commit?
Someone asked me to update the ML, but I don’t believe or agree with
us assuming we should do this for every PR; personally storing this
in the PR description is fine to me as you are telling the reviewers
(who you need to communicate this to).
> I’d say we can use the co-authored part of our commit messages to
> disclose the actual AI that was used?
Heh... I kinda feel dirty doing that… No one does that when they
take something from a blog or Stack Overflow, but when you do that
you should still attribute by linking… which I guess is what Co-
Authored does?
I don’t know… feels dirty...
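Something like this in the commit footer, I suppose (the address here
is just an illustration, I don't know what the "official" one would
be):

    Co-authored-by: Claude <noreply@anthropic.com>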
On Jul 23, 2025, at 11:19 AM, Bernardo Botella
<conta...@bernardobotella.com
<mailto:conta...@bernardobotella.com>> wrote:
That’s a great point. I’d say we can use the co-authored part of
our commit messages to disclose the actual AI that was used?
On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com
<mailto:yc25c...@gmail.com>> wrote:
Curious, what are the good ways to disclose the information?
> All of which comes back to: if people disclose whether they used
> AI, what models, and whether they used the code or text the model
> wrote verbatim or used it as a scaffolding and then heavily
> modified everything, I think we'll be in a pretty good spot.
David is disclosing it in the maillist and the GH page. Should
the disclosure be persisted in the commit?
- Yifan
On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com
<mailto:dcapw...@apple.com>> wrote:
Sent out this patch that was written 100% by Claude:
https://github.com/apache/cassandra/pull/4266
Claude's license doesn't have issues with the current ASF policy as
far as I can tell. If you look at the patch it's very clear there
isn't any copyrighted material (it's gluing together C* classes).
I could have written this myself, but I had to focus on code reviews
and also needed this patch out, so I asked Claude to write it for me
so I could focus on reviews. I have reviewed it myself and it's
basically the same code I would have written (notice how small and
focused the patch is, larger stuff doesn't normally pass my peer
review).
On Jun 25, 2025, at 2:37 PM, David Capwell
<dcapw...@apple.com <mailto:dcapw...@apple.com>> wrote:
+1 to what Josh said
Sent from my iPhone
On Jun 25, 2025, at 1:18 PM, Josh McKenzie
<jmcken...@apache.org <mailto:jmcken...@apache.org>> wrote:
Did some more digging. Apparently the way a lot of
headline-grabbers have been making models reproduce code
verbatim is to prompt them with dozens of verbatim tokens
of copyrighted code as input where completion is then very
heavily weighted to regurgitate the initial implementation.
Which makes sense; if you copy/paste 100 lines of
copyrighted code, the statistically likely completion for
that will be that initial implementation.
For local LLMs, the likelihood of verbatim reproduction arises
/differently/ but is apparently comparably unlikely, because they
have far fewer parameters (32B vs. 671B for Deepseek for instance)
relative to their pre-training corpus of trillions of tokens (30T in
the case of Qwen3-32B for instance), so the individual tokens from
the copyrighted material are highly unlikely to actually be /stored/
in the model to be reproduced, and certainly not in sequence. They
don't have the post-generation checks claimed by the SOTA models, but
are apparently considered to be in the "< 1 in 10,000 completions
will generate copyrighted code" territory.
When given a human-language prompt, or a multi-agent pipelined
"still human language but from your architect agent" prompt, the
likelihood of producing a string of copyrighted code in that manner
is statistically very, very low. I think we're at far more risk of
contributors copy/pasting code from Stack Overflow or from other
projects than we are from modern genAI models producing blocks of
copyrighted code.
All of which comes back to: if people disclose whether they used AI,
what models, and whether they used the code or text the model wrote
verbatim or used it as a scaffolding and then heavily modified
everything, I think we'll be in a pretty good spot.
On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
> 2. Models that do not do output filtering to restrict the
> reproduction of training data unless the tool can ensure the output
> is license compatible?
> 2 would basically prohibit locally run models.
I am not for this for the reasons listed above. There isn't a
difference between this and a contributor copying code and sending it
our way. We still need to validate the code can be accepted.
We also have the issue of having this be a broad stroke. If the user
asked a model to write a test for the code the human wrote, do we
reject the contribution because they used a local model? This poses
very little copyright risk, yet our policy would now reject it.
Sent from my iPhone
On Jun 25, 2025, at 9:10 AM, Ariel Weisberg
<ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
2. Models that do not do output filtering to restrict the
reproduction of training data unless the tool can ensure
the output is license compatible?
2 would basically prohibit locally run models.