Hello,
The world has moved forward, that is a fact. At the same time, most
people pushing their code to GitHub, or other repository hosting
solutions, rarely include license information and do not grant
explicit patent rights.
I agree that forbidding specific tools sounds ridiculous; however,
none of the tools you mentioned becomes "creative" and does the
"authoring" part for you, nor do they imply extra licensing terms on
what you authored.
It is more of a legal concern which impacts the project and the ASF
as a whole. There are two main aspects - copyright and patent
licenses, which are covered by the ICLA in points 2-4 and 5.
If work submitted by a contributor does not satisfy these for any
reason and gets accepted, it poses a risk to the project and the ASF.
Maybe to end users too, depending on the legal system they reside in
(IANAL).
PS. I'm not on either side of this discussion.
Cheers,
Łukasz
On 7/23/25 22:51, Patrick McFadin wrote:
This is starting to get ridiculous. Disclosure statements on exactly how
a problem was solved? What’s next? Time cards?
It’s time to accept the world as it is. AI is in the coding toolbox now
just like IDEs, linters and code formatters. Some may not like using
them, some may love using them. What matters is that the problem was
solved and that the code matches whatever quality standard the
project upholds, which should be enforced by testing and code reviews.
Patrick
On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com
<mailto:dcapw...@apple.com>> wrote:
> David is disclosing it in the maillist and the GH page. Should the
> disclosure be persisted in the commit?
Someone asked me to update the ML, but I don’t believe or agree with
us assuming we should do this for every PR; personally storing this
in the PR description is fine to me as you are telling the reviewers
(who you need to communicate this to).
> I’d say we can use the co-authored part of our commit messages to
> disclose the actual AI that was used?
Heh... I kinda feel dirty doing that… No one does that when they
take something from a blog or Stack Overflow, but when you do that
you should still attribute by linking… which I guess is what Co-
Authored does?
I don’t know… feels dirty...
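Something like this in the commit footer, I suppose (the address here
is just an illustration, I don't know what the "official" one would
be):

    Co-authored-by: Claude <noreply@anthropic.com>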
On Jul 23, 2025, at 11:19 AM, Bernardo Botella
<conta...@bernardobotella.com
<mailto:conta...@bernardobotella.com>> wrote:
That’s a great point. I’d say we can use the co-authored part of
our commit messages to disclose the actual AI that was used?
On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com
<mailto:yc25c...@gmail.com>> wrote:
Curious, what are the good ways to disclose the information?
> All of which comes back to: if people disclose whether they used
> AI, what models, and whether they used the code or text the model
> wrote verbatim or used it as a scaffolding and then heavily
> modified everything, I think we'll be in a pretty good spot.
David is disclosing it in the maillist and the GH page. Should
the disclosure be persisted in the commit?
- Yifan
On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com
<mailto:dcapw...@apple.com>> wrote:
Sent out this patch that was written 100% by Claude:
https://github.com/apache/cassandra/pull/4266
Claude's license doesn't have issues with the current ASF policy as
far as I can tell. If you look at the patch it's very clear there
isn't any copyrighted material (it's gluing together C* classes).
I could have written this myself, but I had to focus on code reviews
and also needed this patch out, so I asked Claude to write it for me
so I could focus on reviews. I have reviewed it myself and it's
basically the same code I would have written (notice how small and
focused the patch is, larger stuff doesn't normally pass my peer
review).
On Jun 25, 2025, at 2:37 PM, David Capwell
<dcapw...@apple.com <mailto:dcapw...@apple.com>> wrote:
+1 to what Josh said
Sent from my iPhone
On Jun 25, 2025, at 1:18 PM, Josh McKenzie
<jmcken...@apache.org <mailto:jmcken...@apache.org>> wrote:
Did some more digging. Apparently the way a lot of
headline-grabbers have been making models reproduce code
verbatim is to prompt them with dozens of verbatim tokens
of copyrighted code as input where completion is then very
heavily weighted to regurgitate the initial implementation.
Which makes sense; if you copy/paste 100 lines of
copyrighted code, the statistically likely completion for
that will be that initial implementation.
For local LLMs, the likelihood of verbatim reproduction arises
/differently/ but is apparently comparably unlikely, because they
have far fewer parameters (32B vs. 671B for Deepseek for instance)
relative to their pre-training corpus of trillions of tokens (30T in
the case of Qwen3-32B for instance), so the individual tokens from
the copyrighted material are highly unlikely to actually be /stored/
in the model to be reproduced, and certainly not in sequence. They
don't have the post-generation checks claimed by the SOTA models, but
are apparently considered to be in the "< 1 in 10,000 completions
will generate copyrighted code" territory.
When given a human-language prompt, or a multi-agent pipelined
"still human language but from your architect agent" prompt, the
likelihood of producing a string of copyrighted code in that manner
is statistically very, very low. I think we're at far more risk of
contributors copy/pasting code from Stack Overflow or from other
projects than we are from modern genAI models producing blocks of
copyrighted code.
All of which comes back to: if people disclose whether they used AI,
what models, and whether they used the code or text the model wrote
verbatim or used it as a scaffolding and then heavily modified
everything, I think we'll be in a pretty good spot.
On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
> 2. Models that do not do output filtering to restrict the
> reproduction of training data unless the tool can ensure the output
> is license compatible?
> 2 would basically prohibit locally run models.
I am not for this for the reasons listed above. There isn't a
difference between this and a contributor copying code and sending it
our way. We still need to validate the code can be accepted.
We also have the issue of having this be a broad stroke. If the user
asked a model to write a test for the code the human wrote, do we
reject the contribution because they used a local model? This poses
very little copyright risk, yet our policy would now reject it.
Sent from my iPhone
On Jun 25, 2025, at 9:10 AM, Ariel Weisberg
<ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
2. Models that do not do output filtering to restrict the
reproduction of training data unless the tool can ensure
the output is license compatible?
2 would basically prohibit locally run models.