> I highly doubt liability works like that in all jurisdictions

That's a fantastic point. When speculating there, I overlooked the fact that there are literally dozens of legal jurisdictions in which this project is used and the foundation operates.

As a PMC let's take this to legal.

On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
> To do that, the cassandra PMC can open a legal JIRA and ask for a (durable,
> concrete) opinion.
>
> On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:
>>
>>>> 1. my understanding is that with the former the liability rests on the
>>>> provider of the lib to ensure it's in compliance with their claims to
>>>> copyright
>> I highly doubt liability works like that in all jurisdictions, even if it
>> might in some. I can even think of some historic cases related to Linux
>> where patent trolls went after users of Linux, though I’m not sure where
>> that got to and I don’t remember all the details.
>>
>> But anyway, none of us are lawyers and we shouldn’t be depending on this
>> kind of analysis. At minimum we should invite legal to proffer an opinion on
>> whether dependencies are a valid loophole to the policy.
>>
>>> On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>
>>> This Gen AI generated code use thread should probably be its own mailing
>>> list DISCUSS thread? It applies to all source code we take in, and accept
>>> copyright assignment of, not to jars we depend on and not only to vector
>>> related code contributions.
>>>
>>>> On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>
>>>> So if we're going to chat about GenAI on this thread here, 2 things:
>>>> 1. A dependency we pull in != a code contribution (I am not a lawyer but
>>>> my understanding is that with the former the liability rests on the
>>>> provider of the lib to ensure it's in compliance with their claims to
>>>> copyright and it's not sticky). Easier to transition to a different dep if
>>>> there's something API compatible or similar.
>>>> 2. With code contributions we take in, we take on some exposure in terms
>>>> of copyright and infringement. git revert can be painful.
>>>> For this thread, here's an excerpt from the ASF policy:
>>>>> a recommended practice when using generative AI tooling is to use tools
>>>>> with features that identify any included content that is similar to parts
>>>>> of the tool’s training data, as well as the license of that content.
>>>>>
>>>>> Given the above, code generated in whole or in part using AI can be
>>>>> contributed if the contributor ensures that:
>>>>>
>>>>> 1. The terms and conditions of the generative AI tool do not place any
>>>>>    restrictions on use of the output that would be inconsistent with the
>>>>>    Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
>>>>> 2. At least one of the following conditions is met:
>>>>>    1. The output is not copyrightable subject matter (and would not be
>>>>>       even if produced by a human)
>>>>>    2. No third party materials are included in the output
>>>>>    3. Any third party materials that are included in the output are being
>>>>>       used with permission (e.g., under a compatible open source license) of
>>>>>       the third party copyright holders and in compliance with the applicable
>>>>>       license terms
>>>>> 3. A contributor obtain reasonable certainty that conditions 2.2 or 2.3
>>>>>    are met if the AI tool itself provides sufficient information about
>>>>>    materials that may have been copied, or from code scanning results
>>>>>    1. E.g. AWS CodeWhisperer recently added a feature that provides
>>>>>       notice and attribution
>>>>> When providing contributions authored using generative AI tooling, a
>>>>> recommended practice is for contributors to indicate the tooling used to
>>>>> create the contribution.
>>>>> This should be included as a token in the source
>>>>> control commit message, for example including the phrase “Generated-by
>>>>>
>>>>
>>>> I think the real challenge right now is ensuring that the output from an
>>>> LLM doesn't include a string of tokens that's identical to something in
>>>> its input training dataset if it's trained on non-permissively licensed
>>>> inputs. That plus the risk of, at least in the US, the courts landing on
>>>> the side of saying that not only is the output of generative AI not
>>>> copyrightable, but that there's legal liability on either the users of the
>>>> tools or the creators of the models for some kind of copyright
>>>> infringement. That can be sticky; if we take PR's that end up with that
>>>> liability exposure, we end up in a place where either the foundation could
>>>> be legally exposed and/or we'd need to revert some pretty invasive code /
>>>> changes.
>>>>
>>>> For example, Microsoft and OpenAI have publicly committed to paying legal
>>>> fees for people sued for copyright infringement for using their tools:
>>>> https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view
>>>> Pretty interesting, and not a step a provider would take in an
>>>> environment where things were legally clear and settled.
>>>>
>>>> So while the usage of these things is apparently incredibly pervasive
>>>> right now, "everybody is doing it" is a pretty high risk legal defense. :)
>>>>
>>>> On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
>>>>>
>>>>> On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:
>>>>>>
>>>>>> At some point we have to discuss this, and here’s as good a place as
>>>>>> any.
>>>>>> There’s a great news article published talking about how generative
>>>>>> AI was used to assist in developing the new vector search feature, which
>>>>>> is itself really cool. Unfortunately it *sounds* like it runs afoul of
>>>>>> the ASF legal policy on use for contributions to the project. This
>>>>>> proposal is to include a dependency, but I’m not sure if that avoids the
>>>>>> issue, and I’m equally uncertain how much this issue is isolated to the
>>>>>> dependency (or affects it at all?)
>>>>>>
>>>>>> Anyway, this is an annoying discussion we need to have at some point, so
>>>>>> raising it here now so we can figure it out.
>>>>>>
>>>>>> [1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
>>>>>> [2] https://www.apache.org/legal/generative-tooling.html
>>>>>>
>>>>>
>>>>> My reading of the ASF's GenAI policy is that any generated work in the
>>>>> jvector library (and cep-30 ?) are not copyrightable, and that makes them
>>>>> ok for us to include.
>>>>>
>>>>> If there was a trace to copyrighted work, or the tooling imposed a
>>>>> copyright or restrictions, we would then have to take considerations.
>>>>