Re: [DISCUSS] Add JVector as a dependency for CEP-30

Josh McKenzie Fri, 22 Sep 2023 07:44:12 -0700

> I highly doubt liability works like that in all jurisdictions
That's a fantastic point. When speculating there, I overlooked the fact that 
there are literally dozens of legal jurisdictions in which this project is used 
and the foundation operates.


As a PMC let's take this to legal.

On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
> To do that, the Cassandra PMC can open a legal JIRA and ask for a (durable, 
> concrete) opinion.
> 
> 
> On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:
>> 
>>>>  1. my understanding is that with the former the liability rests on the 
>>>> provider of the lib to ensure it's in compliance with their claims to 
>>>> copyright
>> I highly doubt liability works like that in all jurisdictions, even if it 
>> might in some. I can even think of some historic cases related to Linux 
>> where patent trolls went after users of Linux, though I’m not sure where 
>> that got to and I don’t remember all the details.
>> 
>> But anyway, none of us are lawyers and we shouldn’t be depending on this 
>> kind of analysis. At minimum we should invite legal to proffer an opinion on 
>> whether dependencies are a valid loophole to the policy.
>> 
>> 
>> 
>>> On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>> 
>>> 
>>> This discussion of GenAI-generated code should probably be its own DISCUSS 
>>> thread on the mailing list. It applies to all source code we take in and 
>>> accept copyright assignment of, not to jars we depend on, and not only to 
>>> vector-related code contributions.
>>> 
>>>> On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>> 
>>>> So if we're going to chat about GenAI on this thread here, 2 things:
>>>>  1. A dependency we pull in != a code contribution (I am not a lawyer but 
>>>> my understanding is that with the former the liability rests on the 
>>>> provider of the lib to ensure it's in compliance with their claims to 
>>>> copyright and it's not sticky). Easier to transition to a different dep if 
>>>> there's something API compatible or similar.
>>>>  2. With code contributions we take in, we take on some exposure in terms 
>>>> of copyright and infringement. git revert can be painful.
>>>> For this thread, here's an excerpt from the ASF policy:
>>>>> a recommended practice when using generative AI tooling is to use tools 
>>>>> with features that identify any included content that is similar to parts 
>>>>> of the tool’s training data, as well as the license of that content.
>>>>> 
>>>>> Given the above, code generated in whole or in part using AI can be 
>>>>> contributed if the contributor ensures that:
>>>>> 
>>>>>  1. The terms and conditions of the generative AI tool do not place any 
>>>>> restrictions on use of the output that would be inconsistent with the 
>>>>> Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
>>>>>  2. At least one of the following conditions is met:
>>>>>    1. The output is not copyrightable subject matter (and would not be 
>>>>> even if produced by a human)
>>>>>    2. No third party materials are included in the output
>>>>>    3. Any third party materials that are included in the output are being 
>>>>> used with permission (e.g., under a compatible open source license) of 
>>>>> the third party copyright holders and in compliance with the applicable 
>>>>> license terms
>>>>>  3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 
>>>>> are met if the AI tool itself provides sufficient information about 
>>>>> materials that may have been copied, or from code scanning results
>>>>>    1. E.g. AWS CodeWhisperer recently added a feature that provides 
>>>>> notice and attribution
>>>>> When providing contributions authored using generative AI tooling, a 
>>>>> recommended practice is for contributors to indicate the tooling used to 
>>>>> create the contribution. This should be included as a token in the source 
>>>>> control commit message, for example including the phrase “Generated-by
>>>>> 
>>>> 
>>>> I think the real challenge right now is ensuring that the output from an 
>>>> LLM doesn't include a string of tokens that's identical to something in 
>>>> its input training dataset if it's trained on non-permissively licensed 
>>>> inputs. That plus the risk of, at least in the US, the courts landing on 
>>>> the side of saying that not only is the output of generative AI not 
>>>> copyrightable, but that there's legal liability on either the users of the 
>>>> tools or the creators of the models for some kind of copyright 
>>>> infringement. That can be sticky; if we take PRs that end up with that 
>>>> liability exposure, we end up in a place where either the foundation could 
>>>> be legally exposed and/or we'd need to revert some pretty invasive code / 
>>>> changes.
>>>> 
>>>> For example, Microsoft and OpenAI have publicly committed to paying legal 
>>>> fees for people sued for copyright infringement for using their tools: 
>>>> https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view
>>>>  Pretty interesting, and not a step a provider would take in an 
>>>> environment where things were legally clear and settled.
>>>> 
>>>> So while the usage of these things is apparently incredibly pervasive 
>>>> right now, "everybody is doing it" is a pretty high-risk legal defense. :)
>>>> 
>>>> On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
>>>>> 
>>>>> 
>>>>> On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:
>>>>>> 
>>>>>> At some point we have to discuss this, and here’s as good a place as 
>>>>>> any. There’s a great news article published talking about how generative 
>>>>>> AI was used to assist in developing the new vector search feature, which 
>>>>>> is itself really cool. Unfortunately it *sounds* like it runs afoul of 
>>>>>> the ASF legal policy on use for contributions to the project. This 
>>>>>> proposal is to include a dependency, but I’m not sure if that avoids the 
>>>>>> issue, and I’m equally uncertain how much this issue is isolated to the 
>>>>>> dependency (or affects it at all?)
>>>>>> 
>>>>>> Anyway, this is an annoying discussion we need to have at some point, so 
>>>>>> raising it here now so we can figure it out.
>>>>>> 
>>>>>> [1] 
>>>>>> https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
>>>>>> [2] https://www.apache.org/legal/generative-tooling.html
>>>>>> 
>>>>> 
>>>>> 
>>>>> My reading of the ASF's GenAI policy is that any generated work in the 
>>>>> JVector library (and CEP-30?) is not copyrightable, and that makes it 
>>>>> ok for us to include.
>>>>> 
>>>>> If there was a trace to copyrighted work, or the tooling imposed a 
>>>>> copyright or restrictions, we would then have to take that into account.
>>>>
