Re: Licensing for models and datasets

2024-05-12 Thread Tymon Dąbrowski
AFAIK, our training and data augmentation code will, out of necessity, be
GPLv3.

For the images included in the dataset, our artist suggested CC-BY with a
special clause that would ensure that the images will be CC-BY when used
normally, but allow no credit when used in our AI. (I suggested CC-0, but
the artist said that they wouldn't feel comfortable giving their artworks
with CC-0 to a dataset, so that won't work).
How should we phrase it?
I think the clause should allow for our AI and, for simplicity, other
usages in AI by the Krita team, but not in any other AIs. Or possibly just
for the Smart Inking AI? But how to say it?
Should we get some kind of legal advice or something?


Tiar


On Sat, 30 Mar 2024 at 15:22, Cornelius Schumacher wrote:

> On 26.03.24 17:33, Volker Krause wrote:
> > On Monday, 25 March 2024 15:17:48 CET Halla Rempt wrote:
> >> We're looking into adding an experimental AI-based feature to Krita:
> >> automated inking. That gives us three components, and we're not sure
> >> about the license we should use for two of them: the model and the
> >> dataset. Would CC be best here?
> >
> > Looking at https://community.kde.org/Policies/Licensing_Policy the
> > closest thing would either be "media" files (generalized to "data
> > files") and thus CC-BY-SA (and presumably CC-BY/CC0) or "source code"
> > (xGPL, BSD/MIT).
>
> I don't think we can directly use the current licensing policy for ML
> models and datasets. But I suppose we should discuss extending it to
> cover these use cases as well.
>
> CC-BY or CC-BY-SA are not the best choice for data as their attribution
> requirements can make it impractical to work with data under these
> licenses. There are some good arguments why data should rather not be
> licensed at all
> (https://plus.pli.edu/Details/Details?fq=id:(352066-ATL2)). This would
> suggest using CC0 as the closest practical form of that.
>
> For models, attribution requirements seem to be less of an issue. But as
> Volker described, the copyright situation is quite complicated, and it's
> not yet clear what consequences this will have in the future. From this
> point of view a permissive license could be a good choice, as it is
> unlikely to create problems in the future. As the MIT license is already
> mentioned in the licensing policy, maybe this is the best choice?
>
> In addition to the licensing itself it could also be good to consider
> how to convey more information about the openness of the system. Even if
> it wouldn't make a difference in terms of copyright for the user of a
> model, it still might be preferable to use models which are trained on
> free and open data. Some kind of labeling and making this transparent to
> end users could be a solution to that.
>
> In the context of the Sustainable Software goal we have a bit of
> discussion around the labeling. There are some ongoing efforts, such as
> OSI's attempt to define what Open AI actually should mean
> (https://opensource.org/deepdive), or Nextcloud's Ethical AI labeling
> system (https://nextcloud.com/blog/nextcloud-ethical-ai-rating/). Maybe
> it would be worth thinking about adopting something like that in KDE as
> well. Who would be interested in discussing this? We have it on the agenda
> for the upcoming Goals sprint at the end of April, but it might be worth
> extending this discussion if there is broader interest.
>
> --
> Cornelius Schumacher 
>


Re: Licensing for models and datasets

2024-03-30 Thread Cornelius Schumacher

On 26.03.24 17:33, Volker Krause wrote:
> On Monday, 25 March 2024 15:17:48 CET Halla Rempt wrote:
>> We're looking into adding an experimental AI-based feature to Krita:
>> automated inking. That gives us three components, and we're not sure about
>> the license we should use for two of them: the model and the dataset. Would
>> CC be best here?
>
> Looking at https://community.kde.org/Policies/Licensing_Policy the closest
> thing would either be "media" files (generalized to "data files") and thus
> CC-BY-SA (and presumably CC-BY/CC0) or "source code" (xGPL, BSD/MIT).


I don't think we can directly use the current licensing policy for ML 
models and datasets. But I suppose we should discuss extending it to 
cover these use cases as well.


CC-BY or CC-BY-SA are not the best choice for data as their attribution 
requirements can make it impractical to work with data under these 
licenses. There are some good arguments why data should rather not be 
licensed at all 
(https://plus.pli.edu/Details/Details?fq=id:(352066-ATL2)). This would 
suggest using CC0 as the closest practical form of that.


For models, attribution requirements seem to be less of an issue. But as
Volker described, the copyright situation is quite complicated, and it's
not yet clear what consequences this will have in the future. From this
point of view a permissive license could be a good choice, as it is
unlikely to create problems in the future. As the MIT license is already
mentioned in the licensing policy, maybe this is the best choice?


In addition to the licensing itself it could also be good to consider 
how to convey more information about the openness of the system. Even if 
it wouldn't make a difference in terms of copyright for the user of a 
model, it still might be preferable to use models which are trained on 
free and open data. Some kind of labeling and making this transparent to 
end users could be a solution to that.


In the context of the Sustainable Software goal we have a bit of 
discussion around the labeling. There are some ongoing efforts, such as 
OSI's attempt to define what Open AI actually should mean 
(https://opensource.org/deepdive), or Nextcloud's Ethical AI labeling 
system (https://nextcloud.com/blog/nextcloud-ethical-ai-rating/). Maybe 
it would be worth thinking about adopting something like that in KDE as 
well. Who would be interested in discussing this? We have it on the agenda
for the upcoming Goals sprint at the end of April, but it might be worth
extending this discussion if there is broader interest.


--
Cornelius Schumacher 


Re: Licensing for models and datasets

2024-03-26 Thread Andrius Štikonas
There is also this document by Debian's Deep Learning Team that is worth 
looking at:

https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst

There they make a distinction between a model and its artifacts, and if a
model is non-free or trained on non-free data, then they consider its output
artifacts to be proprietary.

Andrius 

On Tuesday, 26 March 2024 16:33:56 GMT Volker Krause wrote:
> On Monday, 25 March 2024 15:17:48 CET Halla Rempt wrote:
> > We're looking into adding an experimental AI-based feature to Krita:
> > automated inking. That gives us three components, and we're not sure about
> > the license we should use for two of them: the model and the dataset. Would
> > CC be best here?
> 
> Looking at https://community.kde.org/Policies/Licensing_Policy the closest
> thing would either be "media" files (generalized to "data files") and thus
> CC-BY-SA (and presumably CC-BY/CC0) or "source code" (xGPL, BSD/MIT).
> 
> I think this is a bit more tricky though, depending on whether we assume a
> model is derivative work of the input data, and whether the output generated
> from a model is derivative work of the model (and thus potentially
> derivative work of the input data). The industry assumption so far seems to
> be that at least one of those isn't derivative work (AFAIK that has yet to
> be legally tested though), but I'm not sure that interpretation is in the
> best interest of FOSS developers or artists...
> 
> One scenario that would work regardless I think is using a license with
> practically no constraints (CC0, MIT, etc), but that also offers no
> protection for the training or model data (which might or might not be what
> you want).
> 
> Any other scenario I can think of involving more protective licenses runs
> into interesting issues:
> - if the output is derivative work, Krita users would be bound by e.g. the
> attribution or share-alike requirements of the license (which I guess is not
> what you want).
> - a Bison/Flex style "code generator exception" to state that the model
> output is free of any license requirements regardless of the model license
> itself requires that either the model isn't derivative work of the input or
> that the input data is licensed in a way compatible with that.
> - In the latter case we are back to essentially unprotected CC0-like input,
> or a protective license with a special exception, which then gets awfully
> close to developing new licenses.
> 
> So I guess this boils down to how much protection you have in mind for the
> input and model data?
> 
> Interesting topic, sorry if my ramblings on this are of limited help :)
> 
> Regards,
> Volker






Re: Licensing for models and datasets

2024-03-26 Thread Tymon Dąbrowski
> If it's just links and metadata, then one of the various CCs is fine.
> Some popular datasets include entire images, in which case... I don't
> know. I'd avoid that...
Well, we'll have the actual data, not just links. We might not release it
if we don't want to, though.

> 2.) If we don't own the data used to "train" the binary blob model, do we
> even own the model?
I'd just ask the artist to license it all under CC-0, then we can use it
however we want. (CC-BY could already be too much, since you could argue
that the model is a derivative of the data, and the final picture a
derivative of the model, and therefore a derivative of the data).

Remember that we don't have those images or dataset yet and we can just
choose those that fit our needs, including the licensing.



On Mon, 25 Mar 2024 at 17:33, Emmet O'Neill wrote:

> 1.) Does the dataset contain full, original training images or just plain
> text links to images?
>
> If it's just links and metadata, then one of the various CCs is fine.
> Some popular datasets include entire images, in which case... I don't
> know. I'd avoid that...
>
> 2.) If we don't own the data used to "train" the binary blob model, do we
> even own the model?
>
> Obviously only the owner of some work can license it to others.
> We're kind of in legal no-man's land with all this stuff, so I don't know
> and I don't expect you to know either, but it feels like due diligence to
> consider it.
> Does Intel have any suggestions about this? What do they do?
>
> Even doing my best to put my personal (well-documented) feelings on
> copyright and generative AI aside, I don't understand the
> legal/licensing mechanics of all this stuff well enough to help you make
> a good judgement here.
> If nothing else I'm curious to see where this stuff leads Krita.
>
> On Mon, Mar 25, 2024 at 7:17 AM Halla Rempt  wrote:
>
>> We're looking into adding an experimental AI-based feature to Krita:
>> automated inking. That gives us three components, and we're not sure about
>> the license we should use for two of them: the model and the dataset. Would
>> CC be best here?
>>
>> Halla
>>
>>
>>


Re: Licensing for models and datasets

2024-03-26 Thread Volker Krause
On Monday, 25 March 2024 15:17:48 CET Halla Rempt wrote:
> We're looking into adding an experimental AI-based feature to Krita:
> automated inking. That gives us three components, and we're not sure about
> the license we should use for two of them: the model and the dataset. Would
> CC be best here?

Looking at https://community.kde.org/Policies/Licensing_Policy the closest 
thing would either be "media" files (generalized to "data files") and thus
CC-BY-SA (and presumably CC-BY/CC0) or "source code" (xGPL, BSD/MIT).

I think this is a bit more tricky though, depending on whether we assume a 
model is derivative work of the input data, and whether the output generated 
from a model is derivative work of the model (and thus potentially derivative 
work of the input data). The industry assumption so far seems to be that at 
least one of those isn't derivative work (AFAIK that has yet to be legally 
tested though), but I'm not sure that interpretation is in the best interest 
of FOSS developers or artists...

One scenario that would work regardless I think is using a license with 
practically no constraints (CC0, MIT, etc), but that also offers no protection 
for the training or model data (which might or might not be what you want).

Any other scenario I can think of involving more protective licenses runs into 
interesting issues:
- if the output is derivative work, Krita users would be bound by e.g. the 
attribution or share-alike requirements of the license (which I guess is not 
what you want).
- a Bison/Flex style "code generator exception" to state that the model output 
is free of any license requirements regardless of the model license itself 
requires that either the model isn't derivative work of the input or that the 
input data is licensed in a way compatible with that.
- In the latter case we are back to essentially unprotected CC0-like input, or 
a protective license with a special exception, which then gets awfully close 
to developing new licenses.

So I guess this boils down to how much protection you have in mind for the 
input and model data?

Interesting topic, sorry if my ramblings on this are of limited help :)

Regards,
Volker



Re: Licensing for models and datasets

2024-03-25 Thread Halla Rempt
Thanks for your input!

On Monday, 25 March 2024 16:36:36 CET Gilles Caulier wrote:
> Hi,
> 
> In digiKam, we use plenty of AI models to perform face detection, face
> recognition, eye detection, photo quality assessment (blur, noise,
> compression, etc.), and photo subject detection for automatic keyword
> generation (monuments, animals, plants, objects, places, etc.).
> 
> Due to licensing and the heavy size of files, all models are stored
> outside the source code and downloaded on demand by the user.
> 
> https://files.kde.org/digikam/facesengine/
> https://files.kde.org/digikam/aestheticdetector/
> https://files.kde.org/digikam/autotags/
> 
> Voilà, if this can help you...
> 
> Best regards
> 
> Gilles Caulier
> 
> On Mon, 25 Mar 2024 at 15:18, Halla Rempt wrote:
> >
> > We're looking into adding an experimental AI-based feature to Krita:
> > automated inking. That gives us three components, and we're not sure about
> > the license we should use for two of them: the model and the dataset. Would
> > CC be best here?
> >
> > Halla
> >
> >
> 






Re: Licensing for models and datasets

2024-03-25 Thread Gilles Caulier
Hi,

In digiKam, we use plenty of AI models to perform face detection, face
recognition, eye detection, photo quality assessment (blur, noise,
compression, etc.), and photo subject detection for automatic keyword
generation (monuments, animals, plants, objects, places, etc.).

Due to licensing and the heavy size of files, all models are stored
outside the source code and downloaded on demand by the user.

https://files.kde.org/digikam/facesengine/
https://files.kde.org/digikam/aestheticdetector/
https://files.kde.org/digikam/autotags/

Voilà, if this can help you...

Best regards

Gilles Caulier

On Mon, 25 Mar 2024 at 15:18, Halla Rempt wrote:
>
> We're looking into adding an experimental AI-based feature to Krita:
> automated inking. That gives us three components, and we're not sure about
> the license we should use for two of them: the model and the dataset. Would CC
> be best here?
>
> Halla
>
>


Licensing for models and datasets

2024-03-25 Thread Halla Rempt
We're looking into adding an experimental AI-based feature to Krita: automated 
inking. That gives us three components, and we're not sure about the license we 
should use for two of them: the model and the dataset. Would CC be best here?

Halla