Re: A policy on use of AI-generated content in Debian

2024-05-08 Thread Tiago Bortoletto Vaz
This is more of a general note regarding this thread.

Apparently we are far from a consensus on an official Debian position
regarding the use of generative AI in the project as a whole.
We'll therefore make do with the resources we have and let each team
handle such content using its own criteria -- not what I expected...
but expectations adjusted and all is fine :-)

I'd be particularly happy to incorporate suggestions from Zack and
others in the areas I work on in Debian. Thanks anyway to everyone for
the input, especially Mo Zhou, Russ and Zack. I hope this debate will
come up again at a time when we better understand the consequences of
all this.

On 2024-05-03 21:32, Mo Zhou wrote:
[...]

Bests,

--
tvaz



Re: Debian acronyms (was: Re: A policy on use of AI-generated content in Debian)

2024-05-06 Thread Andreas Ronnquist
On Mon, 6 May 2024 16:24:17 -0300,
Jack Warkentin wrote:

>Hi Everybody
>
>I am an 86-year-old long-time user of Debian GNU/Linux (at least 20 years).
>And I subscribe to the debian-project mailing list as well as a couple of
>other debian mailing lists. I sometimes have problems understanding
>abbreviations and acronyms used on these lists and occasionally in package
>documentation.
>
>While reading this thread I could not understand the "NM" acronym (and some
>other abbreviations as well). I finally found out NM's meaning by looking at
>https://www.debian.org/sitemap and reading "Debian New Members Corner".
>
>It would be helpful if Debian would create (and keep up-to-date) a web page
>of acronyms and abbreviations used by Debian literati. Or is there already
>such a page, but not listed on the site map?
>
>Regards
>

Something like this is available on the wiki:

https://wiki.debian.org/Glossary

best
-- Andreas Rönnquist
mailingli...@gusnan.se
gusnan@debian.org



Re: Debian acronyms (was: Re: A policy on use of AI-generated content in Debian)

2024-05-06 Thread Judit Foglszinger
Hi,

> It would be helpful if Debian would create (and keep up-to-date) a web
> page of acronyms and abbreviations used by Debian literati. Or is there
> already such a page, but not listed on the site map?

There is one in the wiki - https://wiki.debian.org/Glossary




Debian acronyms (was: Re: A policy on use of AI-generated content in Debian)

2024-05-06 Thread Jack Warkentin

Hi Everybody

I am an 86-year-old long-time user of Debian GNU/Linux (at least 20
years). And I subscribe to the debian-project mailing list as well as a
couple of other debian mailing lists. I sometimes have problems
understanding abbreviations and acronyms used on these lists and
occasionally in package documentation.

While reading this thread I could not understand the "NM" acronym (and
some other abbreviations as well). I finally found out NM's meaning by
looking at https://www.debian.org/sitemap and reading "Debian New
Members Corner".

It would be helpful if Debian would create (and keep up-to-date) a web
page of acronyms and abbreviations used by Debian literati. Or is there
already such a page, but not listed on the site map?


Regards

Jack

Jack Warkentin, phone 902-404-0457, email j...@eastlink.ca
39 Inverness Avenue, Halifax, Nova Scotia, Canada, B3P 1X6



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Mo Zhou

On 5/3/24 12:10, Stefano Zacchiroli wrote:

On that front, useful "related work" are the policies that scientific
journals and conferences (which are exposed *a lot* to this, given their
main activity is vetting textual documents) have put in place about
this.

Indeed. Here are some examples:
Nature: https://www.nature.com/nature-portfolio/editorial-policies/ai
ICML: https://icml.cc/Conferences/2023/llm-policy
CVPR: https://cvpr.thecvf.com/Conferences/2024/ReviewerGuidelines
  https://cvpr.thecvf.com/Conferences/2024/AuthorGuidelines

Some additional points to the two from Stefano:
1. Nature does not allow an LLM to be listed as an author.
2. CVPR holds authors who use LLMs responsible for any faults in the
   LLM's output.
3. CVPR agrees that reviewers who skip their review work by delegating
   it to an LLM are harming the community.

The general policy usually contains two main points (paraphrased below):

(1) You are free to use AI tools to *improve* your content, but not to
 create it from scratch for you.

Polishing language is the case where I find LLMs most useful. But in
fact, as an author, when I really care about the quality of what I
write, I find even state-of-the-art LLMs (such as GPT-4) poor in logic
and poor at understanding my deeper insights. They eventually turn into
a smart language tutor for me.

(2) You need to disclose the fact you have used AI tools, and how you
 have used them.

Yes, it is commonly encouraged to acknowledge the use of AI tools.

Exactly as in your case, Tiago, people managing scientific journals and
conferences have absolutely no way of checking if these rules are
respected or not. (They have access to large-scale plagiarism detection
tools, which is a related but different concern.) They just ask people
to *state* they followed this policy upon submission, but that's it.

If a cheater who uses an LLM is lazy enough not to edit the LLM outputs
at all, you will find it quite easy to identify on your own whether a
chunk of text was produced by an LLM. For example, I used ChatGPT
basically every day in March, and its answers always feel like they are
organized in the same format. No human answers questions in the same
boring format all the time.

If your main concern is people using LLMs or the like in some of the
processes you mention, a checkbox requiring such a statement upon
submission might go a longer way than a project-wide statement (which
will sit in d-d-a, unknown to NM applicants a few years from now).

In the long run, there is no way to enforce a ban on the use of AI
across this project. What is doable, from my point of view, is to
confirm that a person acknowledges the issues, potential risks, and
implications of using AI tools, and to hold people who use AI
responsible for the AI's faults.

After all, it's easy to identify one's intention in using AI -- it is
either good or bad. If NM applicants can easily get the answer to an NM
question from an AI, maybe it is time to refresh the question? After
all, nobody can stop someone from learning from AI outputs when they
need suggestions or reference answers -- and they are responsible for
the answer if the AI is wrong.

Apart from deliberately bad acts carried out with AI, one thing that
seems benign but is harmful to the community is slacking off and
skipping important work with AI. But still, this can be covered by a
single rule as well -- "Let the person who uses AI be responsible for
the AI's faults."

Simple, and doable.



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Sam Hartman
> "Tiago" == Tiago Bortoletto Vaz  writes:

Tiago> Hi Jose,
Tiago> Thanks for you input, I have a few comments:

Tiago> On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
>> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
>> > Right, note that they acknowledged this policy is a work in
>> > progress. Not perfect, but 'something needed to be done, quickly'.
>> > It's hard to find a balance here, but I kind of share this sense of
>> > urgency.
>> >
>> > [...]
>> >
>> > This point resonates with problems we might be facing already, for
>> > instance in the NM process and also in DebConf submissions (there's
>> > no point in going into details here because so far we can't prove
>> > anything, and even if we could, of course we wouldn't bring any of
>> > those involved into the public arena). So I'm actually more
>> > concerned about LLMs being mindlessly applied in our communication
>> > processes (NM, BTS, DebConf, IRC, Planet, wiki, website, debian.net
>> > stuff, etc.) than about someone using AI-assisted code in our
>> > infra, at least for now.
>> >
>> 
>> Hi Tiago,
>> 
>> It seems you have more context than the rest, which provides a sense
>> of urgency for you, where others do not have this same information
>> and can't share this sense of urgency.

Tiago> Yes.

Oh, wow, I had no idea that your argument for urgency came from the NM
case.

I actually think that NM would not benefit from a policy here.
We already have a fairly good standard: did you prove to your
application manager, your advocates, and the reviewers (FD or DAM as
appropriate) that you can be trusted and that you have the necessary
technical and other skills to be a DD?

I think it's fairly clear that using an LLM to answer questions in the
NM process does not show that you have the technical skills.
(Using it instead of reading a man page for similar results and then
going and doing the work might be fine, but cutting and pasting an
answer to an application question into the message you send to your AM
clearly doesn't demonstrate your own technical skill.)

As an AM, I would find that an applicant using an LLM as anything more
than a possibly-incorrect man page, without telling me, had violated my
trust.  I don't need a policy to come to that conclusion.  I don't
think I would have any trouble convincing DAM or FD to back my
decision.

I think coming up with a policy for this situation is going to be
tricky.

Do I mind an applicant asking an LLM to refresh their memory on how to
import a new upstream version?
No, not at all.
Do they need to cite the LLM in their answer?
If it really is a memory refresh and they know the material well enough
to have confidence that the LLM answer is correct, I do not think they
need to cite it.
If they don't know the material well enough to know the LLM is correct,
then an LLM is a bad choice.

But the same is true of a human I might ask.
If I asked you to remind me of something about importing a new upstream,
and it really was just a reminder, I would not cite your contribution
unless I used a significant chunk of text you had written.
If you gave me bad info and I didn't catch it, then we learn I probably
should not be trusted to pick good sources for my education.

--Sam



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Russ Allbery
Stefano Zacchiroli  writes:

> (1) You are free to use AI tools to *improve* your content, but not to
> create it from scratch for you.

> This point is particularly important for non-native English speakers,
> who can benefit a lot more than natives from tool support for tasks
> like proofreading/editing. I suspect the Debian community might be
> particularly sensitive to this argument. (And note that on this one
> the barrier between ChatGPT-based proofreading and other grammar/
> style checkers will become more and more blurry in the future.)

This underlines a key point to me, which is that "AI" is a marketing term,
not a technical classification.  Even LLMs, a more technical
classification, can be designed to do different things, and I expect
hybrid models to become more widespread as the limitations of trying to do
literally everything via an LLM become more apparent.

Grammar checkers, automated translation, and autocorrect are all useful
tools in their appropriate place.  Some people have moral concerns about
how they're constructed and other people don't.  I'm not sure we'll have a
consensus on that.  So far, at least, there don't seem to be the sort of
legal challenges for those types of applications that there are for the
"write completely new text based on a prompt" tyle of LLM.

Just on a personal note, I do want to make a plea to non-native English
speakers to not feel like you need to replace your prose with something
generated by an LLM.

I don't want to understate the benefits of grammar checking, translation,
and other tools, and I don't want to underestimate the frustration and
difficulties in communicating in a non-native language.  I think ethical
tools to assist with that are great.  But I would much rather puzzle out
odd or less-than-fluent English, extend assumptions of good will, and work
through the occasional misunderstanding, if that means I can interact with
a real human voice.

I know, I know, supposedly this is all getting better, but so much of the
text produced by ChatGPT and similar tools today sounds like a McKinsey
consultant trying to sell war crimes to a marketing executive.  Yes, it's
precisely grammatical and well-structured English.  It's also sociopathic,
completely soulless, and almost impossible to concentrate on because it's
full of the sort of slippery phrases and opaque verbosity of a politician
trying to distract from some sort of major scandal.  I want to talk to
you, another human being, not to an LLM trained to sound like a corporate
web site.

-- 
Russ Allbery (r...@debian.org)  



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Tiago Bortoletto Vaz
Hi Jose,

Thanks for you input, I have a few comments:

On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> > Right, note that they acknowledged this policy is a work in progress.
> > Not perfect, but 'something needed to be done, quickly'. It's hard to
> > find a balance here, but I kind of share this sense of urgency.
> >
> > [...]
> >
> > This point resonates with problems we might be facing already, for
> > instance in the NM process and also in DebConf submissions (there's no
> > point in going into details here because so far we can't prove
> > anything, and even if we could, of course we wouldn't bring any of
> > those involved into the public arena). So I'm actually more concerned
> > about LLMs being mindlessly applied in our communication processes
> > (NM, BTS, DebConf, IRC, Planet, wiki, website, debian.net stuff, etc.)
> > than about someone using AI-assisted code in our infra, at least for
> > now.
> >
> 
> Hi Tiago,
> 
> It seems you have more context than the rest, which provides a sense of
> urgency for you, where others do not have this same information and
> can't share this sense of urgency.

Yes.

> If I were to guess based on the little context you shared, I would say
> there's someone doing an NM application using an LLM, answering stuff
> with an LLM and passing all their communications through LLMs.
> 
> In that case, there's even less point in making a policy about it, in my
> opinion, since, as you stated, you can't prove anything, and ultimately
> it would land in the hands of the people approving submissions or NMs to
> judge whether the person is qualified or not. And you can't block
> communications containing LLM-generated content when you can't even
> prove it's LLM-generated content. How would you enforce it?

Hmm, I tend to disagree here. Proving by investigation isn't the only way
to get at the truth of the situation. We can get it by simply asking the
person whether they used an LLM to generate their work (be it an answer to
NM questions, a contribution to the Debian website, or an email to this
mailing list...). In that scenario, having a policy, a position statement
or even a gentle guideline would make a huge difference in the ongoing
exchange.

> And I doubt a statement would do much either. What would be
> communicated? "Communications produced by LLMs are troublesome"? I don't
> know that there's enough substance for a statement of that sort.

Just to set the scene a little on how I think about the issue: when I
brought up this discussion, I didn't have in mind someone evil attempting
to use AI to deliberately disrupt the project. We already know that
policies or statements are never sufficient to deal with people in that
category. Rather, I see many people (mostly younger contributors) who are
getting used to using LLMs in their daily lives in a quite mindless way --
which of course is not our business if they do so in their private lives.
However, the issues that can arise from using this kind of technology
without much consideration in a community like Debian are not obvious to
everyone, and I don't expect every Debian contributor to have a
sufficiently good understanding of the matter, or maturity, at the moment
they start contributing to the project. We can draw some analogy here with
the CoC and the Diversity Statement. They might seem quite obvious to
some, and less so to others.

So far I've sensed a certain resistance to adopting something as sharp as
Gentoo did (a resistance I've already said I agree with). However, I still
have the feeling that a position in the form of a statement or even a
guideline could help us both avoid and mitigate possible problems in the
future.

Bests,

--
tvaz



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Stefano Zacchiroli
On Thu, May 02, 2024 at 08:21:28PM -0400, Tiago Bortoletto Vaz wrote:
> So I'm actually more concerned about LLMs being mindlessly applied in
> our communication processes (NM, BTS, DebConf, IRC, Planet, wiki,
> website, debian.net stuff, etc.) than about someone using AI-assisted
> code in our infra, at least for now.

On that front, useful "related work" are the policies that scientific
journals and conferences (which are exposed *a lot* to this, given their
main activity is vetting textual documents) have put in place about
this.

The general policy usually contains two main points (paraphrased below):

(1) You are free to use AI tools to *improve* your content, but not to
create it from scratch for you.

This point is particularly important for non-native English speakers,
who can benefit a lot more than natives from tool support for tasks
like proofreading/editing. I suspect the Debian community might be
particularly sensitive to this argument. (And note that on this one
the barrier between ChatGPT-based proofreading and other grammar/
style checkers will become more and more blurry in the future.)

(2) You need to disclose the fact you have used AI tools, and how you
have used them.

Exactly as in your case, Tiago, people managing scientific journals and
conferences have absolutely no way of checking if these rules are
respected or not. (They have access to large-scale plagiarism detection
tools, which is a related but different concern.) They just ask people
to *state* they followed this policy upon submission, but that's it.

If your main concern is people using LLMs or the like in some of the
processes you mention, a checkbox requiring such a statement upon
submission might go a longer way than a project-wide statement (which
will sit in d-d-a, unknown to NM applicants a few years from now).

Cheers
-- 
Stefano Zacchiroli . z...@upsilon.cc . https://upsilon.cc/zack  _. ^ ._
Full professor of Computer Science  o o   o \/|V|\/
Télécom Paris, Polytechnic Institute of Paris o o o   <\>
Co-founder & CTO Software Heritageo o o o   /\|^|/\
https://twitter.com/zacchiro . https://mastodon.xyz/@zacchiro   '" V "'



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Jose-Luis Rivas
On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> Right, note that they acknowledged this policy is a work in progress.
> Not perfect, but 'something needed to be done, quickly'. It's hard to
> find a balance here, but I kind of share this sense of urgency.
>
> [...]
>
> This point resonates with problems we might be facing already, for
> instance in the NM process and also in DebConf submissions (there's no
> point in going into details here because so far we can't prove
> anything, and even if we could, of course we wouldn't bring any of
> those involved into the public arena). So I'm actually more concerned
> about LLMs being mindlessly applied in our communication processes
> (NM, BTS, DebConf, IRC, Planet, wiki, website, debian.net stuff, etc.)
> than about someone using AI-assisted code in our infra, at least for
> now.
>

Hi Tiago,

It seems you have more context than the rest, which provides a sense of
urgency for you, where others do not have this same information and
can't share this sense of urgency.

If I were to guess based on the little context you shared, I would say
there's someone doing an NM application using an LLM, answering stuff
with an LLM and passing all their communications through LLMs.

In that case, there's even less point in making a policy about it, in my
opinion, since, as you stated, you can't prove anything, and ultimately
it would land in the hands of the people approving submissions or NMs to
judge whether the person is qualified or not. And you can't block
communications containing LLM-generated content when you can't even
prove it's LLM-generated content. How would you enforce it?

And I doubt a statement would do much either. What would be
communicated? "Communications produced by LLMs are troublesome"? I don't
know that there's enough substance for a statement of that sort.

OTOH, an LLM-assisted rewrite of your own content may help non-native
English speakers write better and communicate more effectively. Hence,
saying "communications produced by LLMs are troublesome" would itself be
troublesome, since how can you, as a receiver, tell whether it's their
own content or someone else's?

Some may say "a statement could at least be used as a pointer to say
'these are our expectations regarding use of AI'", but ultimately it is
in the hands of those judging to filter it out or not. And if those
judging can't even prove whether AI was used, what's the point?

I can't see the point of "something needs to be done" without clear
reasoning about what that something is expected to achieve.

--Jose



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Andrey Rakhmatullin
On Fri, May 03, 2024 at 01:04:29PM +0900, Charles Plessy wrote:
> If I were to hear that other Debian developers use them in that
> context, I would seriously question whether there is any value in
> spending my volunteer time keeping debian/copyright files accurate to
> the level of detail our Policy asks for.
There is a popular old opinion, unrelated to AI, that there is not.

-- 
WBR, wRAR




Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Dominik George
Hi,

>> Generative AI tools **produce** derivatives of other people's copyrighted 
>> works.
>
>They *can* do that, but so can humans (and will). Humans look at a
>product or code and write new code that sometimes resembles the
>original very much.

Can I ask the LLM where it probably got its inspiration?

Can I show the LLM another work and ask it whether there might be a chance
it got inspired by that work (and get a different answer than that it
probably sucked in everything, so yes)?

Is there a chance that the LLM did not only read some source code, but
also had friendly interactions with its author?

Admittedly, at that point we get into philosophical questions, which I
don't consider any less important for free software.

-nik



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Ansgar 



On Thu, 2024-05-02 at 20:47 +0200, Dominik George wrote:
> > It's just another tool that might or might not be non-free like people
> > using Photoshop, Google Chrome, Gmail, Windows, ... to make
> > contributions. Or a spamfilter to filter out some.
> 
> That's entirely not the point.
> 
> It is not about **the tool** being non-free, but the result of its use being 
> non-free.
> 
> Generative AI tools **produce** derivatives of other people's copyrighted 
> works.

They *can* do that, but so can humans (and will). Humans look at a
product or code and write new code that sometimes resembles the
original very much.

The claim "everything a generative AI tool outputs is a derivative
work" would be rather bold.

> That said, we already have the necessary policies in place:
> 
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
> 
> Both are not possible using generative AI.

They are, just as they are for human results.  The preferred form of
modification can be based on the output from a generative AI,
especially if it is further edited.

But this is not something new: a camera or microphone records data, but
we use the captured data (and not the original source) as the preferred
form of modification. Sometimes even after generous preprocessing by
(non-free) firmware.

(We don't include human neural network data as part of the preferred
form of modification either ;-))

Ansgar




Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Charles Plessy
Le Thu, May 02, 2024 at 02:01:20PM -0400, Tiago Bortoletto Vaz a écrit :
> 
> I would like Debian to discuss and decide on the usage of AI-generated content
> within the project.

Hi Tiago,

as a Debian developer I refrain from using commercial AI to generate
code used in my packaging work or native packages, because I think that
these systems are copyright laundering machines that allow one to siphon
off the energy invested in Free Software and transfer it into proprietary
works (and to a lesser extent into un-GPLed works).

If I were to hear that other Debian developers use them in that context,
I would seriously question whether there is any value in spending my
volunteer time keeping debian/copyright files accurate to the level of
detail our Policy asks for.  When the world, and we ourselves, have
given up on respecting Free Software copyrights and passing on
attribution, I will not see the point in spending time doing more than
the bare minimum -- for instance, as in Anaconda, where you just get
"License: MIT" and the right to download the sources and check the years
of attribution and names of contributors yourself.

This said, I have not found time to try debgpt, and I feel guilty about
this.  If there were a tool trained on source code whose authors gave
their consent, whose license terms were compatible, and which provided
its output under the most viral terms available (probably the AGPL), I
would love to use it and give attribution to the community of
contributors to the software.

So in summary, I would probably vote for a GR calling against the use of
current commercial AI for generating Debian packaging, native, or
infrastructure code, unless of course good arguments against this are
provided.  This said, I think that we cannot and should not police
people who do not respect the call.

Have a nice day,

Charles

-- 
Charles Plessy Nagahama, Yomitan, Okinawa, Japan
Debian Med packaging team http://www.debian.org/devel/debian-med
Tooting from work,   https://fediscience.org/@charles_plessy
Tooting from home, https://framapiaf.org/@charles_plessy



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Tiago Bortoletto Vaz
Hi Russ,

On Thu, May 02, 2024 at 11:59:10AM -0700, Russ Allbery wrote:
> Tiago Bortoletto Vaz  writes:
> 
> > I personally agree with the author's rationale on the aspects pointed
> > out (copyright, quality and ethical ones). But at this point I guess we
> > might have more questions than answers, which is why I think it'd be
> > helpful to have some input before suggesting any concrete proposals.
> > Perhaps the most important step now is to get an idea of how Debian
> > folks actually feel about this matter, and how we feel about moving in
> > a similar direction to what the Gentoo project did.
> 
> I'm dubious of the Gentoo approach because it is (as they admit)
> unenforceable, which to me means that it's not a great policy.  A position
> statement, maybe, but that's a different sort of thing.

Right, note that they acknowledged this policy is a work in progress. Not
perfect, but 'something needed to be done, quickly'. It's hard to find a
balance here, but I kind of share this sense of urgency.

Also, I agree that a statement is indeed a more appropriate tool for the
circumstance. Although I mentioned Gentoo's policy, I acknowledge that I
should have worded the title of my first message better, as proposing a
policy (in Debian terms a policy is really a *Policy* and has huge
implications!) didn't reflect my intentions very well.

[...]

> About the only statement that I've wanted to make so far is to say that
> anyone relying on AI to summarize important project resources like Debian
> Policy or the Developers Guide or whatnot is taking full responsibility
> for any resulting failures.  If you ask an AI to read Policy for you and
> it spits out nonsense or lies, this is not something the Policy Editors
> have any time or bandwidth to deal with.

This point resonates with problems we might be facing already, for
instance in the NM process and also in DebConf submissions (there's no
point in going into details here because so far we can't prove anything,
and even if we could, of course we wouldn't bring any of those involved
into the public arena). So I'm actually more concerned about LLMs being
mindlessly applied in our communication processes (NM, BTS, DebConf, IRC,
Planet, wiki, website, debian.net stuff, etc.) than about someone using
AI-assisted code in our infra, at least for now.

Again, I correct myself and emphasize that I would rather discuss a possible
statement than a policy for Debian on this matter.

Bests,

--
tvaz



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread G. Branden Robinson
At 2024-05-02T16:31:39-0600, Sam Hartman wrote:
>> Generative AI tools **produce** derivatives of other people's
>> copyrighted works.
> 
>> That said, we already have the necessary policies in place:
> 
> Russ pointed out this is a fairly complicated claim.
> 
> It is absolutely true that generative AI models have produced output
> that contains copyrighted text.
> 
> The questions of whether that text is an infringing derivative of
> those copyrighted works are making their way through a number of law
> suits in my country at least.
>
> And as Russ points out the moral issues are going to be even harder to
> figure out.
> 
> I don't think it is as simple as you write above, and I agree with
> Russ's thoughts on the situation.

I would also note that some media organizations such as the Associated
Press have a really expansive view of what constitutes copyright
infringement.

Here's an example of the boilerplate the AP puts at the bottom of their
articles.

"Copyright 2024 The Associated Press. All rights reserved. This material
may not be published, broadcast, rewritten or redistributed."[1]

Published, broadcast, or redistributed--sure, these fall within the
domain of actions regulated by the Copyright Act.

But _rewritten_?

"You may not take the facts adduced here and express them in an original
manner, or we'll sue your ass."

Their legal position is that they have a moral right to extract rent
from you if you in any way benefit via the written word from knowledge
in your head that may have come to you through their reportage.

The copyright cartels will charge us fees for each and every thought we
think if they can just find a way to automate the process.  I can see
Disney, Apple, Comcast, and Springer-Verlag sitting around a conference
table, like Dick Cheney's "Energy Task Force" in the run up to the Iraq
War, carving the domains of human thought up like a map of Iraqi oil
fields, haggling over the boundaries between the sectors of topical
cogitation allocated to each firm, with nothing remaining at the end.

It's long past time their Bastille was stormed.  As with the gunpowder
stored in the building, I expect generative AI to be used by both sides.

Regards,
Branden

[1] 
https://www.usnews.com/news/business/articles/2024-03-12/the-new-york-times-is-fighting-off-wordle-look-alikes-with-copyright-takedown-notices




Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Sam Hartman
> "Dominik" == Dominik George  writes:


Dominik> Generative AI tools **produce** derivatives of other people's
Dominik> copyrighted works.

Dominik> That said, we already have the necessary policies in place:

Russ pointed out this is a fairly complicated claim.

It is absolutely true that generative AI models have produced output
that contains copyrighted text.

The questions of whether that text is an infringing derivative of those
copyrighted works are making their way through a number of lawsuits, in
my country at least.
And as Russ points out the moral issues are going to be even harder to
figure out.

I don't think it is as simple as you write above, and I agree with
Russ's thoughts on the situation.

--Sam



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Tiago Bortoletto Vaz


On Thu, May 02, 2024 at 08:47:31PM +0200, Dominik George wrote:
> Hi,
> 
> >It's just another tool that might or might not be non-free like people
> >using Photoshop, Google Chrome, Gmail, Windows, ... to make
> >contributions. Or a spamfilter to filter out some.
> 
> That's entirely not the point.
> 
> It is not about **the tool** being non-free, but the result of its use being 
> non-free.
> 
> Generative AI tools **produce** derivatives of other people's copyrighted 
> works.
> 
> That said, we already have the necessary policies in place:
> 
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
> 
> Both are not possible using generative AI.

That sounds right, but those policies relate mostly to Debian packages,
while Debian also releases a fair amount of other kinds of content.

Bests,

--
tvaz



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Mo Zhou

On 5/2/24 14:01, Tiago Bortoletto Vaz wrote:

You might already know that recently Gentoo made a strong move in this context
and drafted their AI policy:

- https://wiki.gentoo.org/wiki/Project:Council/AI_policy
- https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html


People might not already know that I wrote this 4 years ago:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Sam Hartman
> "Ansgar" == Ansgar   writes:

Ansgar> Hi,
Ansgar> On Thu, 2024-05-02 at 14:01 -0400, Tiago Bortoletto Vaz wrote:
>> I would like Debian to discuss and decide on the usage of AI-
>> generated content within the project.

Ansgar> It's just another tool that might or might not be non-free like
Ansgar> people
Ansgar> using Photoshop, Google Chrome, Gmail, Windows, ... to make
Ansgar> contributions. Or a spamfilter to filter out some.

I tend to agree with the above.  AI is just another tool, and I trust
DDs to use it appropriately.

I probably would not use AI to write large blocks of code, because I
find that auditing the quality of AI-generated code is harder than just
writing the code in most cases.

I might:

* use debgpt to guess answers to questions about packaging that I could
  verify in some manner.

* Use generative AI to suggest names of projects, help improve
  descriptions, summarize content, etc.

* See if generative AI could help produce a message with a desired
  tone.

--Sam



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Mo Zhou



On 5/2/24 14:47, Dominik George wrote:

That's entirely not the point.

It is not about **the tool** being non-free, but the result of its use being 
non-free.

Generative AI tools **produce** derivatives of other people's copyrighted works.


Yes. That includes the case where an LLM generates copyrighted content
with large portions of overlap. For instance,
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
-- because those copyrighted contents are part of the original training
dataset.

Apart from LLMs (large language models), image generation models and
other generative AIs will also do something similar, partly copying
their copyrighted training data into the generated results, to some
extent.


That said, we already have the necessary policies in place:

* d/copyright must be accurate
* all sources must be reproducible from their preferred form of modification

Both are not possible using generative AI.


Both are possible.

For example, suppose a developer uses an LLM to aid programming, and the
LLM copies some code from a copyrighted source. The developer is very
unlikely to be able to tell whether the generated code contains a
verbatim copy of copyrighted content, let alone the source of those
copyrighted parts, if any.

Namely, if we look at new software projects, we do not know whether the
code files are purely human-written, or written with some aid from AI.

Similar things happen with other file types, such as images. For
instance, you may ask a generative AI to generate a logo, or some
artwork as part of a software project. And those generated results, with
or without further modification, can be stored in .ico, .jpg, and .png
formats, etc.

Now, the problem is, the FTP masters will not question the
reproducibility of a code file, or a .png file. If the upstream author
does not acknowledge the use of AI during the development process, it is
highly likely that nobody else on earth will know that.

This does not sound like a situation where we can take any action to
improve things. My only opinion on this is to trust the upstream
authors' acknowledgements.


BTW, ML-Policy foresaw this issue and covers it to some extent:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst
See the "Generated Artifacts" section.

It seems that the draft Open Source AI Definition does not yet cover
content generated by AI models:
https://discuss.opensource.org/t/draft-v-0-0-8-of-the-open-source-ai-definition-is-available-for-comments/315



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Russ Allbery
Tiago Bortoletto Vaz  writes:

> I personally agree with the author's rationale on the aspects pointed
> out (copyright, quality and ethical ones). But at this point I guess we
> might have more questions than answers, which is why I think it'd be
> helpful to have some input before suggesting any concrete proposals.
> Perhaps the most important step now is to get an idea of how Debian
> folks actually feel about this matter, and how we feel about moving in
> a similar direction to what the Gentoo project did.

I'm dubious of the Gentoo approach because it is (as they admit)
unenforceable, which to me means that it's not a great policy.  A position
statement, maybe, but that's a different sort of thing.

I also agree in part with Ansgar: we don't make policies against what
tools people use locally for developing software.

I think the piece that has the most direct impact on Debian is if the
output from the AI software is found to be a copyright infringement and
therefore something that Debian does not have permission to redistribute
or that violates the DFSG.  But we're going to be facing that problem with
upstreams as well, so the scope of that problem goes far beyond the
question of direct contributions to Debian, and I don't think direct
contributions to Debian will be the most significant part of that problem.

This is going to be a tricky and unsettled problem for some time, since
it's both legal (in multiple jurisdictions) and moral, and it's quite
possible that the legal judgments will not align with moral judgments.
(Around copyright, this is often the case.)  I'm dubious of our ability to
get ahead of the legal process on this, given that it's unlikely that
we'll even be able to *detect* if upstreams are using AI.  I think this is
a place where it's better to plan on being reactive than to attempt to be
proactive.  If we get credible reports that software in Debian is not
redistributable under the terms of the DFSG, we should deal with that like
we would with any other DFSG violation.  That may involve making judgment
calls about the legality of AI-generated content, but hopefully this will
have settled out a bit in broader society before we're forced to make a
decision on a specific case.

I also doubt that there is much alignment within Debian about the morality
of copyright infringement in general.  We're a big-tent project from that
perspective.  Our project includes people who believe all software
copyright is an ill-advised legal construction that limits people's
freedom, and people who believe strongly in moral rights expressed through
copyright and in the right of an author to control how their work is used.
We could try to reach some sort of project consensus on the moral issues
here, but I'm a bit dubious we would be successful.

At the moment, my biggest concern about the practical impact of AI is that
most of the output is low-quality garbage and, because it's now automated,
the volume of that low-quality garbage can be quite high.  (I am
repeatedly assured by AI advocates that this will improve rapidly.  I
suppose we will see.  So far, the evidence that I've seen has just led me
to question the standards and taste of AI advocates.)  But I don't think
dealing with this requires any new *policies*.  I think it's a fairly
obvious point of Debian collaboration that no one should deluge their
fellow project members in low-quality garbage, and if that starts
happening, I think we have adequate mechanisms to complain and ask that it
stop without making new policy.

About the only statement that I've wanted to make so far is to say that
anyone relying on AI to summarize important project resources like Debian
Policy or the Developers Guide or whatnot is taking full responsibility
for any resulting failures.  If you ask an AI to read Policy for you and
it spits out nonsense or lies, this is not something the Policy Editors
have any time or bandwidth to deal with.

-- 
Russ Allbery (r...@debian.org)  



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Dominik George
Hi,

>It's just another tool that might or might not be non-free like people
>using Photoshop, Google Chrome, Gmail, Windows, ... to make
>contributions. Or a spamfilter to filter out some.

That's entirely not the point.

It is not about **the tool** being non-free, but the result of its use being 
non-free.

Generative AI tools **produce** derivatives of other people's copyrighted works.

That said, we already have the necessary policies in place:

* d/copyright must be accurate
* all sources must be reproducible from their preferred form of modification

Both are not possible using generative AI.
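
To make the first point concrete, here is the level of per-file detail a
machine-readable debian/copyright file records (a minimal DEP-5 sketch;
the names and paths are only illustrative):

  Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
  Upstream-Name: foo

  Files: *
  Copyright: 2020-2024 Jane Upstream <jane@example.org>
  License: GPL-2+

  Files: src/parser.c
  Copyright: 2023 A. Contributor <ac@example.org>
  License: Expat

For code that came out of a generative model, nobody can fill in the
Copyright and License fields of such a stanza with any confidence.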

-nik



Re: A policy on use of AI-generated content in Debian

2024-05-02 Thread Ansgar 
Hi,

On Thu, 2024-05-02 at 14:01 -0400, Tiago Bortoletto Vaz wrote:
> I would like Debian to discuss and decide on the usage of AI-
> generated content within the project.

It's just another tool that might or might not be non-free like people
using Photoshop, Google Chrome, Gmail, Windows, ... to make
contributions. Or a spamfilter to filter out some.

> You might already know that recently Gentoo made a strong move in
> this context and drafted their AI policy:
> 
> - https://wiki.gentoo.org/wiki/Project:Council/AI_policy
> - https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html
> 

I think that is a bad policy: we don't ban Tor because it can be used
for copyright violations, terrorism, stalking, hacking or other
unethical things. We don't have a general ban on human contributions due
to quality concerns. I don't see why AI, as yet another tool, should be
different.

Ansgar



A policy on use of AI-generated content in Debian

2024-05-02 Thread Tiago Bortoletto Vaz
Hi,

I would like Debian to discuss and decide on the usage of AI-generated content
within the project. I fear that we are already facing negative consequences in
some areas of Debian as a result of its use. If I happen to be wrong, I'm
still afraid that we will face them in a very short time.

You might already know that recently Gentoo made a strong move in this context
and drafted their AI policy:

- https://wiki.gentoo.org/wiki/Project:Council/AI_policy
- https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html

I personally agree with the author's rationale on the aspects pointed out
(copyright, quality and ethical ones). But at this point I guess we might
have more questions than answers, which is why I think it'd be helpful to
have some input before suggesting any concrete proposals. Perhaps the most
important step now is to get an idea of how Debian folks actually feel
about this matter, and how we feel about moving in a similar direction to
what the Gentoo project did.

If things move in a somewhat consensual direction, I intend to bring the
discussion to -vote in order to discuss a possible GR.

Bests,

-- 
tvaz

