Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Mo Zhou

On 5/3/24 12:10, Stefano Zacchiroli wrote:

> On that front, useful "related work" are the policies that scientific
> journals and conferences (which are exposed *a lot* to this, given their
> main activity is vetting textual documents) have put in place about
> this.

Indeed. Here are some examples:
Nature: https://www.nature.com/nature-portfolio/editorial-policies/ai
ICML: https://icml.cc/Conferences/2023/llm-policy
CVPR: https://cvpr.thecvf.com/Conferences/2024/ReviewerGuidelines
  https://cvpr.thecvf.com/Conferences/2024/AuthorGuidelines

Some additional points to the two from Stefano:
1. Nature does not allow an LLM to be listed as an author.
2. CVPR holds the author who used an LLM responsible for all of the LLM's faults.
3. CVPR agrees that reviewers who skip their review work by using LLMs
    are harming the community.

> The general policy usually contains two main points (paraphrased below):
>
> (1) You are free to use AI tools to *improve* your content, but not to
>     create it from scratch for you.

Polishing language is the case where I find LLMs most useful. In fact,
as an author, when I really care about the quality of what I have written,
I find even state-of-the-art LLMs (such as ChatGPT-4) poor in logic and
poor at understanding my deeper insights. They eventually become nothing
more than a smart language tutor to me.

> (2) You need to disclose the fact you have used AI tools, and how you
>     have used them.

Yes, it is commonly encouraged to acknowledge the use of AI tools.

> Exactly as in your case, Tiago, people managing scientific journals and
> conferences have absolutely no way of checking if these rules are
> respected or not. (They have access to large-scale plagiarism detection
> tools, which is a related but different concern.) They just ask people
> to *state* they followed this policy upon submission, but that's it.

If the cheater who uses an LLM is lazy enough not to edit the LLM's output
at all, you will find it quite easy to tell on your own whether a chunk of
text was produced by an LLM. For example, I used ChatGPT basically every day
in March, and its answers always feel like they are organized in the same
format. No human answers questions in the same boring format all the time.

> If your main concern is people using LLMs or the like in some of the
> processes you mention, a checkbox requiring such a statement upon
> submission might go a longer way than a project-wide statement (which
> will sit in d-d-a unknown to n-m applicants a few years from now).

In the long run, there is no way to enforce a ban on the use of AI across
this project. What is doable, from my point of view, is to confirm that
a person acknowledges the issues, potential risks and implications of
using AI tools, and to hold people who use AI responsible for the AI's
faults.

After all, it's easy to identify a person's intention in using AI -- it is
either good or bad. If NM applicants can easily get the answer to an
NM question from an AI, maybe it is time to refresh the question? After all,
nobody can stop someone from learning from AI output when they need
suggestions or reference answers -- and they are responsible for a wrong
answer if the AI is wrong.

Apart from deliberately bad acts carried out with AI, one thing that seems
benign but is harmful to the community is slacking off and skipping important
work by delegating it to AI. But this, too, can be covered by a single rule --
"Let the person who uses AI be responsible for the AI's faults."

Simple, and doable.



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Sam Hartman
> "Tiago" == Tiago Bortoletto Vaz  writes:

Tiago> Hi Jose,
Tiago> Thanks for your input, I have a few comments:

Tiago> On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
>> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
>> > Right, note that they acknowledged this policy is a work in progress.
>> > Not perfect, but 'something needed to be done, quickly'. It's hard to
>> > find a balance here, but I kind of share this sense of urgency.
>> >
>> > [...]
>> >
>> > This point resonates with problems we might be facing already, for
>> > instance in the NM process and also in Debconf submissions (there's no
>> > point of going into details here because so far we can't prove anything,
>> > and even if we could, of course we wouldn't bring any of the involved to
>> > the public arena). So I'm actually more concerned about LLMs being
>> > mindlessly applied in our communication processes (NM, bts, debconf,
>> > irc, planet, wiki, website, debian.net stuff, etc) than about someone
>> > using some AI-assisted code in our infra, at least for now.
>> >
>> 
>> Hi Tiago,
>> 
>> It seems you have more context than the rest of us, which provides a sense
>> of urgency for you, while others do not have this same information and
>> can't share this sense of urgency.

Tiago> Yes.

Oh, wow, I had no idea that your argument for urgency came from the NM
case.

I actually think that NM would not benefit from a policy here.
We already have a fairly good standard: did you prove to your
application manager, your advocates, and the reviewers (FD or DAM, as
appropriate) that you can be trusted and that you have the necessary
technical and other skills to be a DD?

I think it's fairly clear that using an LLM to answer questions in the
NM process does not show that you have the technical skills.
(Using it instead of reading a man page for similar results and then
going and doing the work might be fine, but cutting and pasting an
answer to an application question into the message you send to your AM
clearly doesn't demonstrate your own technical skill.)

As an AM, I would find that an applicant using an LLM as more than a
possibly incorrect man page, without telling me, had violated my trust.  I
don't need a policy to come to that conclusion.  I don't think I would
have any trouble convincing DAM or FD to back my decision.

I think coming up with a policy for this situation is going to be
tricky.

Do I mind an applicant asking an LLM to refresh their memory on how to
import a new upstream version?
No, not at all.
Do they need to cite the LLM in their answer?
If it really is a memory refresh and they know the material well enough
to have confidence that the LLM's answer is correct, I do not think they
need to cite it.
If they don't know the material well enough to know whether the LLM is
correct, then an LLM is a bad choice.

But the same is true of a human I might ask.
If I asked you to remind me of something about importing a new upstream,
and it really was just a reminder, I would not cite your contribution
unless I used a significant chunk of text you had written.
If you gave me bad info and I didn't catch it, then we learn I probably
should not be trusted to pick good sources for my education.

--Sam



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Russ Allbery
Stefano Zacchiroli  writes:

> (1) You are free to use AI tools to *improve* your content, but not to
> create it from scratch for you.

> This point is particularly important for non-native English speakers,
> who can benefit a lot more than natives from tool support for tasks
> like proofreading/editing. I suspect the Debian community might be
> particularly sensitive to this argument. (And note that on this one
> the barrier between ChatGPT-based proofreading and other grammar/
> style checkers will become more and more blurry in the future.)

This underlines a key point to me, which is that "AI" is a marketing term,
not a technical classification.  Even LLMs, a more technical
classification, can be designed to do different things, and I expect
hybrid models to become more widespread as the limitations of trying to do
literally everything via an LLM become more apparent.

Grammar checkers, automated translation, and autocorrect are all useful
tools in their appropriate place.  Some people have moral concerns about
how they're constructed and other people don't.  I'm not sure we'll have a
consensus on that.  So far, at least, there don't seem to be the sort of
legal challenges for those types of applications that there are for the
"write completely new text based on a prompt" tyle of LLM.

Just on a personal note, I do want to make a plea to non-native English
speakers to not feel like you need to replace your prose with something
generated by an LLM.

I don't want to understate the benefits of grammar checking, translation,
and other tools, and I don't want to underestimate the frustration and
difficulties in communicating in a non-native language.  I think ethical
tools to assist with that are great.  But I would much rather puzzle out
odd or less-than-fluent English, extend assumptions of good will, and work
through the occasional misunderstanding, if that means I can interact with
a real human voice.

I know, I know, supposedly this is all getting better, but so much of the
text produced by ChatGPT and similar tools today sounds like a McKinsey
consultant trying to sell war crimes to a marketing executive.  Yes, it's
precisely grammatical and well-structured English.  It's also sociopathic,
completely soulless, and almost impossible to concentrate on because it's
full of the sort of slippery phrases and opaque verbosity of a politician
trying to distract from some sort of major scandal.  I want to talk to
you, another human being, not to an LLM trained to sound like a corporate
web site.

-- 
Russ Allbery (r...@debian.org)  



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Tiago Bortoletto Vaz
Hi Jose,

Thanks for your input, I have a few comments:

On Fri, May 03, 2024 at 11:02:47AM -0300, Jose-Luis Rivas wrote:
> On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> > Right, note that they acknowledged this policy is a work in progress. Not
> > perfect, but 'something needed to be done, quickly'. It's hard to find a
> > balance here, but I kind of share this sense of urgency.
> >
> > [...]
> >
> > This point resonates with problems we might be facing already, for instance
> > in the NM process and also in Debconf submissions (there's no point of going
> > into details here because so far we can't prove anything, and even if we
> > could, of course we wouldn't bring any of the involved to the public arena).
> > So I'm actually more concerned about LLMs being mindlessly applied in our
> > communication processes (NM, bts, debconf, irc, planet, wiki, website,
> > debian.net stuff, etc) than about someone using some AI-assisted code in
> > our infra, at least for now.
> >
> 
> Hi Tiago,
> 
> It seems you have more context than the rest of us, which provides a sense
> of urgency for you, while others do not have this same information and
> can't share this sense of urgency.

Yes.

> If I were to assume based on the little context you shared, I would say
> there's someone doing an NM application using an LLM, answering stuff with
> an LLM and passing all their communications through LLMs.
> 
> In that case, there's even less point in making a policy about it, in my
> opinion, since, as you stated, you can't prove anything, and ultimately
> it would land in the hands of the people approving submissions or NMs to
> judge whether the person is qualified or not. And you can't block
> communications containing LLM-generated content when you can't even prove
> it's LLM-generated content. How would you enforce it?

Hmm, I tend to disagree here. Proof by investigation isn't the only way to get
at the truth of the situation. We can also get it by simply asking the person
whether they used an LLM to generate their work (be it an answer to NM
questions, a contribution to the Debian website, or an email to this mailing
list...). In that scenario, having a policy, a position statement or even a
gentle guideline would make a huge difference in the ongoing exchange.

> And I doubt a statement would do much either. What would be
> communicated? "Communications produced by LLMs are troublesome"? I don't
> know if there's much substance for a statement of that sort.

Just to set the scene a little for how I think about the issue: when I brought
up this discussion, I didn't have in mind someone evil attempting to use AI to
deliberately disrupt the project. We already know that policies or statements
are never sufficient to deal with people in that category. Rather, I see many
people (mostly younger contributors) who are coming to use LLMs in their daily
lives in a quite mindless way -- which of course is not our business if they do
so in their private lives. However, the issues that can arise from using this
kind of technology without much consideration in a community like Debian are
not obvious to everyone, and I don't expect every Debian contributor to have a
sufficiently good understanding of the matter, or the maturity, at the moment
they start contributing to the project. We can draw an analogy here with the
CoC and the Diversity Statement: they might seem quite obvious to some, and
less so to others.

So far I've felt a certain resistance to adopting something as sharp as what
Gentoo did (which I've already agreed with). However, I still have the feeling
that a position in the form of a statement or even a guideline could help us
both avoid and mitigate possible problems in the future.

Bests,

--
tvaz



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Stefano Zacchiroli
On Thu, May 02, 2024 at 08:21:28PM -0400, Tiago Bortoletto Vaz wrote:
> So I'm actually more concerned about LLMs being mindlessly applied in
> our communication processes (NM, bts, debconf, irc, planet, wiki,
> website, debian.net stuff, etc) than about someone using some
> AI-assisted code in our infra, at least for now.

On that front, useful "related work" are the policies that scientific
journals and conferences (which are exposed *a lot* to this, given their
main activity is vetting textual documents) have put in place about
this.

The general policy usually contains two main points (paraphrased below):

(1) You are free to use AI tools to *improve* your content, but not to
create it from scratch for you.

This point is particularly important for non-native English speakers,
who can benefit a lot more than natives from tool support for tasks
like proofreading/editing. I suspect the Debian community might be
particularly sensitive to this argument. (And note that on this one
the barrier between ChatGPT-based proofreading and other grammar/
style checkers will become more and more blurry in the future.)

(2) You need to disclose the fact you have used AI tools, and how you
have used them.

Exactly as in your case, Tiago, people managing scientific journals and
conferences have absolutely no way of checking if these rules are
respected or not. (They have access to large-scale plagiarism detection
tools, which is a related but different concern.) They just ask people
to *state* they followed this policy upon submission, but that's it.

If your main concern is people using LLMs or the like in some of the
processes you mention, a checkbox requiring such a statement upon
submission might go a longer way than a project-wide statement (which
will sit in d-d-a unknown to n-m applicants a few years from now).

Cheers
-- 
Stefano Zacchiroli . z...@upsilon.cc . https://upsilon.cc/zack  _. ^ ._
Full professor of Computer Science  o o   o \/|V|\/
Télécom Paris, Polytechnic Institute of Paris o o o   <\>
Co-founder & CTO Software Heritageo o o o   /\|^|/\
https://twitter.com/zacchiro . https://mastodon.xyz/@zacchiro   '" V "'



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Jose-Luis Rivas
On Thu May 2, 2024 at 9:21 PM -03, Tiago Bortoletto Vaz wrote:
> Right, note that they acknowledged this policy is a work in progress. Not
> perfect, but 'something needed to be done, quickly'. It's hard to find a
> balance here, but I kind of share this sense of urgency.
>
> [...]
>
> This point resonates with problems we might be facing already, for instance
> in the NM process and also in Debconf submissions (there's no point of going
> into details here because so far we can't prove anything, and even if we
> could, of course we wouldn't bring any of the involved to the public arena).
> So I'm actually more concerned about LLMs being mindlessly applied in our
> communication processes (NM, bts, debconf, irc, planet, wiki, website,
> debian.net stuff, etc) than about someone using some AI-assisted code in
> our infra, at least for now.
>

Hi Tiago,

It seems you have more context than the rest of us, which provides a sense
of urgency for you, while others do not have this same information and
can't share this sense of urgency.

If I were to assume based on the little context you shared, I would say
there's someone doing an NM application using an LLM, answering stuff with
an LLM and passing all their communications through LLMs.

In that case, there's even less point in making a policy about it, in my
opinion, since, as you stated, you can't prove anything, and ultimately
it would land in the hands of the people approving submissions or NMs to
judge whether the person is qualified or not. And you can't block
communications containing LLM-generated content when you can't even prove
it's LLM-generated content. How would you enforce it?

And I doubt a statement would do much either. What would be
communicated? "Communications produced by LLMs are troublesome"? I don't
know if there's much substance for a statement of that sort.

OTOH, an LLM-assisted rewrite of your own content may help non-native
English speakers write better and communicate more effectively. Hence,
saying "communications produced by LLMs are troublesome" would itself be
troublesome, since how can you, as the receiver, tell whether it is their
own content or someone else's?

Some may say "a statement could at least be used as a pointer to say
'these are our expectations regarding use of AI'", but ultimately is in
the hands of those judging to filter out or not. And if those judging
can't even prove if AI was used, what's the point?

I can't see the point of "something needs to be done" without clear
reasoning about what we expect that something to achieve.

--Jose



Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Andrey Rakhmatullin
On Fri, May 03, 2024 at 01:04:29PM +0900, Charles Plessy wrote:
> If I were to hear that other Debian developers use them in that context, I
> would seriously question whether there is any value in spending my
> volunteer time keeping debian/copyright files accurate to the level
> of detail our Policy asks for.
There is a popular old opinion unrelated to AI that there is not.

-- 
WBR, wRAR




Re: A policy on use of AI-generated content in Debian

2024-05-03 Thread Dominik George
Hi,

>> Generative AI tools **produce** derivatives of other people's copyrighted 
>> works.
>
>They *can* do that, but so can humans (and will). Humans look at a
>product or code and write new code that sometimes resembles the
>original very much.

Can I ask the LLM where it probably got its inspiration?

Can I show the LLM another work and ask it whether there is a chance it was
inspired by that work (and get an answer other than that it probably sucked in
everything, so yes)?

Is there a chance that the LLM not only read some source code, but also had
friendly interactions with the author?

Admittedly, at that point, we get into philosophical questions, which I don't 
consider any less important for free software.

-nik