Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Hi,

It's a good thing that https://wiki.gentoo.org/wiki/Project:Council/AI_policy has been voted on, and that it mentions:

> This motion can be revisited, should a case been made over such a tool that does not pose copyright, ethical and quality concerns.

I wanted to provide some meat for discussing improvements to the specific phrasing "created with the assistance of Natural Language Processing artificial intelligence tools", which may not be optimal.

First, I think we should not limit this to LLMs / NLP tools; it should be about all algorithmically/automatically generated content, any of which could cause a flood of time-wasting, low-quality information.

Second, I think we should define what would be acceptable use cases of algorithmically generated content. As a starting point, I'd suggest the combination of:

- The algorithm generating such content is proper F/LOSS.
- In the case of a machine learning algorithm, the dataset used to train it is proper F/LOSS itself (with traceability of all of its bits).
- The algorithm generating such content is reproducible (training produces the exact same bits).
- The algorithm did not publish the content automatically: all of the content was reviewed and approved by a human, who bears responsibility for their contribution, and the content has been flagged as having been generated using $tool.

Third, I think a "developer certificate of origin" policy could be augmented with the "bot did not publish the content automatically" bits, and should also be mandated in the context of bug reporting, so as to have a "human gate" for issues discovered by automation / tinderboxes.

Best regards,

-- 
Jérôme
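[Editorial illustration: the flagging plus DCO idea above could, for instance, take the shape of commit trailers. This is only a hypothetical sketch of a format; the trailer keys and tool name are invented and are not an existing Gentoo convention.]

```
Signed-off-by: Jane Doe <jane@example.org>
Generated-by: hypothetical-ebuild-tool 0.1
Human-reviewed: yes
```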
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 2024-02-27 at 15:45 +0100, Michał Górny wrote:
> Given the recent spread of the "AI" bubble, I think we really need to look into formally addressing the related concerns. In my opinion, at this point the only reasonable course of action would be to safely ban "AI"-backed contribution entirely. In other words, explicitly forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to create ebuilds, code, documentation, messages, bug reports and so on for use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do much about upstream projects using it.

Since I've been asked to flesh out a specific motion, here's what I propose specifically:

"""
It is expressly forbidden to contribute to Gentoo any content that has been created with the assistance of Natural Language Processing artificial intelligence tools. This motion can be revisited, should a case been made over such a tool that does not pose copyright, ethical and quality concerns.
"""

This explicitly covers all GPTs, including ChatGPT and Copilot, which is the category causing the most concern at the moment. At the same time, it doesn't block more specific applications of machine learning to problem solving.

Special thanks to Arthur Zamarin for consulting me on this.

-- 
Best regards,
Michał Górny
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Fri, 2024-03-01 at 07:06, Sam James wrote:
> Another person approached me after this RFC and asked whether tooling restricted to the current repo would be okay. For me, that'd be mostly acceptable, given it won't make suggestions based on copyrighted code.

I think an important question is: how is it restricted? Are we talking about a tool that was clearly trained on specific code, or about a tool that was trained on potentially copyrighted material, then artificially restricted to the repository (to paper over the concerns)? Can we trust the latter?

-- 
Best regards,
Michał Górny
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 2024-02-27 at 18:04, Sam James wrote:
> I'm a bit worried this is slightly performative - which is not a dig at you at all - given we can't really enforce it, and it requires honesty, but that's also not a reason to not try ;)

I don't think it's really possible or feasible to reliably detect such contributions, and even if it were, I don't think we want to go as far as to actively pursue anything that looks like one. The point of the policy is rather to make a statement that we don't want these, and to kindly ask users not to do that.

-- 
Best regards,
Michał Górny
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 27/2/24 at 15:45, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to look into formally addressing the related concerns. In my opinion, at this point the only reasonable course of action would be to safely ban "AI"-backed contribution entirely. In other words, explicitly forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to create ebuilds, code, documentation, messages, bug reports and so on for use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do much about upstream projects using it.

I think it would be a big mistake; in the end we would be shooting ourselves in the foot (I am using machine translation, so this may not come across the same in English). In the end it is a helping tool, and there is always human intervention to finish the job. We are going to have to live with AIs in every part of our lives; the sooner we know how to manage them, the more productive we will be.

> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around generated content is still unclear. What's pretty clear is that pretty much all LLMs are trained on huge corpora of copyrighted material, and all fancy "AI" companies don't give shit about copyright violations. In particular, there's a good risk that these tools would yield stuff we can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly looking bullshit. I suppose they can provide good assistance if you are careful enough, but we can't really rely on all our contributors being aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations don't give shit about copyright, and don't give shit about people. The AI bubble is causing huge energy waste. It is giving a great excuse for layoffs and increasing exploitation of IT workers. It is driving enshittification of the Internet, it is empowering all kinds of spam and scam.
>
> Gentoo has always stood out as something different, something that worked for people for whom mainstream distros were lacking. I think adding "made by real people" to the list of our advantages would be a good thing — but we need to have policies in place, to make sure shit doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tuesday, February 27th, 2024 at 3:45 PM, Michał Górny wrote:

> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to look into formally addressing the related concerns. In my opinion, at this point the only reasonable course of action would be to safely ban "AI"-backed contribution entirely. In other words, explicitly forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to create ebuilds, code, documentation, messages, bug reports and so on for use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't do much about upstream projects using it.
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around generated content is still unclear. What's pretty clear is that pretty much all LLMs are trained on huge corpora of copyrighted material, and all fancy "AI" companies don't give shit about copyright violations. In particular, there's a good risk that these tools would yield stuff we can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly looking bullshit. I suppose they can provide good assistance if you are careful enough, but we can't really rely on all our contributors being aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations don't give shit about copyright, and don't give shit about people. The AI bubble is causing huge energy waste. It is giving a great excuse for layoffs and increasing exploitation of IT workers. It is driving enshittification of the Internet, it is empowering all kinds of spam and scam.
>
> Gentoo has always stood out as something different, something that worked for people for whom mainstream distros were lacking. I think adding "made by real people" to the list of our advantages would be a good thing — but we need to have policies in place, to make sure shit doesn't flow in.
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>
> --
> Best regards,
> Michał Górny

While I understand the concerns that may have triggered the feeling that a rule like this is needed, as someone from the field of machine learning (an AI engineer), I feel I need to add my brief opinion.

The pkgxdev thing is very artificial, and if there is a threat to quality/integrity it will not manifest itself as obviously, which brings me to: a rule like this is just not enforceable. The contributor, as the one who signed off, is responsible for the quality of the contribution, whether it's been written with a plain editor, a dev environment with smart plugins (LSP), or their dog.

Other organizations have already had to deal with automated contributions, which can sometimes go wrong for all different kinds of reasons, for much longer, and their approach may be an inspiration:

[0] OpenStreetMap: automated edits - https://wiki.openstreetmap.org/wiki/Automated_Edits_code_of_conduct
[1] Wikipedia: bot policy - https://en.wikipedia.org/wiki/Wikipedia:Bot_policy

The AI that we are dealing with right now is just another means of automation, after all.

As a machine learning engineer myself, I was contemplating creating an instance of a generative model for my own use from my own data, in which case the copyright and ethical points would absolutely not apply. Also, there are ethically and copyright-OK language model projects, such as project Bergamot [2], vetted by universities and the EU, and also used by Mozilla [3] (one of the prominent ethical AI proponents). Banning all tools just because some might not be up to moral standards puts the ones that are at a disadvantage, in our world as a whole.

[2] Project Bergamot - https://browser.mt/
[3] Mozilla blog: training translation models - https://hacks.mozilla.org/2022/06/training-efficient-neural-network-models-for-firefox-translations/

- Martin
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Mar 05, 2024 at 06:12:06, Robin H. Johnson wrote:
> At the top, I noted that it will be possible in future for AI generation to be used in a good, safe way, and we should provide some signals to the researchers behind the AI industry on this matter.
>
> What should it have?
> - The output has correct license & copyright attributions for portions that are copyrightable.
> - The output explicitly disclaims copyright for uncopyrightable portions (yes, this is a higher bar than we set for humans today).
> - The output is provably correct (QA checks, actually running tests etc)
> - The output is free of non-functional/nonsense garbage.
> - The output is free of hallucinations (aka don't invent dependencies that don't exist).
>
> Can you please contribute other requirements that you feel "good" AI output should have?

- The output is not overly clever, even if correct. It should resemble something a reasonable human might write. For example, some contrived sequence of Bash parameter expansions vs. using sed.

- The output is succinct enough. This continues the "reasonable human" theme from above. For example, it should not first increment some value by 4, then 3, then 2, and finally 1, when incrementing by 10 right off the bat makes more sense.

- The output domain can be restricted in some form. Given a problem, some things are simply outside the space of valid answers. For example, `sudo rm -rf --no-preserve-root /` should never be a line that can be generated in the context of ebuilds.

- Simply enumerating restrictions should be considered intractable. While it may be trivial to create a list of forbidden words in the context of a basic family-friendly environment, how can you effectively guard against forbidden constructs when you might not know them all beforehand? For example, how do you define what constitutes "malicious output"?

- Oskari
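[Editorial illustration of the "not overly clever" point, not from the original mail: both snippets below strip the `_rcN`/`-rN` suffixes from a version string, but the chained parameter expansions make a reviewer puzzle out intermediate states, while the sed call names both suffixes in one place.]

```shell
pv="1.2.3_rc1-r2"

# "Clever": chained Bash parameter expansions, each stripping one suffix
tmp="${pv%-r[0-9]*}"      # drop the -rN revision, leaving 1.2.3_rc1
ver="${tmp%_rc[0-9]*}"    # drop the _rcN suffix, leaving 1.2.3

# Plainer: a single sed expression stating the intent directly
ver2="$(printf '%s' "$pv" | sed -E 's/(_rc[0-9]+)?(-r[0-9]+)?$//')"

echo "$ver $ver2"
```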
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
(Full disclosure: I presently work for a non-FAANG cloud company with a primary business focus on providing GPU access, for AI & other workloads; I don't feel that is a conflict of interest, but I understand that others might not feel the same way.)

Yes, we need to formally address the concerns. However, I don't come to the same conclusion about an outright ban. I think we need to:

1. Short-term, clearly point out why much of the present output would violate existing policies, esp. the low-grade garbage output.
2. Short & medium-term: a time-limited policy saying "no AI-backed works temporarily, while waiting for legal precedent", with clear guidelines about what exactly the blocking issues are.
3. Longer-term, produce a policy that shows how AI generation can be used for good, in a safe way.
4. Keep the human in the loop; no garbage reinforcing garbage.

Further points inline.

On Tue, Feb 27, 2024 at 03:45:17PM +0100, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to look into formally addressing the related concerns. In my opinion, at this point the only reasonable course of action would be to safely ban "AI"-backed contribution entirely. In other words, explicitly forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to create ebuilds, code, documentation, messages, bug reports and so on for use in Gentoo.

Are there footholds where you would see AI tooling as acceptable today? AI summarization of inputs, if correct & free of hallucinations, is likely to be of immediate value. I see this coming up in terms of analyzing code backtraces, as well as better license analysis tooling. The best tools here include citations that should be verified as to why the system thinks the outcome is correct: buyer beware if you don't verify the citations.

> Just to be clear, I'm talking about our "original" content. We can't do much about upstream projects using it.
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around generated content is still unclear. What's pretty clear is that pretty much all LLMs are trained on huge corpora of copyrighted material, and all fancy "AI" companies don't give shit about copyright violations. In particular, there's a good risk that these tools would yield stuff we can't legally use.

The Gentoo Foundation (and SPI) are both US legal entities. That means at least abiding by US copyright law. As of this writing, the US Copyright Office says AI-generated works are NOT eligible for their *own* copyright registration. The outputs are either uncopyrightable, or, if they are sufficiently similar to existing works, the original copyright stands (with license and authorship markings required). That's going to be a problem if the EU, UK & other major WIPO members come to a different conclusion, but for now, as a US-based organization, Gentoo has rules it must follow.

The fact that output *might* be uncopyrightable, and NOT tagged as such, gives me equal concern to the missing attribution & license statements. Enough untagged uncopyrightable material MAY invalidate larger copyrights.

Clearer definitions of the distinction between public domain and uncopyrightable are also required in our Gentoo documentation (at a high level: ineligible vs. not copyrighted vs. expired vs. laws/acts-of-government vs. works-of-government, but there is nuance).

> 2. Quality concerns. LLMs are really great at generating plausibly looking bullshit. I suppose they can provide good assistance if you are careful enough, but we can't really rely on all our contributors being aware of the risks.

100% agree; the quality of output is the largest concern *right now*. The consistency of output is strongly related: given similar inputs (including best practices not changing over time), it should give similar outputs.

How good must the output be to negate this concern? Current state of the art can probably write ebuilds with fewer QA violations than most contributors, esp. given automated QA checking tools for a positive reinforcement loop.

Besides the actual output being low-quality, the larger problem is that users submitting it don't realize that it's low-quality (or, in a few cases, don't care).

Gentoo's existing policies may only need tweaks & re-iteration here:
- GLEP 76 does not set out clear guidelines for uncopyrightable works.
- GLEP 76 should have a clarification that asserting GCO/DCO over AI-generated works at this time is not acceptable.

> 3. Ethical concerns. As pointed out above, the "AI" corporations don't give shit about copyright, and don't give shit about people. The AI bubble is causing huge energy waste. It is giving a great excuse for layoffs and increasing exploitation of IT workers. It is driving enshittification of the Internet, it is empowering all kinds of spam and scam.

Is an ethical AI entity possible?
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Matt Jolly writes:
>> But where do we draw the line? Are translation tools like DeepL allowed? I don't see much of a copyright issue for these.
>
> I'd also like to jump in and play devil's advocate. There's a fair chance that this is because I just got back from a supercomputing/research conf where LLMs were the hot topic in every keynote.
>
> As mentioned by Sam, this RFC is performative. Any users that are going to abuse LLMs are going to do it _anyway_, regardless of the rules. We already rely on common sense to filter these out; we're always going to have BS/Spam PRs and bugs - I don't really think that the content being generated by LLM is really any worse.
>
> This doesn't mean that I think we should blanket allow poor quality LLM contributions. It's especially important that we take into account the potential for bias, factual errors, and outright plagiarism when these tools are used incorrectly. We already have methods for weeding out low quality contributions and bad faith contributors - let's trust in these and see what we can do to strengthen these tools and processes.
>
> A bit closer to home for me, what about using LLMs as an assistive technology / to reduce boilerplate? I'm recovering from RSI - I don't know when (if...) I'll be able to type like I used to again. If a model is able to infer some mostly salvageable boilerplate from its context window, I'm going to use it and spend the effort I would have spent writing that to fix something else; an outright ban on LLM use will reduce my _ability_ to contribute to the project.

Another person approached me after this RFC and asked whether tooling restricted to the current repo would be okay. For me, that'd be mostly acceptable, given it won't make suggestions based on copyrighted code. I also don't have a problem with LLMs being used to help refine commit messages, as long as someone is being sensible about it (e.g. if, as in your situation, you know what you want to say but you can't type much). I don't know how to phrase a policy off the top of my head which allows those two things but not the rest.

> What about using a LLM for code documentation? Some models can do a passable job of writing decent quality function documentation and, in production, I _have_ caught real issues in my logic this way. Why should I type that out (and write what I think the code does rather than what it actually does) if an LLM can get 'close enough' and I only need to do light editing?

I suppose in that sense, it's the same as blindly listening to any linting tool or warning without understanding what it's flagging and whether it's correct.

> [...]
> As a final not-so-hypothetical, what about a LLM trained on Gentoo docs and repos, or more likely trained on exclusively open-source contributions and fine-tuned on Gentoo specifics? I'm in the process of spinning up several models at work to get a handle on the tech / turn more electricity into heat - this is a real possibility (if I can ever find the time).

I think that'd be interesting. It also does a good job as a rhetorical point wrt the policy being a bit too blanket here. See https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/ too.

> The cat is out of the bag when it comes to LLMs. In my real-world job I talk to scientists and engineers using these things (for their strengths) to quickly iterate on designs, to summarise experimental results, and even to generate testable hypotheses. We're only going to see increasing use of this technology going forward.
>
> TL;DR: I think this is a bad idea. We already have effective mechanisms for dealing with spam and bad faith contributions. Banning LLM use by Gentoo contributors at this point is just throwing the baby out with the bathwater.

The problem is that in FOSS, a lot of people are getting flooded with AI spam and therefore have little regard for any possibly-good parts of it. I count myself as part of that group - it's very much sludge, and I feel tired just seeing it talked about at the moment. Is that super rational? No, but we're also volunteers, and it's not unreasonable for said volunteers to then say "well, I don't want any more of that". I think this colours a lot of the responses here; it doesn't invalidate them, but it also explains why nobody is really interested in being open to this for now. Who can blame them (me included)?

> As an alternative, I'd be very happy with some guidelines for the use of LLMs and other assistive technologies, like "Don't use LLM code snippets unless you understand them", "Don't blindly copy and paste LLM output", or, my personal favourite, "Don't be a jerk to our poor bug wranglers".
>
> A blanket "No completely AI/LLM generated works" might be fine, too.
>
> Let's see how the legal issues shake out before we start pre-emptively banning useful tools. There's a lot of ongoing action in this space - at the very least I'd
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Hi,

> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358

Thank you for this, it made my day. Though I'm just a proxy maintainer for now, I also support this initiative; there should be some guard rails set up around LLM usage.

> 1. Copyright concerns. At this point, the copyright situation around generated content is still unclear. What's pretty clear is that pretty much all LLMs are trained on huge corpora of copyrighted material, and all fancy "AI" companies don't give shit about copyright violations. In particular, there's a good risk that these tools would yield stuff we can't legally use.

IANAL, but IMHO, if we stop respecting copyright law, even if indirectly via LLMs, why should we expect others to respect our licenses? It could be prudent to wait and see where this will land.

> 2. Quality concerns. LLMs are really great at generating plausibly looking bullshit. I suppose they can provide good assistance if you are careful enough, but we can't really rely on all our contributors being aware of the risks.

From my personal experience of using GitHub Copilot fine-tuned on a large private code base, it functions mostly okay as a smarter autocomplete on a single line of code, but when it comes to multiple lines of code, even filling out boilerplate, it's at best a 'meh'. The problem is that while the output looks okay-ish, it will often have subtle mistakes or will hallucinate some random additional stuff not relevant to the source file in question, so one ends up having to read and analyze the entire output of the LLM to fix problems with the code. I found that the mental and time overhead rarely makes it worth it, especially when a template can do a better job (e.g. this would be the case for ebuilds).
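[Editorial illustration: the kind of static ebuild template meant here is roughly the following skeleton; this is a rough sketch, and the actual templates shipped with editors or skel.ebuild differ in detail.]

```shell
# Copyright 1999-2024 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

EAPI=8

DESCRIPTION=""
HOMEPAGE=""
SRC_URI=""

LICENSE=""
SLOT="0"
KEYWORDS="~amd64"

DEPEND=""
RDEPEND="${DEPEND}"
```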
Since during reviews we are supposed to be reading the entire contribution, I'm not sure how much difference this makes; but I can see that a developer trusting an LLM too much might end up outsourcing the checking of the code to the reviewers, which means we need to be extra vigilant, and it could lead to reduced trust in contributions.

> 3. Ethical concerns. As pointed out above, the "AI" corporations don't give shit about copyright, and don't give shit about people. The AI bubble is causing huge energy waste. It is giving a great excuse for layoffs and increasing exploitation of IT workers. It is driving enshittification of the Internet, it is empowering all kinds of spam and scam.

I agree. I'm already tired of AI-generated blog spam and so forth; such a waste of time, and quite annoying. I'd rather not have that on our wiki pages too. The purpose of documenting things is to explain an area to someone new to it, or to write down the unique quirks of a setup or a system. Since LLMs cannot write new original things, just rehash information they have seen, I'm not sure how they could be helpful for this at all, to be honest. Overall, my time is too valuable to sift through AI-generated BS when I'm trying to solve a problem; I'd prefer we keep well curated, high quality documentation where possible.

Zoltan
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2/28/24 6:06 AM, Matt Jolly wrote:
>> But where do we draw the line? Are translation tools like DeepL allowed? I don't see much of a copyright issue for these.
>
> I'd also like to jump in and play devil's advocate. There's a fair chance that this is because I just got back from a supercomputing/research conf where LLMs were the hot topic in every keynote.
>
> As mentioned by Sam, this RFC is performative. Any users that are going to abuse LLMs are going to do it _anyway_, regardless of the rules. We already rely on common sense to filter these out; we're always going to have BS/Spam PRs and bugs - I don't really think that the content being generated by LLM is really any worse.
>
> This doesn't mean that I think we should blanket allow poor quality LLM contributions. It's especially important that we take into account the potential for bias, factual errors, and outright plagiarism when these tools are used incorrectly. We already have methods for weeding out low quality contributions and bad faith contributors - let's trust in these and see what we can do to strengthen these tools and processes.

Why is this an argument *against* a performative statement of intent? There are too many ways for bad faith contributors to maliciously engage with the community, and no one is proposing that we need to lay down rules forbidding every one of them. It is meaningful on its own to specify good faith rules that people should abide by in order to produce a smoother experience. And telling people that they are not supposed to do XXX is a good way to reduce the amount of low quality contributions that devs need to sift through...

> A bit closer to home for me, what about using LLMs as an assistive technology / to reduce boilerplate? I'm recovering from RSI - I don't know when (if...) I'll be able to type like I used to again. If a model is able to infer some mostly salvageable boilerplate from its context window, I'm going to use it and spend the effort I would have spent writing that to fix something else; an outright ban on LLM use will reduce my _ability_ to contribute to the project.

So by this appeal to emotion, you can claim anything is assistive technology, and therefore it should be allowed because it's discriminatory against the disabled if you don't allow it? Is there some special attribute of disabled persons that means they are exempted from copyright law?

What counts as assistive technology? Is it any technology that disabled persons use, or technology designed to bridge the gap for the disabled? If a disabled person uses vim because of its shortcuts, does that mean vim is "assistive technology" because someone used it to "assist" them?

... I somehow feel like I may have heard of assistive technology that assisted disabled persons in dictating their thoughts while avoiding physically stressful typing activities. It didn't involve having the "assistive technology" provide both the content and the typing, as that's not really *assisting*.

> In line with the above, if the concern is about code quality / potential for plagiarised code, what about indirect use of LLMs? Imagine a hypothetical situation where a contributor asks a LLM to summarise a topic and uses that knowledge to implement a feature. Is this now tainted / forbidden knowledge according to the Gentoo project?

Since your imagined hypothetical involves the use of copyrighted works by and from a person, which cannot be said to be derivative copyrighted works of the training data of the LLM -- for the same reason that reading an article in a handwritten, copyrighted journal about "a topic" to learn about that topic, and then writing software based on the ideas from the article, is not a *derivative copyrighted work* -- the answer is extremely trivially no.

The copyright issue with LLMs isn't that they ingest blog posts about how cool ebuilds are and use that knowledge to write ebuilds. The copyright issue with LLMs is that they ingest GitHub repos full of non-Gentoo ebuilds, copyrighted under who knows what license, and then regurgitate those ebuilds. That is *derivative works*. Prose summaries of generic topics are a good way to break the link when it comes to derived works; that doesn't have anything to do with LLMs. Nonetheless, any credible form of scholarship is going to demand that participants be well versed in where the line is between saying something in your own words with citation, and plagiarism.

> As a final not-so-hypothetical, what about a LLM trained on Gentoo docs and repos, or more likely trained on exclusively open-source contributions and fine-tuned on Gentoo specifics? I'm in the process of spinning up several models at work to get a handle on the tech / turn more electricity into heat - this is a real possibility (if I can ever find the time).

If you can state for a fact that you have done so, then clearly it's not a copyright violation. "exclusively
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Wed, Feb 28, 2024 at 1:50 PM Arthur Zamarin wrote:
> I know that GitHub Copilot can be limited to licenses, and even to just the current repository. Even then, I'm not sure that the copyright can be attributed to "me" and not the "AI" - so still gray area.

So, AI copyright is a bit of a poorly defined area, simply due to a lack of case law. I'm not all that confident that courts won't make an even bigger mess of it. There are half a dozen different directions I think a court might rule on the matter of authorship and derived works, but I think it is VERY unlikely that a court will rule that the copyright will be attributed to the AI itself, or that the AI itself ever was an author or held any legal rights to the work at any point in time. An AI is not a legal entity. The company that provides the service, its employees/developers, the end user, and the authors and copyright holders of works used to train the AI are all entities a court is likely to consider as having some kind of a role.

That said, we live in a world where it isn't even clear if APIs can be copyrighted, though in practice enforcing such a copyright might be impossible. It could be a while before AI copyright concerns are firmly settled. When they are, I suspect it will be done in a way that frustrates just about everybody on every side...

IMO the main risk to an organization (especially a transparent one like ours) from AI code isn't even whether it is copyrightable or not, but rather getting pulled into arguments and debates and possibly litigation over what is likely to be boilerplate code that needs a lot of cleanup anyway. Even if you "win" in court or the court of public opinion, the victory can be pyrrhic.

-- 
Rich
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 27/02/2024 16.45, Michał Górny wrote: > Hello, > > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. In my opinion, > at this point the only reasonable course of action would be to safely > ban "AI"-backed contribution entirely. In other words, explicitly > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to > create ebuilds, code, documentation, messages, bug reports and so on for > use in Gentoo. > > Just to be clear, I'm talking about our "original" content. We can't do > much about upstream projects using it. I support this motion. > > Rationale: > > 1. Copyright concerns. At this point, the copyright situation around > generated content is still unclear. What's pretty clear is that pretty > much all LLMs are trained on huge corpora of copyrighted material, and > all fancy "AI" companies don't give shit about copyright violations. > In particular, there's a good risk that these tools would yield stuff we > can't legally use. I know that GitHub Copilot can be limited by license, and even to just the current repository. Even then, I'm not sure that the copyright can be attributed to "me" rather than to the "AI" - so it's still a gray area. > 2. Quality concerns. LLMs are really great at generating plausibly > looking bullshit. I suppose they can provide good assistance if you are > careful enough, but we can't really rely on all our contributors being > aware of the risks. Let me tell a story. I wondered whether I could teach an LLM the ebuild format, as a possible helper tool for devs and non-devs. My prompt grew huge as I taught it everything about ebuilds, where to feed in the shared source code (eclasses), and so on. At one point it even managed to output a close-enough Python distutils-r1 ebuild - the same level that `vim dev-python/${PN}/${PN}-${PV}.ebuild` produces using the Gentoo template. Yes, all that work resulted in no gain. 
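[For context, the "template-level" distutils-r1 ebuild being described is roughly the following boilerplate - a sketch only; the package description, HOMEPAGE, and SRC_URI are hypothetical placeholders, not a real tree entry:]

```bash
# Copyright 2024 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

EAPI=8

# These must be set before the inherit line.
DISTUTILS_USE_PEP517=setuptools
PYTHON_COMPAT=( python3_{10..12} )

inherit distutils-r1

# Placeholder metadata - the parts any template fills in trivially.
DESCRIPTION="Short description copied from upstream"
HOMEPAGE="https://example.org/foo"
SRC_URI="https://example.org/foo/${P}.tar.gz"

LICENSE="MIT"
SLOT="0"
KEYWORDS="~amd64"
```

[This much is what skel.ebuild or an editor template already provides for free; the parts an author actually has to get right - LICENSE, KEYWORDS, dependencies - are exactly where the LLM output broke down.]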
For each other ebuild type - cmake, meson, go, rust - I always got a garbage ebuild. Yes, it generated a good DESCRIPTION and HOMEPAGE (simple stuff to copy from upstream) and even hit maybe 60% accuracy on LICENSE. But did you know we have an "intel80386" arch for KEYWORDS? That we can RESTRICT="install"? That we can use "^cat-pkg/pkg-1" syntax in deps? PATCHES with http URLs inside? And the list goes on. Sometimes it was even funny. So until a good prompt can be created for Gentoo, at which point we *might* reopen the discussion, I strongly support banning AI-generated ebuilds. Right now, good per-category templates, copying an existing ebuild as a starting point, or even just skel.ebuild - all three options give much better results and waste less developer time. > 3. Ethical concerns. As pointed out above, the "AI" corporations don't > give shit about copyright, and don't give shit about people. The AI > bubble is causing huge energy waste. It is giving a great excuse for > layoffs and increasing exploitation of IT workers. It is driving > enshittification of the Internet, it is empowering all kinds of spam > and scam. > Many companies that cite AI as a reason for layoffs are just constructing a justification out of bad faith or ignorance. The company I work at uses AI tools as a productivity boost, but at all levels of management they know that AI can't replace a person - best case it boosts them by 5-10%. The real reason for the current layoffs is budget tightening across the industry (just a normal cycle; it will get better soon), so management prefers to lay off others rather than themselves. So yeah, sad world. > > Gentoo has always stood out as something different, something that > worked for people for whom mainstream distros were lacking. I think > adding "made by real people" to the list of our advantages would be > a good thing — but we need to have policies in place, to make sure shit > doesn't flow in. 
> > Compare with the shitstorm at: > https://github.com/pkgxdev/pantry/issues/5358 > Great read, very much WTF. This whole repo is just a cluster of AIs competing against each other. -- Arthur Zamarin arthur...@gentoo.org Gentoo Linux developer (Python, pkgcore stack, Arch Teams, GURU) OpenPGP_signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Wed, 2024-02-28 at 11:08 +0100, Ulrich Mueller wrote: > > > > > > On Wed, 28 Feb 2024, Michał Górny wrote: > > > On Tue, 2024-02-27 at 21:05 -0600, Oskari Pirhonen wrote: > > > What about cases where someone, say, doesn't have an excellent grasp of > > > English and decides to use, for example, ChatGPT to aid in writing > > > documentation/comments (not code) and puts a note somewhere explicitly > > > mentioning what was AI-generated so that someone else can take a closer > > > look? > > > > > > I'd personally not be the biggest fan of this if it wasn't in something > > > like a PR or ml post where it could be reviewed before being made final. > > > But the most impportant part IMO would be being up-front about it. > > > I'm afraid that wouldn't help much. From my experiences, it would be > > less effort for us to help writing it from scratch, than trying to > > untangle whatever verbose shit ChatGPT generates. Especially that > > a person with poor grasp of the language could have trouble telling > > whether the generated text is actually meaningful. > > But where do we draw the line? Are translation tools like DeepL allowed? > I don't see much of a copyright issue for these. I have a strong suspicion that these translation tools are trained on copyrighted translations of books and other copyrighted material. -- Best regards, Michał Górny signature.asc Description: This is a digitally signed message part
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
> But where do we draw the line? Are translation tools like DeepL allowed? I don't see much of a copyright issue for these. I'd also like to jump in and play devil's advocate. There's a fair chance that this is because I just got back from a supercomputing/research conf where LLMs were the hot topic in every keynote. As mentioned by Sam, this RFC is performative. Any users that are going to abuse LLMs are going to do it _anyway_, regardless of the rules. We already rely on common sense to filter these out; we're always going to have BS/Spam PRs and bugs - I don't really think that content generated by an LLM is really any worse. This doesn't mean that I think we should blanket allow poor quality LLM contributions. It's especially important that we take into account the potential for bias, factual errors, and outright plagiarism when these tools are used incorrectly. We already have methods for weeding out low quality contributions and bad faith contributors - let's trust in these and see what we can do to strengthen these tools and processes. A bit closer to home for me, what about using LLMs as an assistive technology / to reduce boilerplate? I'm recovering from RSI - I don't know when (if...) I'll be able to type like I used to again. If a model is able to infer some mostly salvageable boilerplate from its context window, I'm going to use it and spend the effort I would have spent writing it on fixing something else; an outright ban on LLM use will reduce my _ability_ to contribute to the project. What about using an LLM for code documentation? Some models can do a passable job of writing decent quality function documentation and, in production, I _have_ caught real issues in my logic this way. Why should I type that out (and write what I think the code does rather than what it actually does) if an LLM can get 'close enough' and I only need to do light editing? 
In line with the above, if the concern is about code quality / potential for plagiarised code, what about indirect use of LLMs? Imagine a hypothetical situation where a contributor asks an LLM to summarise a topic and uses that knowledge to implement a feature. Is this now tainted / forbidden knowledge according to the Gentoo project? As a final not-so-hypothetical, what about an LLM trained on Gentoo docs and repos, or more likely trained on exclusively open-source contributions and fine-tuned on Gentoo specifics? I'm in the process of spinning up several models at work to get a handle on the tech / turn more electricity into heat - this is a real possibility (if I can ever find the time). The cat is out of the bag when it comes to LLMs. In my real-world job I talk to scientists and engineers using these things (for their strengths) to quickly iterate on designs, to summarise experimental results, and even to generate testable hypotheses. We're only going to see increasing use of this technology going forward. TL;DR: I think this is a bad idea. We already have effective mechanisms for dealing with spam and bad faith contributions. Banning LLM use by Gentoo contributors at this point is just throwing the baby out with the bathwater. As an alternative I'd be very happy with some guidelines for the use of LLMs and other assistive technologies like "Don't use LLM code snippets unless you understand them", "Don't blindly copy and paste LLM output", or, my personal favourite, "Don't be a jerk to our poor bug wranglers". A blanket "No completely AI/LLM generated works" might be fine, too. Let's see how the legal issues shake out before we start pre-emptively banning useful tools. There's a lot of ongoing action in this space - at the very least I'd like to see some thorough discussion of the legal issues separately if we're making a case for banning an entire class of technology. 
A Gentoo LLM project formed of experts who could actually provide good advice / some actual guidelines for LLM use within the project (and engaging some real-world legal advice) might be a good starting point. Are there any volunteers in the audience? Thanks for listening to my TED talk, Matt OpenPGP_0x50EC548D52E051C0.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 2024-02-27 at 15:45 +0100, Michał Górny wrote: > Hello, > > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. In my opinion, > at this point the only reasonable course of action would be to safely > ban "AI"-backed contribution entirely. In other words, explicitly > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to > create ebuilds, code, documentation, messages, bug reports and so on > for > use in Gentoo. > > Just to be clear, I'm talking about our "original" content. We can't > do > much about upstream projects using it. > > > Rationale: > > 1. Copyright concerns. At this point, the copyright situation around > generated content is still unclear. What's pretty clear is that > pretty > much all LLMs are trained on huge corpora of copyrighted material, and > all fancy "AI" companies don't give shit about copyright violations. > In particular, there's a good risk that these tools would yield stuff > we > can't legally use. > > 2. Quality concerns. LLMs are really great at generating plausibly > looking bullshit. I suppose they can provide good assistance if you > are > careful enough, but we can't really rely on all our contributors being > aware of the risks. > > 3. Ethical concerns. As pointed out above, the "AI" corporations > don't > give shit about copyright, and don't give shit about people. The AI > bubble is causing huge energy waste. It is giving a great excuse for > layoffs and increasing exploitation of IT workers. It is driving > enshittification of the Internet, it is empowering all kinds of spam > and scam. > > > Gentoo has always stood out as something different, something that > worked for people for whom mainstream distros were lacking. I think > adding "made by real people" to the list of our advantages would be > a good thing — but we need to have policies in place, to make sure > shit > doesn't flow in. 
> > Compare with the shitstorm at: > https://github.com/pkgxdev/pantry/issues/5358 > +1 Can we get this added to the agenda for the next council meeting?
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
> On Wed, 28 Feb 2024, Michał Górny wrote: > On Tue, 2024-02-27 at 21:05 -0600, Oskari Pirhonen wrote: >> What about cases where someone, say, doesn't have an excellent grasp of >> English and decides to use, for example, ChatGPT to aid in writing >> documentation/comments (not code) and puts a note somewhere explicitly >> mentioning what was AI-generated so that someone else can take a closer >> look? >> >> I'd personally not be the biggest fan of this if it wasn't in something >> like a PR or ml post where it could be reviewed before being made final. >> But the most impportant part IMO would be being up-front about it. > I'm afraid that wouldn't help much. From my experiences, it would be > less effort for us to help writing it from scratch, than trying to > untangle whatever verbose shit ChatGPT generates. Especially that > a person with poor grasp of the language could have trouble telling > whether the generated text is actually meaningful. But where do we draw the line? Are translation tools like DeepL allowed? I don't see much of a copyright issue for these. Ulrich [1] https://www.deepl.com/translator signature.asc Description: PGP signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 2024-02-27 at 21:05 -0600, Oskari Pirhonen wrote: > What about cases where someone, say, doesn't have an excellent grasp of > English and decides to use, for example, ChatGPT to aid in writing > documentation/comments (not code) and puts a note somewhere explicitly > mentioning what was AI-generated so that someone else can take a closer > look? > > I'd personally not be the biggest fan of this if it wasn't in something > like a PR or ml post where it could be reviewed before being made final. > But the most impportant part IMO would be being up-front about it. I'm afraid that wouldn't help much. From my experience, it would be less effort for us to help write it from scratch than to try to untangle whatever verbose shit ChatGPT generates. Especially since a person with a poor grasp of the language could have trouble telling whether the generated text is actually meaningful. -- Best regards, Michał Górny signature.asc Description: This is a digitally signed message part
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024 at 15:45:17 +0100, Michał Górny wrote: > Hello, > > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. In my opinion, > at this point the only reasonable course of action would be to safely > ban "AI"-backed contribution entirely. In other words, explicitly > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to > create ebuilds, code, documentation, messages, bug reports and so on for > use in Gentoo. > > Just to be clear, I'm talking about our "original" content. We can't do > much about upstream projects using it. > I agree. But for the sake of discussion: What about cases where someone, say, doesn't have an excellent grasp of English and decides to use, for example, ChatGPT to aid in writing documentation/comments (not code) and puts a note somewhere explicitly mentioning what was AI-generated so that someone else can take a closer look? I'd personally not be the biggest fan of this if it wasn't in something like a PR or ml post where it could be reviewed before being made final. But the most important part IMO would be being up-front about it. > > Rationale: > > 1. Copyright concerns. At this point, the copyright situation around > generated content is still unclear. What's pretty clear is that pretty > much all LLMs are trained on huge corpora of copyrighted material, and > all fancy "AI" companies don't give shit about copyright violations. > In particular, there's a good risk that these tools would yield stuff we > can't legally use. > I really dislike the lack of audit trail for where the bits and pieces come from. Not to mention the examples from early on where Copilot was filling in incorrect attribution. - Oskari signature.asc Description: PGP signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2/27/24 9:45 AM, Michał Górny wrote: > Hello, > > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. In my opinion, > at this point the only reasonable course of action would be to safely > ban "AI"-backed contribution entirely. In other words, explicitly > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to > create ebuilds, code, documentation, messages, bug reports and so on for > use in Gentoo. No constructive or valuable contributions will fall afoul of the new ban. Seems reasonable to me. -- Eli Schwartz OpenPGP_0x84818A6819AF4A9B.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tuesday, 27 February 2024 at 18:50:15 CET, Roy Bamford wrote: > On 2024.02.27 14:45, Michał Górny wrote: > > Hello, > > > > [...] > > > > Gentoo has always stood out as something different, something that > > worked for people for whom mainstream distros were lacking. I think > > adding "made by real people" to the list of our advantages would be > > a good thing — but we need to have policies in place, to make sure > > shit > > doesn't flow in. > > > > Compare with the shitstorm at: > > https://github.com/pkgxdev/pantry/issues/5358 > > Michał, > > An excellent piece of prose setting out the rationale. > I fully support it. I would like to add the following: Last year we had a chatbot in our Gentoo forum that posted 76 posts on 2023-12-19. An inexperienced moderator (me) then asked his colleagues on the basis of which forum rules we could ban this chatbot: "Do we have a rule somewhere that an AI and a chatbot are not allowed to log in? I have read our Guidelines ( https://forums.gentoo.org/viewtopic-t-525.html ) and found no such prohibition. On what basis could we even block a chatbot?" The answer from two experienced colleagues was that this is already covered by our forum rules, because chatbots usually cannot (yet) fulfill the requirements of a forum post and therefore violate our Guidelines. To be honest, I asked myself at the time what would happen if we had a clearly recognizable AI as a user that made (reasonably) sensible posts. We would then have no chance of banning this AI user without an explicit prohibition. I would be much more comfortable if we clearly communicated that we do not accept an AI as a user. Yes, I would also be very happy to see this proposal implemented. -- Best regards, Peter (aka pietinger)
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 24/02/27 07:07PM, Ulrich Mueller wrote: > > On Tue, 27 Feb 2024, Rich Freeman wrote: > > > On Tue, Feb 27, 2024 at 9:45 AM Michał Górny wrote: > >> > >> Given the recent spread of the "AI" bubble, I think we really need to > >> look into formally addressing the related concerns. > > First of all, I fully support mgorny's proposal. > > >> 1. Copyright concerns. > > > I do think it makes sense to consider some of this. > > > However, I feel like the proposal is redundant with the existing > > requirement to signoff on the DCO, which says: > > By making a contribution to this project, I certify that: > > 1. The contribution was created in whole or in part by me, and > I have the right to submit it under the free software license > indicated in the file; or > > 2. The contribution is based upon previous work that, to the best of > my knowledge, is covered under an appropriate free software license, > and I have the right under that license to submit that work with > modifications, whether created in whole or in part by me, under the > same free software license (unless I am permitted to submit under a > different license), as indicated in the file; or > > 3. The contribution is a license text (or a file of similar nature), > and verbatim distribution is allowed; or > > 4. The contribution was provided directly to me by some other person > who certified 1., 2., 3., or 4., and I have not modified it. > > I have been thinking about this aspect too. Certainly there is some > overlap with our GLEP 76 policy, but I don't think that it is redundant. > > I'd rather see it as a (much needed) clarification how to deal with AI > generated code. All the better if the proposal happens to agree with > policies that are already in place. > > Ulrich This is my interpretation of it as well, especially when it comes to para. 2: >>> 2. The contribution is based upon previous work that, to the best of >>> my knowledge, is covered under an appropriate free software license, >>> [...] 
It is extremely difficult (if not impossible) to verify this with some of these tools, and that's assuming that the user of these tools knows enough about how they work for this to even be a concern to them. I would argue it's best to stay away from these tools, at least until there is a clearer legal interpretation of their usage in relation to copyright. -- Kenton Groombridge Gentoo Linux Developer, SELinux Project signature.asc Description: PGP signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
> On Tue, 27 Feb 2024, Rich Freeman wrote: > On Tue, Feb 27, 2024 at 9:45 AM Michał Górny wrote: >> >> Given the recent spread of the "AI" bubble, I think we really need to >> look into formally addressing the related concerns. First of all, I fully support mgorny's proposal. >> 1. Copyright concerns. > I do think it makes sense to consider some of this. > However, I feel like the proposal is redundant with the existing > requirement to signoff on the DCO, which says: By making a contribution to this project, I certify that: 1. The contribution was created in whole or in part by me, and I have the right to submit it under the free software license indicated in the file; or 2. The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate free software license, and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same free software license (unless I am permitted to submit under a different license), as indicated in the file; or 3. The contribution is a license text (or a file of similar nature), and verbatim distribution is allowed; or 4. The contribution was provided directly to me by some other person who certified 1., 2., 3., or 4., and I have not modified it. I have been thinking about this aspect too. Certainly there is some overlap with our GLEP 76 policy, but I don't think that it is redundant. I'd rather see it as a (much needed) clarification of how to deal with AI-generated code. All the better if the proposal happens to agree with policies that are already in place. Ulrich signature.asc Description: PGP signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Michał Górny writes: > Hello, > > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. In my opinion, > at this point the only reasonable course of action would be to safely > ban "AI"-backed contribution entirely. In other words, explicitly > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to > create ebuilds, code, documentation, messages, bug reports and so on for > use in Gentoo. > > Just to be clear, I'm talking about our "original" content. We can't do > much about upstream projects using it. > I agree with the proposal, just some thoughts below. I'm a bit worried this is slightly performative - which is not a dig at you at all - given we can't really enforce it, and it requires honesty, but that's also not a reason to not try ;) > > Rationale: > > 1. Copyright concerns. At this point, the copyright situation around > generated content is still unclear. What's pretty clear is that pretty > much all LLMs are trained on huge corpora of copyrighted material, and > all fancy "AI" companies don't give shit about copyright violations. > In particular, there's a good risk that these tools would yield stuff we > can't legally use. > It also creates risk for anyone basing products or tools on Gentoo if we're not confident about the integrity / provenance of our work. > 2. Quality concerns. LLMs are really great at generating plausibly > looking bullshit. I suppose they can provide good assistance if you are > careful enough, but we can't really rely on all our contributors being > aware of the risks. > > 3. Ethical concerns. As pointed out above, the "AI" corporations don't > give shit about copyright, and don't give shit about people. The AI > bubble is causing huge energy waste. It is giving a great excuse for > layoffs and increasing exploitation of IT workers. It is driving > enshittification of the Internet, it is empowering all kinds of spam > and scam. 
> > > Gentoo has always stood out as something different, something that > worked for people for whom mainstream distros were lacking. I think > adding "made by real people" to the list of our advantages would be > a good thing — but we need to have policies in place, to make sure shit > doesn't flow in. > > Compare with the shitstorm at: > https://github.com/pkgxdev/pantry/issues/5358 signature.asc Description: PGP signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2024.02.27 14:45, Michał Górny wrote: > Hello, > > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. In my opinion, > at this point the only reasonable course of action would be to safely > ban "AI"-backed contribution entirely. In other words, explicitly > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to > create ebuilds, code, documentation, messages, bug reports and so on > for > use in Gentoo. > > Just to be clear, I'm talking about our "original" content. We can't > do > much about upstream projects using it. > > > Rationale: > > 1. Copyright concerns. At this point, the copyright situation around > generated content is still unclear. What's pretty clear is that > pretty > much all LLMs are trained on huge corpora of copyrighted material, and > all fancy "AI" companies don't give shit about copyright violations. > In particular, there's a good risk that these tools would yield stuff > we > can't legally use. > > 2. Quality concerns. LLMs are really great at generating plausibly > looking bullshit. I suppose they can provide good assistance if you > are > careful enough, but we can't really rely on all our contributors being > aware of the risks. > > 3. Ethical concerns. As pointed out above, the "AI" corporations > don't > give shit about copyright, and don't give shit about people. The AI > bubble is causing huge energy waste. It is giving a great excuse for > layoffs and increasing exploitation of IT workers. It is driving > enshittification of the Internet, it is empowering all kinds of spam > and scam. > > > Gentoo has always stood out as something different, something that > worked for people for whom mainstream distros were lacking. I think > adding "made by real people" to the list of our advantages would be > a good thing — but we need to have policies in place, to make sure > shit > doesn't flow in. 
> > Compare with the shitstorm at: > https://github.com/pkgxdev/pantry/issues/5358 > > -- > Best regards, > Michał Górny > > Michał, An excellent piece of prose setting out the rationale. I fully support it. -- Regards, Roy Bamford (Neddyseagoon) a member of elections gentoo-ops forum-mods arm64 pgp0BJ299ipp6.pgp Description: PGP signature
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024, at 08:45 CST, Michał Górny wrote: > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. In my opinion, > at this point the only reasonable course of action would be to safely > ban "AI"-backed contribution entirely. In other words, explicitly > forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to > create ebuilds, code, documentation, messages, bug reports and so on for > use in Gentoo. +1 > 2. Quality concerns. LLMs are really great at generating plausibly > looking bullshit. I suppose they can provide good assistance if you are > careful enough, but we can't really rely on all our contributors being > aware of the risks. This is my main concern, but all of the other points are valid as well. Best, Matthias
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024 at 9:45 AM Michał Górny wrote: > > Given the recent spread of the "AI" bubble, I think we really need to > look into formally addressing the related concerns. > 1. Copyright concerns. I do think it makes sense to consider some of this. However, I feel like the proposal is redundant with the existing requirement to signoff on the DCO, which says: >>> By making a contribution to this project, I certify that: >>> 1. The contribution was created in whole or in part by me, and >>> I have the right to submit it under the free software license >>> indicated in the file; or >>> 2. The contribution is based upon previous work that, to the best of >>> my knowledge, is covered under an appropriate free software license, >>> and I have the right under that license to submit that work with >>> modifications, whether created in whole or in part by me, under the >>> same free software license (unless I am permitted to submit under a >>> different license), as indicated in the file; or >>> 3. The contribution is a license text (or a file of similar nature), >>> and verbatim distribution is allowed; or >>> 4. The contribution was provided directly to me by some other person >>> who certified 1., 2., 3., or 4., and I have not modified it. Perhaps we ought to just re-advertise the policy that already exists? > 2. Quality concerns. As far as quality is concerned, I again share the concerns you raise, and I think we should just re-emphasize what many other industries are already making clear - that individuals are responsible for the quality of their contributions. Copy/pasting it blindly from an AI is no different from copy/pasting it from some other random website, even if it is otherwise legal. > 3. Ethical concerns. I think it is best to just avoid taking a stand on this. Our ethics are already documented in the Social Contract. I think everybody agrees that what is right and wrong is obvious and clear and universal. 
Then we're all shocked to find that large numbers of people have a universal perspective different from our own. Even if 90% of contributors agree with a particular position, if we start lopping off parts of our community 10% at a time we'll probably find ourselves alone in a room sooner or later. We can't make every hill the one to die on. > I think adding "made by real people" to the list of our advantages > would be a good thing Somehow I doubt this is going to help us steal market share from the numerous other popular source-based Linux distros. :) To be clear, I don't think it is a bad idea to just reiterate that we aren't looking for help from people who want to create scripts that pipe things into some GPT API and pipe the output into a forum, bug, issue, PR, or commit. I've seen other FOSS projects struggling with people trying to be "helpful" in this way. I just don't think any of this actually requires new policy. If we find our policy to be inadequate I think it is better to go back to the core principles and better articulate what we're trying to achieve, rather than adjust it to fit the latest fashions. -- Rich
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, Feb 27, 2024 at 03:45:17PM +0100, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on
> for use in Gentoo.

+1 from me; a clear stance before it really starts hitting Gentoo sounds good.

--
ionen
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tuesday, 27 February 2024 at 15:45:17 CET, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on
> for use in Gentoo.

Fully agree with and support this.

> Just to be clear, I'm talking about our "original" content. We can't
> do much about upstream projects using it.

[...] or implementing it. So, also, no objections to someone (a real person, by their own mental means) packaging AI software for Gentoo.

--
Andreas K. Hüttel
dilfri...@gentoo.org
Gentoo Linux developer
(council, toolchain, base-system, perl, libreoffice)
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Marek Szuba writes:
> On 2024-02-27 14:45, Michał Górny wrote:
>
>> In my opinion, at this point the only reasonable course of action
>> would be to safely ban "AI"-backed contribution entirely. In other
>> words, explicitly forbid people from using ChatGPT, Bard, GitHub
>> Copilot, and so on, to create ebuilds, code, documentation, messages,
>> bug reports and so on for use in Gentoo.
>
> I very much support this idea, for all the three reasons quoted.
>
>> 2. Quality concerns. LLMs are really great at generating plausibly
>> looking bullshit. I suppose they can provide good assistance if you
>> are careful enough, but we can't really rely on all our contributors
>> being aware of the risks.
>
> https://arxiv.org/abs/2211.03622
>
>> 3. Ethical concerns.
>
> ...yeah. Seeing as we failed to condemn the Russian invasion of
> Ukraine in 2022, I would probably avoid quoting this as a reason for
> banning LLM-generated contributions. Even though I do, as mentioned
> above, very much agree with this point.

That's not a technical topic, and we had an extended discussion about what to do in -core, which included the risks of making life difficult for Russian developers and contributors. I don't think that's a helpful intervention here, sorry.
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 2024-02-27 14:45, Michał Górny wrote:

> In my opinion, at this point the only reasonable course of action
> would be to safely ban "AI"-backed contribution entirely. In other
> words, explicitly forbid people from using ChatGPT, Bard, GitHub
> Copilot, and so on, to create ebuilds, code, documentation, messages,
> bug reports and so on for use in Gentoo.

I very much support this idea, for all the three reasons quoted.

> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you
> are careful enough, but we can't really rely on all our contributors
> being aware of the risks.

https://arxiv.org/abs/2211.03622

> 3. Ethical concerns.

...yeah. Seeing as we failed to condemn the Russian invasion of Ukraine in 2022, I would probably avoid quoting this as a reason for banning LLM-generated contributions. Even though I do, as mentioned above, very much agree with this point.

--
Marecki
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On Tue, 27 Feb 2024 at 15:21, Kenton Groombridge wrote:
>
> On 24/02/27 03:45PM, Michał Górny wrote:
> > Hello,
> >
> > Given the recent spread of the "AI" bubble, I think we really need
> > to look into formally addressing the related concerns. In my
> > opinion, at this point the only reasonable course of action would be
> > to safely ban "AI"-backed contribution entirely. In other words,
> > explicitly forbid people from using ChatGPT, Bard, GitHub Copilot,
> > and so on, to create ebuilds, code, documentation, messages, bug
> > reports and so on for use in Gentoo.
> >
> > Just to be clear, I'm talking about our "original" content. We
> > can't do much about upstream projects using it.
> >
> > Rationale:
> >
> > 1. Copyright concerns. At this point, the copyright situation
> > around generated content is still unclear. What's pretty clear is
> > that pretty much all LLMs are trained on huge corpora of copyrighted
> > material, and all fancy "AI" companies don't give shit about
> > copyright violations. In particular, there's a good risk that these
> > tools would yield stuff we can't legally use.
> >
> > 2. Quality concerns. LLMs are really great at generating plausibly
> > looking bullshit. I suppose they can provide good assistance if you
> > are careful enough, but we can't really rely on all our contributors
> > being aware of the risks.
> >
> > 3. Ethical concerns. As pointed out above, the "AI" corporations
> > don't give shit about copyright, and don't give shit about people.
> > The AI bubble is causing huge energy waste. It is giving a great
> > excuse for layoffs and increasing exploitation of IT workers. It is
> > driving enshittification of the Internet, it is empowering all kinds
> > of spam and scam.
> >
> > Gentoo has always stood out as something different, something that
> > worked for people for whom mainstream distros were lacking. I think
> > adding "made by real people" to the list of our advantages would be
> > a good thing — but we need to have policies in place, to make sure
> > shit doesn't flow in.
> >
> > Compare with the shitstorm at:
> > https://github.com/pkgxdev/pantry/issues/5358
> >
> > --
> > Best regards,
> > Michał Górny
>
> I completely agree.
>
> Your rationale hits the most important concerns I have about these
> technologies in open source. There is a significant opportunity for
> Gentoo to set the example here.
>
> --
> Kenton Groombridge
> Gentoo Linux Developer, SELinux Project

A thousand times yes.
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
On 24/02/27 03:45PM, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on
> for use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't
> do much about upstream projects using it.
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that
> pretty much all LLMs are trained on huge corpora of copyrighted
> material, and all fancy "AI" companies don't give shit about
> copyright violations. In particular, there's a good risk that these
> tools would yield stuff we can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you
> are careful enough, but we can't really rely on all our contributors
> being aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations
> don't give shit about copyright, and don't give shit about people.
> The AI bubble is causing huge energy waste. It is giving a great
> excuse for layoffs and increasing exploitation of IT workers. It is
> driving enshittification of the Internet, it is empowering all kinds
> of spam and scam.
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure
> shit doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358
>
> --
> Best regards,
> Michał Górny

I completely agree.

Your rationale hits the most important concerns I have about these technologies in open source. There is a significant opportunity for Gentoo to set the example here.

--
Kenton Groombridge
Gentoo Linux Developer, SELinux Project
Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Michał Górny writes:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns. In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely. In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on
> for use in Gentoo.
>
> Just to be clear, I'm talking about our "original" content. We can't
> do much about upstream projects using it.
>
> Rationale:
>
> 1. Copyright concerns. At this point, the copyright situation around
> generated content is still unclear. What's pretty clear is that
> pretty much all LLMs are trained on huge corpora of copyrighted
> material, and all fancy "AI" companies don't give shit about
> copyright violations. In particular, there's a good risk that these
> tools would yield stuff we can't legally use.
>
> 2. Quality concerns. LLMs are really great at generating plausibly
> looking bullshit. I suppose they can provide good assistance if you
> are careful enough, but we can't really rely on all our contributors
> being aware of the risks.
>
> 3. Ethical concerns. As pointed out above, the "AI" corporations
> don't give shit about copyright, and don't give shit about people.
> The AI bubble is causing huge energy waste. It is giving a great
> excuse for layoffs and increasing exploitation of IT workers. It is
> driving enshittification of the Internet, it is empowering all kinds
> of spam and scam.
>
> Gentoo has always stood out as something different, something that
> worked for people for whom mainstream distros were lacking. I think
> adding "made by real people" to the list of our advantages would be
> a good thing — but we need to have policies in place, to make sure
> shit doesn't flow in.
>
> Compare with the shitstorm at:
> https://github.com/pkgxdev/pantry/issues/5358

+1. All I've seen from "generative" (read: auto-plagiarizing) A"I" is spam and theft, and I fully intend to block it wherever my vote counts.

--
Arsen Arsenović
[gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Hello,

Given the recent spread of the "AI" bubble, I think we really need to look into formally addressing the related concerns. In my opinion, at this point the only reasonable course of action would be to safely ban "AI"-backed contribution entirely. In other words, explicitly forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to create ebuilds, code, documentation, messages, bug reports and so on for use in Gentoo.

Just to be clear, I'm talking about our "original" content. We can't do much about upstream projects using it.

Rationale:

1. Copyright concerns. At this point, the copyright situation around generated content is still unclear. What's pretty clear is that pretty much all LLMs are trained on huge corpora of copyrighted material, and all fancy "AI" companies don't give shit about copyright violations. In particular, there's a good risk that these tools would yield stuff we can't legally use.

2. Quality concerns. LLMs are really great at generating plausibly looking bullshit. I suppose they can provide good assistance if you are careful enough, but we can't really rely on all our contributors being aware of the risks.

3. Ethical concerns. As pointed out above, the "AI" corporations don't give shit about copyright, and don't give shit about people. The AI bubble is causing huge energy waste. It is giving a great excuse for layoffs and increasing exploitation of IT workers. It is driving enshittification of the Internet, it is empowering all kinds of spam and scam.

Gentoo has always stood out as something different, something that worked for people for whom mainstream distros were lacking. I think adding "made by real people" to the list of our advantages would be a good thing — but we need to have policies in place, to make sure shit doesn't flow in.

Compare with the shitstorm at:
https://github.com/pkgxdev/pantry/issues/5358

--
Best regards,
Michał Górny