On 3/27/26 9:27 AM, Luca Toniolo wrote:
Copilot doing statistical analysis on publicly available GPL code
is, if anything, less than what the GPL already explicitly permits.
Yes, as long as you abide by the license.
But LLMs do much more than just statistical analysis. LLMs generate
output from the training set and people are encouraged to use that output.
The problem is that LLMs are known to reproduce their input/training
data. They reproduce training/learned code and strip the GPL license
from that code. That is the real problem.
That we can't prevent these corporations from scraping and doing this
is simply how the Internet works. However, the fact that they did it
does not make it right or their use legal.
Mailing list archives have been indexed by Google, crawled by the
Wayback Machine, scraped by researchers, and read by recruiters for as
long as they've existed. Our commit messages, review comments, and
design discussions have been public and searchable for years. That was
true before Copilot, and it would remain true if we moved to GitLab,
Codeberg, or a self-hosted Gitea instance tomorrow. None of these
platforms prevent scraping.
It is not only about what is publicly visible on the site(s). It is
about how you use the site and the process by which you do things.
The information that is available *inside* github about you and what you
are doing is considerably more extensive than what can be viewed from
the public record.
The announcement from github makes, in principle, any and all data
subject to input into their LLMs. That I cannot accept, and I will
seriously consider my options.
GPL enforcement, even in clear-cut cases of actual license violation,
has historically been rare and difficult. The FSF and SFLC have pursued
only the most egregious cases, and even those took years. LinuxCNC
itself has never enforced the GPL against anyone.
The non-enforcement of copyright violations does _not_ make it alright
to become an infringer or to condone copyright infringement. Besides,
the cases that were enforced were victories for the GPL and made many an
infringer think twice or back off.
That is not to say that there aren't many uncaught infringers. There
are, and we should all discourage that wherever and however we can.
The idea of taking drastic action over something that may not even
constitute a violation seems disproportionate.
That is unsettled case law.
However, the action is not just taken over copyrights. The action would
also be taken to prevent a commercial entity from exploiting internal
insights they acquire from us using the site.
Besides, it sends a strong message that their (github's) behaviour will
result in users changing their ways.
If we migrate off GitHub, what do we actually gain? We lose CI
infrastructure that works, we lose contributor familiarity, we lose
discoverability for new contributors, we lose issue and PR history, and
we solve nothing, because the code was already scraped, the mailing
lists were already indexed,
We gain independence from a corporate entity controlling the
infrastructure and data we generate in development.
CI is not that difficult, but we'd need to rebuild. IMO a small price
for what we gain.
Commit history is in git. We can extract issues and PR data. You know,
scrape it? ;-)
Discoverability, hm... Use a search engine on the Internet: find
linuxcnc.org -> link to development. How difficult is that? Not that
we've been very active at promoting ourselves in the past 20 years or so...
and the next platform will face the same reality.
The next platform will not necessarily face that same reality. That is
why Codeberg is such a good option: they are a non-profit with a
stated goal to support and further FOSS
(https://docs.codeberg.org/getting-started/what-is-codeberg/).
--
Greetings Bertho
(disclaimers are disclaimed)
_______________________________________________
Emc-developers mailing list
https://lists.sourceforge.net/lists/listinfo/emc-developers