Hi,

Ian Eure <i...@retrospec.tv> writes:

> While this is what their paper claims[1], it doesn’t appear to be
> true, since I can see my own GPL’d code in the training set.  I’ve
> since moved nearly all of my code off GitHub, but if you visit their
> "Am I in The Stack?" page[2] and enter my old username ("ieure"), you
> will see pretty much every repository I ever hosted there, including
> both unlicensed and GPL’d code.

That’s not my experience: I looked for Guix and Coreutils, both GPL’d,
both mirrored on GitHub, and none of it is there.

> Some examples are hyperspace-el,
> nssh-el, tl1-mode, etc.  While there aren’t LICENSE files in those
> repos, the file headers of all clearly indicate that they’re GPL’d.

Well, not providing a COPYING/LICENSE file isn’t helping either: file
headers may not be all that clear to a parser.


At any rate, even though I’m watching this LLM trend with discontent
like many in the free software world, I believe this discussion is
missing the point and shooting the messenger(s).

One of the three missions of SWH is to share code—much like ftp.gnu.org.
That’s all they did.  Anyone can access the archive of SWH, for any
purpose.

HuggingFace trained “BigCode” on source SWH harvested from GitHub (a
subset of the SWH archive) and chose to abide by the principles put
forward by SWH in its Oct. 2023 statement.  HuggingFace didn’t have to
do that; they could have acted like Microsoft and all the “AI” companies
and just scraped everything without asking anyone, be it from SWH or from
other sources.


There is no “Software Heritage problem” and really, that very phrase and
the accusatory tone in this thread are unwelcome and below our standards
for communication in Guix.  This has gone too far.  This is not the
place to further discuss the impact of using LLMs on free software, and
definitely not the place to throw unfounded accusations.

Thanks,
Ludo’.
