Hi,

Ian Eure <i...@retrospec.tv> wrote:
> While this is what their paper claims[1], it doesn’t appear to be
> true, since I can see my own GPL’d code in the training set. I’ve
> since moved nearly all of my code off GitHub, but if you visit their
> "Am I in The Stack?" page[2] and enter my old username ("ieure"), you
> will see pretty much every repository I ever hosted there, including
> both unlicensed and GPL’d code.

That’s not my experience: I looked for Guix and Coreutils, both GPL’d,
both mirrored on GitHub, and neither of them is there.

> Some examples are hyperspace-el,
> nssh-el, tl1-mode, etc. While there aren’t LICENSE files in those
> repos, the file headers of all clearly indicate that they’re GPL’d.

Well, not providing a COPYING/LICENSE file isn’t helping either: file
headers may not be all that clear to a parser.

At any rate, even though I’m watching this LLM trend with discontent
like many in the free software world, I believe this discussion is
missing the point and shooting the messenger(s).

One of the three missions of SWH is to share code, much like
ftp.gnu.org. That’s all they did. Anyone can access the archive of
SWH, for any purpose.

HuggingFace trained “BigCode” on source SWH harvested from GitHub (a
subset of the SWH archive) and chose to abide by the principles put
forward by SWH in its Oct. 2023 statement. HuggingFace didn’t have to
do that; they could have acted like Microsoft and all the “AI”
companies and just scraped everything without asking anyone, be it
from SWH or from other sources.

There is no “Software Heritage problem” and really, that very phrase
and the accusatory tone in this thread are unwelcome and below our
standards for communication in Guix. This has gone too far.

This is not the place to further discuss the impact of using LLMs on
free software, and definitely not the place to throw unfounded
accusations.

Thanks,
Ludo’.