Re: Concerns/questions around Software Heritage Archive

Ian Eure Mon, 18 Mar 2024 13:18:13 -0700


Simon Tournier <zimon.touto...@gmail.com> writes:

Hi,
On Sat, 16 Mar 2024 at 08:52, Ian Eure <i...@retrospec.tv> wrote:
They appear to be using the archive to build LLMs:
https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
About LLM, Software Heritage made a clear statement:

    https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code

Quoting:
We feel that the question is no longer whether LLMs for code should be
built. They are already being built, independently of what we do, and
there is no turning back. The real question is how they should be built
and whom they should benefit.

Principles:
1. Knowledge derived from the Software Heritage archive must be given
   back to humanity, rather than monopolized for private gain. The
   resulting machine learning models must be made available under a
   suitable open license, together with the documentation and toolings
   needed to use them.
2. The initial training data extracted from the Software Heritage
   archive must be fully and precisely identified by, for example,
   publishing the corresponding SWHID identifiers (note that, in the
   context of Software Heritage, public availability of the initial
   training data is a given: anyone can obtain it from the archive).
   This will enable use cases such as: studying biases (fairness),
   verifying if a code of interest was present in the training data
   (transparency), and providing appropriate attribution when generated
   code bears resemblance to training data (credit), among others.
3. Mechanisms should be established, where possible, for authors to
   exclude their archived code from the training inputs before model
   training begins.

I hope it clarifies your concerns to some extent.


It doesn’t clarify them, but it does illustrate them.

HuggingFace and the StarCoder2 model are in violation of principle 2. By their own admission, they are including code without clear licensing[1]:

   The main difference between the Stack v2 and the Stack v1 is that we
   include both permissively licensed and unlicensed files.

HuggingFace’s StarChat2 Playground[2] also violates this principle, as it outputs code without any license or provenance information; I know, because I tried it. While their own terms of use for StarCoder2 state:

   Any use of all or part of the code gathered in The Stack v2 must
   abide by the terms of the original licenses...

...their own playground makes this impossible.

HuggingFace is also in violation of the third principle, because they haven’t established a functioning opt-out model[3]. Opting out requires using non-free software; requests have been sitting for nearly a year with no action or response; and out of every request submitted, only a single one has *ever* been honored.

They appear to be violating free software licenses on a large scale. They are in violation of SWH’s own positions.

Moreover, you wrote: « I want absolutely nothing to do with them. »
Maybe there is a misunderstanding on your side about what “free
software” and GPL mean because once “free software”, you cannot prevent
people from using “your” free software for any purposes you dislike.
If you want to bound the use cases of the software you create, you need
to explicitly specify that in the license. And if you do, your software
will not be considered as “free software”.

That’s the double-edged sword of “free software”. :-)

I am crystal clear on the meaning of free software. I wish to remove it from these models *in order to* keep it free.


Thanks,

 — Ian

[1]: https://arxiv.org/html/2402.19173v1

[2]: https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground

[3]: https://huggingface.co/datasets/bigcode/the-stack-v2
[4]: https://github.com/bigcode-project/opt-out-v2/issues
