Simon Tournier <zimon.touto...@gmail.com> writes:
Hi,
On Sat, 16 Mar 2024 at 08:52, Ian Eure <i...@retrospec.tv> wrote:
They appear to be using the archive to build LLMs:
https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/
About LLM, Software Heritage made a clear statement:
https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code
Quoting:
We feel that the question is no longer whether LLMs for code should be
built. They are already being built, independently of what we do, and
there is no turning back. The real question is how they should be built
and whom they should benefit.
Principles:
1. Knowledge derived from the Software Heritage archive must be given
back to humanity, rather than monopolized for private gain. The
resulting machine learning models must be made available under a
suitable open license, together with the documentation and toolings
needed to use them.
2. The initial training data extracted from the Software Heritage
archive must be fully and precisely identified by, for example,
publishing the corresponding SWHID identifiers (note that, in the
context of Software Heritage, public availability of the initial
training data is a given: anyone can obtain it from the archive). This
will enable use cases such as: studying biases (fairness), verifying if
a code of interest was present in the training data (transparency), and
providing appropriate attribution when generated code bears resemblance
to training data (credit), among others.
3. Mechanisms should be established, where possible, for authors to
exclude their archived code from the training inputs before model
training begins.
I hope it clarifies your concerns to some extent.
It doesn’t clarify them, but it does illustrate them.
HuggingFace and the StarCoder2 model are in violation of principle
2. By their own admission, they are including code without clear
licensing[1]:
The main difference between the Stack v2 and the Stack v1 is that we
include both permissively licensed and unlicensed files.
HuggingFace’s StarChat2 Playground[2] also violates this
principle, as it outputs code without any license or provenance
information; I know, because I tried it. While their own terms of
use for StarCoder2 state:
Any use of all or part of the code gathered in The Stack v2 must abide
by the terms of the original licenses...
...their own playground makes this impossible.
HuggingFace is also in violation of the third principle, because
they haven’t established a functioning opt-out model[3]. Opting
out requires using non-free software; requests have been sitting
for nearly a year with no action or response[4]; and out of every
request submitted, only a single one has *ever* been honored.
They appear to be violating free software licenses on a large scale.
They are in violation of SWH’s own positions.
Moreover, you wrote: « I want absolutely nothing to do with them. »
Maybe there is a misunderstanding on your side about what “free
software” and the GPL mean, because once software is “free software”,
you cannot prevent people from using “your” free software for any
purpose you dislike.
If you want to restrict the use cases of the software you create, you
need to explicitly specify that in the license. And if you do, your
software will not be considered “free software”.
That’s the double-edged sword of “free software”. :-)
I am crystal clear on the meaning of free software. I wish to
remove it from these models *in order to* keep it free.
Thanks,
— Ian
[1]: https://arxiv.org/html/2402.19173v1
[2]: https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground
[3]: https://huggingface.co/datasets/bigcode/the-stack-v2
[4]: https://github.com/bigcode-project/opt-out-v2/issues