Hi Greg (CCing the "Wikimedia & GLAM collaboration" mailing list),

First, as there has been no reaction here yet: Congrats to you and Harvard
Law School Library on this release! A dataset of one million
high-quality-OCR public domain books sounds very impressive.

However, your message here, and in particular its highlighting of
*"time-bounded
Terms of Service that attempts to privilege open and noncommercial actors"*,
gives the distinct impression that you are unaware of some central aspects
of Wikipedia and the Wikimedia movement, or indeed the wider free-culture
movement as well. While the Wikimedia Foundation is indeed a nonprofit
organization, and Wikipedia and the other Wikimedia projects are indeed
noncommercial, they have never accepted content licenses or terms that are
confined to "open and noncommercial actors". So let me link some
explanatory material:

The Wikimedia Foundation's licensing policy
<https://foundation.wikimedia.org/wiki/Resolution:Licensing_policy> (which
governs the content on Wikipedia and all other Wikimedia projects) relies
on *a definition of "free content" that excludes licenses limited to
noncommercial usage*, like your terms are. Summarizing the rationales for
this long-standing decision would go too far here - if you are interested
in those, this
<https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/JUNXXJPIZRMCFAPNJEGXPPENCOS6DOQW/>
might be a good starting point. But to highlight one well-known problem
with such licenses (in particular the -NC variants of the Creative
Commons licenses), because it may help to illustrate some especially
problematic restrictions that you/your lawyers attempt to impose: People
have found out time and again that it is difficult to actually define
commercial usage, in a way that doesn't have unintended consequences. (E.g.
could a hobbyist blogger be sued for using an NC-licensed image because her
blog features some Google ads?) Creative Commons even ran a whole study in
an attempt to retroactively clarify such boundaries.

But in any case, despite these well-documented complications, legal
restrictions about the commercial *usage* of particular material still seem
more straightforward to figure out than the *restrictions on "intent" and
"affiliation" of the *user** that you (or Harvard's lawyers?) try to impose
in the terms of use for this release
<https://huggingface.co/datasets/institutional/institutional-books-1.0>:

"Open-source projects and other public-use efforts are welcome, even if
they may indirectly support commercial use, so long as they are
unaffiliated with commercial actors or intent."

Your requirement that an open source project must not even be "affiliated"
with "commercial ... intent" would likely exclude, say, the majority of
widely used (e.g. by Wikimedia organizations
<https://meta.wikimedia.org/wiki/FLOSS-Exchange>) open source software
projects, which are frequently either maintained by a commercial company,
or by volunteers who also have a related day job as a developer or may offer
paid support. Even the most anticapitalist purists in the free software
movement shy away from such restrictions in their licenses.
In any case, we can be pretty sure that your clause rules out the Wikimedia
Foundation, as it is not just "affiliated" with a commercial actor but has
one directly incorporated as a subsidiary, namely the for-profit Wikimedia
LLC. You don't seem to be aware of this, given that you came here with the
apparent impression that an offer to "privilege open and noncommercial
actors" may enable a cooperation.

The second clause of your terms
<https://x.com/tilmanbayer/status/1933311788688552165> (*"No
Redistribution"*) is likewise a non-starter for "open actors" - it is
almost the definition of non-open.

I do realize of course that there will be many AI/ML folks on HF and
elsewhere who are happy to use such a dataset while blissfully ignoring such
attempts to impose restrictions on public domain content, perhaps assuming
- possibly correctly - that you didn't think these terms of use through
very thoroughly and are thus unlikely to enforce them, or who are simply
not yet as familiar with the long-term effects of such legal footguns as
Wikimedians and FLOSS developers have become over many years. That said,
I've seen your terms cause consternation in the open AI/ML world too, e.g.
on the EleutherAI Discord.

You should also be aware that in the history of the Wikimedia movement
there have been some ugly *legal disputes with GLAMs* (galleries,
libraries, archives and museums, i.e. organizations like yours) who *attempted
to restrict reproduction of public domain works* in their possession with
similar rationales (i.e. an alleged need to extract revenue to refinance
digitization efforts or such, which I hear echoing in your vague remarks
about "sustainability", "ecosystem", etc.). Two examples:

   - https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_Wikimedia_Foundation_copyright_dispute

   - https://en.wikipedia.org/wiki/Reiss_Engelhorn_Museum#Wikimedia_lawsuit
   (While that museum prevailed in court against the Wikimedia Foundation, the
   EU Copyright Directive subsequently made such assertions of copyright over
   faithful reproductions of public domain works impossible.)

I'm not saying that the Institutional Books project is likely to become
similarly contentious (if only for the simple reason that Wikimedians have
long been importing the same underlying Google Books scans
<https://commons.wikimedia.org/wiki/Category:Scans_from_Google_Books>,
often to do their own OCR and proofreading on Wikisource). I'm just trying
to help you understand that the restrictions on public access that you
attempt to impose here under the label of "public-interest leverage" - i.e.
your own institution retaining control over the content so you can monetize
it - are likely to be seen as unacceptable by the open content movement.

Another point you should be aware of is that while Wikimedia volunteers
spend a lot of time diligently enforcing the copyrights of third parties
(by deleting infringing material uploaded to Wikimedia projects), they
*explicitly reject
<https://commons.wikimedia.org/wiki/Commons:Non-copyright_restrictions>
enforcing non-copyright terms* imposed by such third parties.

Lastly, a question:
You say here
<https://www.institutionaldatainitiative.org/posts/open-call-for-collaborators>
that you (the Institutional Data Initiative) are "one of the
Harvard-affiliated beneficiaries of OpenAI's new NextGenAI consortium". *Is
OpenAI also one of your customers* paying for privileged access to the
Institutional Books dataset (while your terms exclude the general public
from it for the time being)?
I'm not arguing that OpenAI is evil per se, or that academic institutions
and GLAMs must never collaborate with Big Tech companies. (After all,
Google Books, which your project is based on, was such a collaboration
between Big Tech and academic libraries in the first place. And many
Wikipedians can testify to its great value and usefulness for the general
public.) However, the obfuscatory language in your post here regarding
commercial partnerships and monetization ("garnering support from
commercial actors as we iterate on sustainability"), combined with vague
gesturing at a possible time-delayed free release at an undetermined point
in the future, doesn't exactly inspire trust in this matter. If the project
provides more transparent information about this question elsewhere, feel
free to provide pointers. It would also be interesting to learn how much
revenue the Institutional Data Initiative projects to derive from this
monetization of public domain works.

Regards, Tilman ([[User:HaeB]])

On Mon, Jul 7, 2025 at 7:32 AM Leppert, Greg <[email protected]>
wrote:

> Hi all. Great to meet you and thank you to Leila for inviting me to join
> the list. I’m the Executive Director of the Institutional Data Initiative<
> https://www.institutionaldatainitiative.org> (IDI) at Harvard and I
> wanted to share our recent data release—Institutional Books<
> https://www.institutionaldatainitiative.org/institutional-books>, a
> collection of nearly 1M public domain books, scanned at Harvard Library
> through the Google Books project.
>
> IDI works with libraries and other knowledge institutions to publish their
> collections as data with the goal of establishing public-interest leverage
> in the AI ecosystem while improving collections for traditional patron
> usage. With each project, we look for novel ways to structure and analyze
> the collection and set standards along the way. With Institutional Books,
> we tackled language analysis, topic classification, and OCR correction, and
> our technical report<https://arxiv.org/abs/2506.08300> has even more. We
> hope to evolve the collection over time and release new formats as we go,
> such as EPUB and Markdown.
>
> We’re also using this moment to experiment with a time-bounded Terms of
> Service that attempts to privilege open and noncommercial actors while
> garnering support from commercial actors as we iterate on sustainability.
> The goal is to eventually make the collection and all of its scans
> available under a more traditional open model.
>
> Thoughts, questions, and collaboration welcomed. We also have a Slack
> where we’re talking about this collection and others. Our next project is to
> dig in on a new collection of old newspapers, in collaboration with Boston
> Public Library, as we work toward building a global commons.
>
> —Greg
> _______________________________________________
> Wiki-research-l mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>
_______________________________________________
GLAM mailing list -- [email protected]
To unsubscribe send an email to [email protected]
