[Wiki-research-l] Re: Institutional Books

Leppert, Greg Mon, 25 Aug 2025 10:43:35 -0700

Thanks for sharing your thoughts, Tilman. I fear I should have mentioned that 
our goal isn’t to replicate or strictly adhere to existing movements. Given 
that I’m new to the list, I wouldn’t be surprised to learn that that means my 
original message was off topic or, at the least, confusing. Apologies if so! 
Regardless, I appreciate you taking the time to articulate your perspective.


—Greg

On Aug 24, 2025, at 2:41 AM, Tilman Bayer <[email protected]> wrote:


Hi Greg (CCing the "Wikimedia & GLAM collaboration" mailing list),

First, as there has been no reaction here yet: Congrats to you and Harvard Law 
School Library on this release! A dataset of one million high-quality-OCR 
public domain books sounds very impressive.

However, your message here, and in particular its highlighting of "time-bounded 
Terms of Service that attempts to privilege open and noncommercial actors", 
give the distinct impression that you are unaware of some central aspects of 
Wikipedia and the Wikimedia movement, or indeed the wider free-culture movement 
as well. While the Wikimedia Foundation is indeed a nonprofit organization, and 
Wikipedia and the other Wikimedia projects are indeed noncommercial, they have 
never accepted content licenses or terms that are confined to "open and 
noncommercial actors". So let me link some explanatory material:

The Wikimedia Foundation's licensing 
policy<https://foundation.wikimedia.org/wiki/Resolution:Licensing_policy>
 (which governs the content on Wikipedia and all other Wikimedia projects) 
relies on a definition of "free content" that excludes licenses limited to 
noncommercial usage, like your terms are. Summarizing the rationales for this 
long-standing decision would go too far here - if you are interested in those, 
this<https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/thread/JUNXXJPIZRMCFAPNJEGXPPENCOS6DOQW/>
 might be a good starting point. But to highlight one well-known problem with 
such licenses (in particular the -NC variants of the Creative Commons 
licenses), because it may help to illustrate some especially problematic 
restrictions that you/your lawyers attempt to impose: People have found out 
time and again that it is difficult to actually define commercial usage, in a 
way that doesn't have unintended consequences. (E.g. could a hobbyist blogger 
be sued for using an NC-licensed image because her blog features some Google 
ads?) Creative Commons even ran a whole study in an attempt to retroactively 
clarify such boundaries.

But in any case, despite these well-documented complications, legal 
restrictions about the commercial *usage* of particular material still seem 
more straightforward to figure out than the restrictions on "intent" and 
"affiliation" of the *user* that you (or Harvard's lawyers?) try to impose in 
the terms of use for this 
release<https://huggingface.co/datasets/institutional/institutional-books-1.0>:
"Open-source projects and other public-use efforts are welcome, even if they 
may indirectly support commercial use, so long as they are unaffiliated with 
commercial actors or intent."
Your requirement that an open source project must not even be "affiliated" with 
"commercial ... intent", would likely exclude, say, the majority of widely used 
(e.g. by Wikimedia 
organizations<https://meta.wikimedia.org/wiki/FLOSS-Exchange>)
 open source software projects, which are frequently either maintained by a 
commercial company, or by volunteers who also have a related day job as a 
developer or may offer paid support. Even the most anticapitalist purists in 
the free software movement shy away from such restrictions in their licenses.
In any case, we can be pretty sure that your clause rules out the Wikimedia 
Foundation, as it is not just "affiliated" with a commercial actor but has one 
directly incorporated as a subsidiary, namely the for-profit Wikimedia LLC. You 
don't seem to be aware of this, given that you came here with the apparent 
impression that an offer to "privilege open and noncommercial actors" may 
enable a cooperation.

The second clause of your 
terms<https://x.com/tilmanbayer/status/1933311788688552165>
 ("No Redistribution") is likewise a non-starter for "open actors" - it is 
almost the definition of non-open.

I do realize of course that there will be many AI/ML folks on HF and elsewhere 
who are happy to use such a dataset while blissfully ignoring such attempts to 
impose restrictions on public domain content, perhaps assuming - possibly 
correctly - that you didn't think these terms of use through very thoroughly 
and are thus unlikely to enforce them, or who are simply not yet as familiar 
with the long-term effects of such legal footguns as Wikimedians and FLOSS 
developers have become over many years. That said, I've seen your terms cause 
consternation in the open AI/ML world too, e.g. on the EleutherAI Discord.

You should also be aware that in the history of the Wikimedia movement there 
have been some ugly legal disputes with GLAMs (galleries, libraries, 
archives and museums, i.e. organizations like yours) who attempted to restrict 
reproduction of public domain works in their possession with similar rationales 
(i.e. an alleged need to extract revenue to refinance digitization efforts or 
such, which I hear echoing in your vague remarks about "sustainability", 
"ecosystem", etc.). Two examples:

  *   https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_Wikimedia_Foundation_copyright_dispute
  *   https://en.wikipedia.org/wiki/Reiss_Engelhorn_Museum#Wikimedia_lawsuit
 (While that museum prevailed in court against the Wikimedia Foundation, the EU 
Copyright Directive subsequently made such assertions of copyright over 
faithful reproductions of public domain works impossible.)

I'm not saying that the Institutional Books project is likely to become 
similarly contentious (if only for the simple reason that Wikimedians have long 
been importing the same underlying Google Books 
scans<https://commons.wikimedia.org/wiki/Category:Scans_from_Google_Books>,
 often to do their own OCR and proofreading on Wikisource). I'm just trying to 
help you understand that the restrictions on public access that you attempt to 
impose here under the label of "public-interest leverage" - i.e. your own 
institution retaining control over the content so you can monetize it - are 
likely to be seen as unacceptable by the open content movement.

Another point you should be aware of is that while Wikimedia volunteers spend a 
lot of time diligently enforcing the copyrights of third parties (by deleting 
infringing material uploaded to Wikimedia projects), they explicitly 
reject<https://commons.wikimedia.org/wiki/Commons:Non-copyright_restrictions>
 enforcing non-copyright terms imposed by such third parties.

Lastly, a question:
You say 
here<https://www.institutionaldatainitiative.org/posts/open-call-for-collaborators>
 that you (the Institutional Data Initiative) are "one of the 
Harvard-affiliated beneficiaries of OpenAI's new NextGenAI consortium". Is 
OpenAI also one of your customers paying for privileged access to the 
Institutional Books dataset (while your terms exclude the general public from 
it for the time being)?
I'm not arguing that OpenAI is evil per se, or that academic institutions and 
GLAMs must never collaborate with Big Tech companies. (After all, Google Books, 
which your project is based on, was such a collaboration between Big Tech and 
academic libraries in the first place. And many Wikipedians can testify to its 
great value and usefulness for the general public.) However, the obfuscatory 
language in your post here regarding commercial partnerships and monetization 
("garnering support from commercial actors as we iterate on sustainability"), 
combined with vague gesturing at a possible time-delayed free release at an 
undetermined point in the future, doesn't exactly inspire trust in this matter. 
If the project provides more transparent information about this question 
elsewhere, feel free to provide pointers. It would also be interesting to learn 
how much revenue the Institutional Data Initiative projects to derive from this 
monetization of public domain works.

Regards, Tilman ([[User:HaeB]])

On Mon, Jul 7, 2025 at 7:32 AM Leppert, Greg 
<[email protected]> wrote:
Hi all. Great to meet you and thank you to Leila for inviting me to join the 
list. I’m the Executive Director of the Institutional Data 
Initiative<https://www.institutionaldatainitiative.org>
 (IDI) at Harvard and I wanted to share our recent data release—Institutional 
Books<https://www.institutionaldatainitiative.org/institutional-books>,
 a collection of nearly 1M public domain books, scanned at Harvard Library 
through the Google Books project.

IDI works with libraries and other knowledge institutions to publish their 
collections as data with the goal of establishing public-interest leverage in 
the AI ecosystem while improving collections for traditional patron usage. With 
each project, we look for novel ways to structure and analyze the collection 
and set standards along the way. With Institutional Books, we tackled language 
analysis, topic classification, and OCR correction, and our technical 
report<https://arxiv.org/abs/2506.08300>
 has even more. We hope to evolve the collection over time and release new 
formats as we go, such as EPUB and Markdown.

We’re also using this moment to experiment with a time-bounded Terms of Service 
that attempts to privilege open and noncommercial actors while garnering 
support from commercial actors as we iterate on sustainability. The goal is to 
eventually make the collection and all of its scans available under a more 
traditional open model.

Thoughts, questions, and collaboration welcomed. We also have a Slack where 
we’re talking about this collection and others. Our next project is to dig in on 
a new collection of old newspapers, in collaboration with Boston Public 
Library, as we work toward building a global commons.

—Greg
_______________________________________________
Wiki-research-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
