[Wiki-research-l] Re: Institutional Books

Ziko van Dijk Mon, 25 Aug 2025 11:05:31 -0700

Hello Tilman,

Thank you for your explanations concerning NC. This reminded me of a
discussion during the "Strategy" talks, when the strategy group tried to
introduce NC and ND provisions in the Wikimedia wikis. The negative
reactions were... quite notable.
https://meta.wikimedia.org/wiki/Talk:Movement_Strategy/Recommendations/Iteration_1/Diversity/9


Kind regards
Ziko


On Mon, 25 Aug 2025 at 19:43, Leppert, Greg <
[email protected]> wrote:

> Thanks for sharing your thoughts, Tilman. I fear I should have mentioned
> that our goal isn’t to replicate or strictly adhere to existing movements.
> Given that I’m new to the list, I wouldn’t be surprised to learn that that
> means my original message was off topic or, at the least, confusing.
> Apologies if so! Regardless, I appreciate you taking the time to articulate
> your perspective.
>
> —Greg
>
> On Aug 24, 2025, at 2:41 AM, Tilman Bayer <[email protected]> wrote:
>
> 
> Hi Greg (CCing the "Wikimedia & GLAM collaboration" mailing list),
>
> First, as there has been no reaction here yet: Congrats to you and Harvard
> Law School Library on this release! A dataset of one million
> high-quality-OCR public domain books sounds very impressive.
>
> However, your message here, and in particular its highlighting of
> "time-bounded Terms of Service that attempts to privilege open and
> noncommercial actors", give the distinct impression that you are unaware of
> some central aspects of Wikipedia and the Wikimedia movement, or indeed the
> wider free-culture movement as well. While the Wikimedia Foundation is
> indeed a nonprofit organization, and Wikipedia and the other Wikimedia
> projects are indeed noncommercial, they have never accepted content
> licenses or terms that are confined to "open and noncommercial actors". So
> let me link some explanatory material:
>
> The Wikimedia Foundation's licensing policy<
> https://foundation.wikimedia.org/wiki/Resolution:Licensing_policy>
> (which governs the content on Wikipedia and all other Wikimedia projects)
> relies on a definition of "free content" that excludes licenses limited to
> noncommercial usage, like your terms are. Summarizing the rationales for
> this long-standing decision would go too far here - if you are interested
> in those, this<
> https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/thread/JUNXXJPIZRMCFAPNJEGXPPENCOS6DOQW/>
> might be a good starting point. But to highlight one well-known problem
> with such licenses (in particular the -NC variants of the Creative
> Commons licenses), because it may help to illustrate some especially
> problematic restrictions that you/your lawyers attempt to impose: People
> have found out time and again that it is difficult to actually define
> commercial usage, in a way that doesn't have unintended consequences. (E.g.
> could a hobbyist blogger be sued for using an NC-licensed image because her
> blog features some Google ads?) Creative Commons even ran a whole study in
> an attempt to retroactively clarify such boundaries.
>
> But in any case, despite these well-documented complications, legal
> restrictions about the commercial *usage* of particular material still seem
> more straightforward to figure out than the restrictions on "intent" and
> "affiliation" of the *user* that you (or Harvard's lawyers?) try to impose
> in the terms of use for this release<
> https://huggingface.co/datasets/institutional/institutional-books-1.0
> >:
> "Open-source projects and other public-use efforts are welcome, even if
> they may indirectly support commercial use, so long as they are
> unaffiliated with commercial actors or intent."
> Your requirement that an open source project must not even be "affiliated"
> with "commercial ... intent", would likely exclude, say, the majority of
> widely used (e.g. by Wikimedia organizations<
> https://meta.wikimedia.org/wiki/FLOSS-Exchange>)
> open source software projects, which are frequently either maintained by a
> commercial company, or by volunteers who also have a related day job as
> developer or may offer paid support. Even the most anticapitalist purists
> in the free software movement shy away from such restrictions in their
> licenses.
> In any case, we can be pretty sure that your clause rules out the
> Wikimedia Foundation, as it is not just "affiliated" with a commercial
> actor but has one directly incorporated as a subsidiary, namely the
> for-profit Wikimedia LLC. You don't seem to be aware of this, given that
> you came here with the apparent impression that an offer to "privilege open
> and noncommercial actors" may enable a cooperation.
>
> The second clause of your terms<
> https://x.com/tilmanbayer/status/1933311788688552165>
> ("No Redistribution") is likewise a non-starter for "open actors" - it is
> almost the definition of non-open.
>
> I do realize of course that there will be many AI/ML folks on HF and
> elsewhere who are happy to use such a dataset while blissfully ignoring such
> attempts to impose restrictions on public domain content, perhaps assuming
> - possibly correctly - that you didn't think these terms of use through
> very thoroughly and are thus unlikely to enforce them, or who are simply
> not yet as familiar with the long-term effects of such legal footguns as
> Wikimedians and FLOSS developers have become over many years. That said,
> I've seen your terms cause consternation in the open AI/ML world too, e.g.
> on the EleutherAI Discord.
>
> You should also be aware that in the history of the Wikimedia movement
> there have been some ugly legal disputes with GLAMs (galleries,
> libraries, archives and museums, i.e. organizations like yours) who
> attempted to restrict reproduction of public domain works in their
> possession with similar rationales (i.e. an alleged need to extract revenue
> to refinance digitization efforts or such, which I hear echoing in your
> vague remarks about "sustainability", "ecosystem", etc.). Two examples:
>
>   *
> https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_Wikimedia_Foundation_copyright_dispute
>   *
> https://en.wikipedia.org/wiki/Reiss_Engelhorn_Museum#Wikimedia_lawsuit
> (While that museum prevailed in court against the Wikimedia Foundation, the
> EU Copyright Directive subsequently made such assertions of copyright over
> faithful reproductions of public domain works impossible.)
>
> I'm not saying that the Institutional Books project is likely to become
> similarly contentious (if only for the simple reason that Wikimedians have
> long already been importing the same underlying Google Books scans<
> https://commons.wikimedia.org/wiki/Category:Scans_from_Google_Books>,
> often to do their own OCR and proofreading on Wikisource). I'm just trying
> to help you understand that the restrictions on public access that you
> attempt to impose here under the label of "public-interest leverage" - i.e.
> your own institution retaining control over the content so you can monetize
> it - are likely to be seen as unacceptable by the open content movement.
>
> Another point you should be aware of is that while Wikimedia volunteers
> spend a lot of time diligently enforcing the copyrights of third parties
> (by deleting infringing material uploaded to Wikimedia projects), they
> explicitly reject<
> https://commons.wikimedia.org/wiki/Commons:Non-copyright_restrictions>
> enforcing non-copyright terms imposed by such third parties.
>
> Lastly, a question:
> You say here<
> https://www.institutionaldatainitiative.org/posts/open-call-for-collaborators>
> that you (the Institutional Data Initiative) are "one of the
> Harvard-affiliated beneficiaries of OpenAI's new NextGenAI consortium". Is
> OpenAI also one of your customers paying for privileged access to the
> Institutional Books dataset (while your terms exclude the general public
> from it for the time being)?
> I'm not arguing that OpenAI is evil per se, or that academic institutions
> and GLAMs must never collaborate with Big Tech companies. (After all,
> Google Books, which your project is based on, was such a collaboration
> between Big Tech and academic libraries in the first place. And many
> Wikipedians can testify to its great value and usefulness for the general
> public.) However, the obfuscatory language in your post here regarding
> commercial partnerships and monetization ("garnering support from
> commercial actors as we iterate on sustainability"), combined with vague
> gesturing at a possible time-delayed free release at an undetermined point
> in the future, doesn't exactly inspire trust in this matter. If the project
> provides more transparent information about this question elsewhere, feel
> free to provide pointers. It would also be interesting to learn how much
> revenue the Institutional Data Initiative projects to derive from this
> monetization of public domain works.
>
> Regards, Tilman ([[User:HaeB]])
>
> On Mon, Jul 7, 2025 at 7:32 AM Leppert, Greg <[email protected]> wrote:
> Hi all. Great to meet you and thank you to Leila for inviting me to join
> the list. I’m the Executive Director of the Institutional Data Initiative<
> https://www.institutionaldatainitiative.org>
> (IDI) at Harvard and I wanted to share our recent data
> release—Institutional Books<
> https://www.institutionaldatainitiative.org/institutional-books>,
> a collection of nearly 1M public domain books, scanned at Harvard Library
> through the Google Books project.
>
> IDI works with libraries and other knowledge institutions to publish their
> collections as data with the goal of establishing public-interest leverage
> in the AI ecosystem while improving collections for traditional patron
> usage. With each project, we look for novel ways to structure and analyze
> the collection and set standards along the way. With Institutional Books,
> we tackled language analysis, topic classification, and OCR correction, and
> our technical report<https://arxiv.org/abs/2506.08300>
> has even more. We hope to evolve the collection over time and release new
> formats as we go, such as EPUB and Markdown.
>
> We’re also using this moment to experiment with a time-bounded Terms of
> Service that attempts to privilege open and noncommercial actors while
> garnering support from commercial actors as we iterate on sustainability.
> The goal is to eventually make the collection and all of its scans
> available under a more traditional open model.
>
> Thoughts, questions, and collaboration welcomed. We also have a Slack
> where we’re talking about this collection and others. Our next project is to
> dig in on a new collection of old newspapers, in collaboration with Boston
> Public Library, as we work toward building a global commons.
>
> —Greg
> _______________________________________________
> Wiki-research-l mailing list -- [email protected]
> <mailto:[email protected]>
> To unsubscribe send an email to [email protected]
> <mailto:[email protected]>
