Hello Tilman,

Thank you for your explanations concerning NC. This reminded me of a discussion during the "Strategy" talks, when the strategy group tried to introduce NC and ND provisions in the Wikimedia wikis. The negative reactions were... quite notable.

https://meta.wikimedia.org/wiki/Talk:Movement_Strategy/Recommendations/Iteration_1/Diversity/9
Kind regards
Ziko

On Mon, Aug 25, 2025 at 7:43 PM Leppert, Greg <[email protected]> wrote:

> Thanks for sharing your thoughts, Tilman. I fear I should have mentioned that our goal isn’t to replicate or strictly adhere to existing movements. Given that I’m new to the list, I wouldn’t be surprised to learn that that means my original message was off topic or, at the least, confusing. Apologies if so! Regardless, I appreciate you taking the time to articulate your perspective.
>
> —Greg
>
> On Aug 24, 2025, at 2:41 AM, Tilman Bayer <[email protected]> wrote:
>
> Hi Greg (CCing the "Wikimedia & GLAM collaboration" mailing list),
>
> First, as there has been no reaction here yet: Congrats to you and Harvard Law School Library on this release! A dataset of one million high-quality-OCR public domain books sounds very impressive.
>
> However, your message here, and in particular its highlighting of "time-bounded Terms of Service that attempts to privilege open and noncommercial actors", gives the distinct impression that you are unaware of some central aspects of Wikipedia and the Wikimedia movement, or indeed the wider free-culture movement as well. While the Wikimedia Foundation is indeed a nonprofit organization, and Wikipedia and the other Wikimedia projects are indeed noncommercial, they have never accepted content licenses or terms that are confined to "open and noncommercial actors".
> So let me link some explanatory material:
>
> The Wikimedia Foundation's licensing policy <https://foundation.wikimedia.org/wiki/Resolution:Licensing_policy> (which governs the content on Wikipedia and all other Wikimedia projects) relies on a definition of "free content" that excludes licenses limited to noncommercial usage, like your terms are. Summarizing the rationales for this long-standing decision would go too far here - if you are interested in those, this <https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/thread/JUNXXJPIZRMCFAPNJEGXPPENCOS6DOQW/> might be a good starting point. But to highlight one well-known problem with such licenses (in particular the -NC variants of the Creative Commons licenses), because it may help to illustrate some especially problematic restrictions that you/your lawyers attempt to impose: People have found out time and again that it is difficult to actually define commercial usage in a way that doesn't have unintended consequences. (E.g. could a hobbyist blogger be sued for using an NC-licensed image because her blog features some Google ads?) Creative Commons even ran a whole study in an attempt to retroactively clarify such boundaries.
>
> But in any case, despite these well-documented complications, legal restrictions about the commercial *usage* of particular material still seem more straightforward to figure out than the restrictions on "intent" and "affiliation" of the *user* that you (or Harvard's lawyers?) try to impose in the terms of use for this release <https://huggingface.co/datasets/institutional/institutional-books-1.0>:
> "Open-source projects and other public-use efforts are welcome, even if they may indirectly support commercial use, so long as they are unaffiliated with commercial actors or intent."
> Your requirement that an open source project must not even be "affiliated" with "commercial ... intent" would likely exclude, say, the majority of widely used (e.g. by Wikimedia organizations <https://meta.wikimedia.org/wiki/FLOSS-Exchange>) open source software projects, which are frequently maintained either by a commercial company or by volunteers who also have a related day job as a developer or may offer paid support. Even the most anticapitalist purists in the free software movement shy away from such restrictions in their licenses.
> In any case, we can be pretty sure that your clause rules out the Wikimedia Foundation, as it is not just "affiliated" with a commercial actor but has one directly incorporated as a subsidiary, namely the for-profit Wikimedia LLC.
> You don't seem to be aware of this, given that you came here with the apparent impression that an offer to "privilege open and noncommercial actors" may enable a cooperation.
>
> The second clause of your terms <https://x.com/tilmanbayer/status/1933311788688552165> ("No Redistribution") is likewise a non-starter for "open actors" - it is almost the definition of non-open.
>
> I do realize of course that there will be many AI/ML folks on HF and elsewhere who are happy to use such a dataset while blissfully ignoring such attempts to impose restrictions on public domain content, perhaps assuming - possibly correctly - that you didn't think these terms of use through very thoroughly and are thus unlikely to enforce them, or who are simply not yet as familiar with the long-term effects of such legal footguns as Wikimedians and FLOSS developers have become over many years. That said, I've seen your terms cause consternation in the open AI/ML world too, e.g. on the EleutherAI Discord.
>
> You should also be aware that in the history of the Wikimedia movement there have been some ugly legal disputes with GLAMs (galleries, libraries, archives and museums, i.e. organizations like yours) who attempted to restrict reproduction of public domain works in their possession with similar rationales (i.e. an alleged need to extract revenue to refinance digitization efforts or such, which I hear echoing in your vague remarks about "sustainability", "ecosystem" etc.).
> Two examples:
>
> * https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_Wikimedia_Foundation_copyright_dispute
> * https://en.wikipedia.org/wiki/Reiss_Engelhorn_Museum#Wikimedia_lawsuit (While that museum prevailed in court against the Wikimedia Foundation, the EU Copyright Directive subsequently made such assertions of copyright over faithful reproductions of public domain works impossible.)
>
> I'm not saying that the Institutional Books project is likely to become similarly contentious (if only for the simple reason that Wikimedians have long been importing the same underlying Google Books scans <https://commons.wikimedia.org/wiki/Category:Scans_from_Google_Books>, often to do their own OCR and proofreading on Wikisource). I'm just trying to help you understand that the restrictions on public access that you attempt to impose here under the label of "public-interest leverage" - i.e.
> your own institution retaining control over the content so you can monetize it - are likely to be seen as unacceptable by the open content movement.
>
> Another point you should be aware of is that while Wikimedia volunteers spend a lot of time diligently enforcing the copyrights of third parties (by deleting infringing material uploaded to Wikimedia projects), they explicitly reject <https://commons.wikimedia.org/wiki/Commons:Non-copyright_restrictions> enforcing non-copyright terms imposed by such third parties.
>
> Lastly, a question:
> You say here <https://www.institutionaldatainitiative.org/posts/open-call-for-collaborators> that you (the Institutional Data Initiative) are "one of the Harvard-affiliated beneficiaries of OpenAI's new NextGenAI consortium". Is OpenAI also one of your customers paying for privileged access to the Institutional Books dataset (while your terms exclude the general public from it for the time being)?
> I'm not arguing that OpenAI is evil per se, or that academic institutions and GLAMs must never collaborate with Big Tech companies. (After all, Google Books, which your project is based on, was such a collaboration between Big Tech and academic libraries in the first place. And many Wikipedians can testify to its great value and usefulness for the general public.)
> However, the obfuscatory language in your post here regarding commercial partnerships and monetization ("garnering support from commercial actors as we iterate on sustainability"), combined with vague gesturing at a possible time-delayed free release at an undetermined point in the future, doesn't exactly inspire trust in this matter. If the project provides more transparent information about this question elsewhere, feel free to provide pointers. It would also be interesting to learn how much revenue the Institutional Data Initiative projects to derive from this monetization of public domain works.
>
> Regards, Tilman ([[User:HaeB]])
>
> On Mon, Jul 7, 2025 at 7:32 AM Leppert, Greg <[email protected]> wrote:
> Hi all. Great to meet you and thank you to Leila for inviting me to join the list. I’m the Executive Director of the Institutional Data Initiative <https://www.institutionaldatainitiative.org> (IDI) at Harvard and I wanted to share our recent data release: Institutional Books <https://www.institutionaldatainitiative.org/institutional-books>, a collection of nearly 1M public domain books, scanned at Harvard Library through the Google Books project.
>
> IDI works with libraries and other knowledge institutions to publish their collections as data with the goal of establishing public-interest leverage in the AI ecosystem while improving collections for traditional patron usage. With each project, we look for novel ways to structure and analyze the collection and set standards along the way. With Institutional Books, we tackled language analysis, topic classification, and OCR correction, and our technical report <https://arxiv.org/abs/2506.08300> has even more. We hope to evolve the collection over time and release new formats as we go, such as EPUB and Markdown.
>
> We’re also using this moment to experiment with a time-bounded Terms of Service that attempts to privilege open and noncommercial actors while garnering support from commercial actors as we iterate on sustainability. The goal is to eventually make the collection and all of its scans available under a more traditional open model.
>
> Thoughts, questions, and collaboration welcomed. We also have a Slack where we’re talking about this collection and others. Our next project is to dig in on a new collection of old newspapers, in collaboration with Boston Public Library, as we work toward building a global commons.
>
> —Greg
> _______________________________________________
> Wiki-research-l mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
