[Wiki-research-l] Institutional Books

Leppert, Greg Mon, 07 Jul 2025 07:32:06 -0700

Hi all. Great to meet you and thank you to Leila for inviting me to join the 
list. I’m the Executive Director of the Institutional Data 
Initiative<https://www.institutionaldatainitiative.org> (IDI) at Harvard and I 
wanted to share our recent data release—Institutional 
Books<https://www.institutionaldatainitiative.org/institutional-books>, a 
collection of nearly 1M public domain books, scanned at Harvard Library through 
the Google Books project.


IDI works with libraries and other knowledge institutions to publish their 
collections as data with the goal of establishing public-interest leverage in 
the AI ecosystem while improving collections for traditional patron usage. With 
each project, we look for novel ways to structure and analyze the collection 
and set standards along the way. With Institutional Books, we tackled language 
analysis, topic classification, and OCR correction, and our technical 
report<https://arxiv.org/abs/2506.08300> has even more. We hope to evolve the 
collection over time and release new formats as we go, such as EPUB and 
Markdown.

We’re also using this moment to experiment with a time-bounded Terms of Service 
that attempts to privilege open and noncommercial actors while garnering 
support from commercial actors as we iterate on sustainability. The goal is to 
eventually make the collection and all of its scans available under a more 
traditional open model.

Thoughts, questions, and collaboration welcomed. We also have a Slack where 
we’re talking about this collection and others. Our next project is to dig in on 
a new collection of old newspapers, in collaboration with Boston Public 
Library, as we work toward building a global commons.

—Greg