[MCN-L] RESPONSES: Low-cost digitization rig (Cherry, Rich)

Cherry, Rich Sat, 9 Aug 2008 22:25:23 -0700

A medium-sized museum might conservatively have 5000 boxes of archival
material with, let's say very conservatively, 2000 pages per box, or 10
million pages to digitize, many of which are fragile or subject to
copyright.  Many organizations this size struggle to have folder-level
documentation on their collection of boxes.  So let an intern start
scanning... let's say they are amazing and they can scan 500 pages a day,
5 days a week.  At that amazing rate you will be done in about 75 years,
provided your institution does not create another 10 million pages in
those 75 years.
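
For anyone who wants to check the arithmetic, here is a rough sketch of
the estimate above. The function name and the 52-working-weeks-per-year
figure are my assumptions (no holidays or sick days); the box, page and
scanning-rate numbers are the ones given in the paragraph.

```python
def years_to_scan(boxes, pages_per_box, pages_per_day, days_per_week,
                  weeks_per_year=52):
    """Return the years one scanner operator needs to work through a backlog.

    weeks_per_year=52 is an assumption: it ignores holidays and leave,
    so the real figure would be somewhat higher.
    """
    total_pages = boxes * pages_per_box
    pages_per_year = pages_per_day * days_per_week * weeks_per_year
    return total_pages / pages_per_year


# Figures from the scenario: 5000 boxes, 2000 pages each,
# one intern scanning 500 pages a day, 5 days a week.
years = years_to_scan(boxes=5000, pages_per_box=2000,
                      pages_per_day=500, days_per_week=5)
print(f"Total pages: {5000 * 2000:,}")
print(f"Years for one intern: {years:.1f}")
```

With these inputs the result comes out a little under 77 years, which is
in line with the "about 75 years" ballpark.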


However, let's say you invest in 6 staff for 2 years and find the most
important documents in the 10 million pages... let's say they find 50,000
of them.  You scan them, organize them, catalog them, and get permission
to republish them online.  This results in something that the institution,
scholars and the public can use immediately.

I think the second scenario is how a museum person would approach the
problem you are trying to solve.  The best solution is probably
somewhere in the middle but it inevitably requires staff support to
succeed.

Rich  

-----Original Message-----
From: mcn-l-bounces at mcn.edu [mailto:mcn-l-boun...@mcn.edu] On Behalf Of
Christopher J. Mackie
Sent: Friday, August 08, 2008 2:54 PM
To: mcn-l at mcn.edu
Subject: Re: [MCN-L] RESPONSES: Low-cost digitization rig (Cherry, Rich)

Rich; responses to your questions/concerns inline.

----------------------------------------------------------------------
From: "Cherry, Rich" <rche...@skirball.org>

<snip>

Who leads the project (gets the institution behind it, finds the free
labor source, finds space, organizes tasks and manages a schedule)?

> One goal is to reduce the costs, space, and skill demands
substantially enough that this becomes far less challenging. The
software has workflow and project management capabilities built in.
 
Who selects the material? Who moves the material to the location for
scanning (is the free labor a security issue)? Who reviews the material
to see if there are copyright issues?

> All good questions. Remember that we're not trying to reproduce the
Million Book Project; the goal is to help with lots of small
collections, for which, taken individually, these questions are not
impossibly intimidating. 
        
Who proofs the final product to see if errors were made?

> The software supports real-time QA for common errors; some additional
work might be required, presumably by a staffer. How much work that is
depends on the quality of the source, etc. The new software should
reduce the QA load as compared to anything else we've seen. If you're
scanning ordinary books of reasonable quality, the staffer's effort
should be minimal. Fixing OCR is, of course, another story. 

Where will the product live when the funding for online archives
disappears?

> The product will support one-button archiving online; if you're OK
with Internet Archive as a host, this problem is solved. Proprietary
content is your problem. 

If there is no cataloging for access other than the OCR, is the only use
a huge repository of unconnected individual pages? Or, if it's books and
collections, who catalogs them and connects them?

> The software automatically structures documents, including books and
collections; one of its improvements over commercial OCR (both accuracy
and usability) is that it's *designed* for compound docs, as well as
individual pages. How much human effort is required is a function of how
much individualized metadata entry you want to do; the system will
automate all the batch stuff, but if you want to mark up each word, you
can. 

I do think that the online archive piece might move a few organizations
closer to doing it.  It might even be more attractive if some of the OCR
processing took place there as well. Is part of the plan to use
something like Amazon Web Services for this?

> We've been talking about this. It's possible one of the 'big'
digitizers might be willing to do remote OCR--but we're focusing on
small projects, and the OCR runs fine on a laptop, so I'm not sure why
this is necessary? Remember that we're not trying to put Google out of
business; we're trying to help with materials that wouldn't make it to
Google or IA on their own.

--Chris
