I am a volunteer for Bookshare.org. When you submit a scanned book there to be processed into an ebook, the site runs automatic tools that analyse the quality of the scan and report statistics on the results. I am far from a techie and do not understand most of it, but you might be able to get some information about it by contacting Bookshare.
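For what it is worth, one common way to put a number on the kind of problem described in the quoted message is a character error rate: proofread a sample by hand, then count how many single-character edits separate the OCR output from the proofread text, divided by the length of the proofread text. The sketch below is only an illustration of that idea, not a description of what Bookshare's tools actually do. The function names `levenshtein` and `character_error_rate` are made up for this example, and the sample strings are the "Norwood Press" line from the title page quoted below.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the proofread reference.
    0.0 means a perfect match; values above 1.0 are possible when the
    OCR output is much longer than the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hand-typed line versus the text copied out of the PDF (from the
    # message quoted below).
    reference = "Norwood Press"
    ocr_output = "NortoonU tyrezs"
    print(f"Character error rate: {character_error_rate(reference, ocr_output):.2f}")
```

A score like this could be computed for one proofread page and tagged onto the digital text, as the original question suggests. It says nothing about pages that were not sampled, so it is best read as an estimate.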
_ _ _
"The meme for blind faith secures its own perpetuation by the simple unconscious expedient of discouraging rational inquiry." - Richard Dawkins
Follow me on Twitter: http://twitter.com/rogerbailey81
The Militant: http://www.themilitant.com
Pathfinder Press: http://www.pathfinderpress.com
Granma International: http://www.granma.cu/ingles/index.html

----- Original Message -----
From: "Laurence Penney" <[email protected]>
To: "Open Library -- general discussion" <[email protected]>
Sent: Monday, October 03, 2011 12:39 PM
Subject: [ol-discuss] Recording the quality of a book's OCR

> I've been wondering about methods for indicating quality claims for
> electronic book transcription.
>
> Let's say we have an OCR'd PDF, such as this one:
>
> http://ia700408.us.archive.org/24/items/spoonriveranthol00mastiala/spoonriveranthol00mastiala.pdf
>
> The text on the title page is easy for me to type in:
>
> ----
> COPYRIGHT, 1914 AND 1915,
> BY WILLIAM MARION REEDY.
> COPYRIGHT, 1915 AND 1916,
> BY THE MACMILLAN COMPANY.
> Set up and electrotyped. Published April, 1915.
> Norwood Press
> J. S. Cushing Co. — Berwick & Smith Co.
> Norwood, Mass., U.S.A.
> ----
>
> But the text as copied from OS X Preview, is this:
>
> ----
> COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY.
> COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY.
> up and electrotyped. Published April, 1915.
> NortoonU tyrezs J. 8. Gushing Co. Berwick & Smith C. Norwood, Maas.,
> U.S.A.
> ----
>
> This seems to me to be pretty poor. The publisher information is barely
> recognizable. An entire word ("Set") has been ignored. Blackletter type
> has totally confused the OCR. Line breaks are missed.
>
> Is there any standard practice for measuring the quality of an OCR
> transcription? Or any other transcription? For example, a random full page
> of text could be proofread and given a score, which could be tagged onto
> the digital text. OCR engine makers would have a handy library of
> problematic texts.
>
> It would at least be good to be able to mark those texts that have been
> thoroughly checked - something that any important edition surely deserves.
>
> And it would be good to mark those which have failed, such as this:
>
> http://books.google.com/books?id=IrY9AAAAcAAJ&pg=PT41#v=onepage&q&f=false
>
> Note that Google doesn't seem to understand the long s - ſ - transcribing
> it as f. Search that book above for ipſe, ipse and ipfe.
>
> Any thoughts?
>
> - L
>
> _______________________________________________
> Ol-discuss mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> To unsubscribe from this mailing list, send email to
> [email protected]
