Re: [ol-discuss] Recording the quality of a book's OCR

Roger Loran Bailey Mon, 03 Oct 2011 10:34:00 -0700

I am a volunteer for Bookshare.org. When you submit a scanned book to that 
site to be processed into an ebook, they have automatic tools that analyse 
the quality and give statistics on the results. I am far from a techie and 
do not understand most of it, but you might be able to get some information 
about it if you contact Bookshare.

----- Original Message ----- 
From: "Laurence Penney" <[email protected]>
To: "Open Library -- general discussion" <[email protected]>
Sent: Monday, October 03, 2011 12:39 PM
Subject: [ol-discuss] Recording the quality of a book's OCR


> I've been wondering about methods for indicating quality claims for 
> electronic book transcription.
>
> Let's say we have an OCR'd PDF, such as this one:
>
> http://ia700408.us.archive.org/24/items/spoonriveranthol00mastiala/spoonriveranthol00mastiala.pdf
>
> The text on the title page is easy for me to type in:
>
> ----
> COPYRIGHT, 1914 AND 1915,
> BY WILLIAM MARION REEDY.
> COPYRIGHT, 1915 AND 1916,
> BY THE MACMILLAN COMPANY.
> Set up and electrotyped. Published April, 1915.
> Norwood Press
> J. S. Cushing Co. — Berwick & Smith Co.
> Norwood, Mass., U.S.A.
> ----
>
> But the text as copied from OS X Preview is this:
>
> ----
> COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY.
> COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY.
> up and electrotyped. Published April, 1915.
> NortoonU tyrezs J. 8. Gushing Co. Berwick & Smith C. Norwood, Maas., 
> U.S.A.
> ----
>
> This seems to me to be pretty poor. The publisher information is barely 
> recognizable. An entire word ("Set") has been ignored. Blackletter type 
> has totally confused the OCR. Line breaks are missed.
>
> Is there any standard practice for measuring the quality of an OCR 
> transcription? Or any other transcription? For example, a random full page 
> of text could be proofread and given a score, which could be tagged onto 
> the digital text. OCR engine makers would have a handy library of 
> problematic texts.
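
One score that could be attached to a proofread page is a character error rate: the edit distance between the proofread text and the OCR text, divided by the length of the proofread text. The following is only a minimal sketch in Python under that assumption. The function names are made up for illustration, and it does not describe how any particular OCR engine or archive scores its texts.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Collapse every run of whitespace (including line breaks) to one space."""
    return " ".join(text.split())


def character_error_rate(reference: str, ocr: str) -> float:
    """Edit distance divided by the length of the proofread reference text."""
    ref = normalize(reference)
    hyp = normalize(ocr)
    if not ref:
        raise ValueError("reference text is empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical usage with one line from the title page quoted above.
proofread = "Norwood, Mass., U.S.A."
from_ocr = "Norwood, Maas.,\nU.S.A."
print(character_error_rate(proofread, from_ocr))
```

Note that normalize() deliberately ignores missed line breaks. If lost line breaks should count against the score, compare the raw strings instead.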
>
> It would at least be good to be able to mark those texts that have been 
> thoroughly checked - something that any important edition surely deserves.
>
> And it would be good to mark those which have failed, such as this:
>
> http://books.google.com/books?id=IrY9AAAAcAAJ&pg=PT41#v=onepage&q&f=false
>
> Note that Google doesn't seem to understand the long s - ſ - transcribing 
> it as f. Search that book above for ipſe, ipse and ipfe.
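
For the long s case, a search tool could fold ſ (U+017F) to a plain s before matching, and could also try the misread spelling with f. This is a rough sketch of that idea, not a description of what Google Books actually does. The helper names are hypothetical, and replacing every s with f is a crude guess at the OCR error.

```python
LONG_S = "\u017f"  # the character ſ


def fold_long_s(text: str) -> str:
    """Replace every long s with an ordinary s."""
    return text.replace(LONG_S, "s")


def search_variants(word: str) -> set:
    """Spellings to try for a word: modern s, long s, and the
    f that an OCR engine may have produced in place of the long s."""
    folded = fold_long_s(word)
    return {
        folded,                       # ipse
        folded.replace("s", LONG_S),  # ipſe
        folded.replace("s", "f"),     # ipfe
    }


print(sorted(search_variants("ip\u017fe")))
```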
>
> Any thoughts?
>
> - L
>
> _______________________________________________
> Ol-discuss mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> To unsubscribe from this mailing list, send email to 
> [email protected]
> 

