Re: [ol-discuss] Recording the quality of a book's OCR

Edward Betts Thu, 29 Dec 2011 12:03:01 -0800

We don't currently have a system for recording the quality of the OCR or 
correcting mistakes.
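
For what it's worth, one common way to put a number on OCR quality is the character error rate: the edit distance between a proofread page and the OCR output, divided by the length of the proofread text. The sketch below is only an illustration and not something Open Library has. The function names are made up for this example, and collapsing runs of whitespace before comparing is an assumption (it means missed line breaks are not penalised).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr: str) -> float:
    """Edit distance divided by the length of the proofread reference."""
    # Assumption: whitespace runs (including line breaks) count as one space.
    ref = " ".join(reference.split())
    hyp = " ".join(ocr.split())
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Two lines from the title page quoted below in this thread.
    print(char_error_rate("Norwood, Mass., U.S.A.", "Norwood, Maas., U.S.A."))
    print(char_error_rate("Norwood Press", "NortoonU tyrezs"))
```

A score like this could be stored alongside a sample page, but it only says how bad the text is, not where the mistakes are.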


As you point out, the OCR doesn't properly handle blackletter type.

A system for correcting OCR is often requested. Conceptually it is quite 
simple: just a web page that shows the page image and a way to edit the 
text. However, we are keen to maintain page coordinate information for each 
word so that we can highlight words in the book reader and in search 
inside. This makes the problem more difficult.
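
To make the difficulty concrete, here is a rough sketch, assuming each OCR word is stored with a bounding box on the page image. The Word type, its field names and both functions are hypothetical and not our actual data format. Fixing the spelling of a single word is easy because the box is unchanged. Once an edit splits or merges words, the new boxes have to be guessed, here by dividing the old box in proportion to character counts.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Word:
    """One OCR word and its (hypothetical) pixel box on the page image."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def correct_word(words: Sequence[Word], index: int, new_text: str) -> List[Word]:
    """Easy case: change the spelling of one word, keep its box."""
    if not new_text or any(ch.isspace() for ch in new_text):
        raise ValueError("replacement must be a single non-empty word")
    fixed = list(words)
    fixed[index] = replace(words[index], text=new_text)
    return fixed


def split_word(words: Sequence[Word], index: int, parts: Sequence[str]) -> List[Word]:
    """Hard case: one OCR word becomes several words.

    The new boxes are only a guess: the old box is divided horizontally
    in proportion to the number of characters in each part.
    """
    if not parts or any(not p for p in parts):
        raise ValueError("parts must be non-empty words")
    old = words[index]
    total = sum(len(p) for p in parts)
    width = old.right - old.left
    pieces = []
    used = 0
    for p in parts:
        start = old.left + width * used // total
        used += len(p)
        end = old.left + width * used // total
        pieces.append(Word(p, start, old.top, end, old.bottom))
    return list(words[:index]) + pieces + list(words[index + 1:])


if __name__ == "__main__":
    page = [Word("Gushing", 100, 50, 170, 70), Word("Co.", 180, 50, 210, 70)]
    print(correct_word(page, 0, "Cushing"))
    print(split_word(page, 0, ["Cush", "ing"]))
```

A free-form text box gives us none of this structure, so a real editor would have to work word by word or re-align the corrected text against the original boxes.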

We would like to build a correction system, but we don't have the resources.

-- 
Edward.

On 2011-10-03 09:39, Laurence Penney wrote:
> I've been wondering about methods for indicating quality claims for 
> electronic book transcription.
>
> Let's say we have an OCR'd PDF, such as this one:
>
> http://ia700408.us.archive.org/24/items/spoonriveranthol00mastiala/spoonriveranthol00mastiala.pdf
>
> The text on the title page is easy for me to type in:
>
> ----
> COPYRIGHT, 1914 AND 1915,
> BY WILLIAM MARION REEDY.
> COPYRIGHT, 1915 AND 1916,
> BY THE MACMILLAN COMPANY.
> Set up and electrotyped. Published April, 1915.
> Norwood Press
> J. S. Cushing Co. — Berwick&  Smith Co.
> Norwood, Mass., U.S.A.
> ----
>
> But the text as copied from OS X Preview, is this:
>
> ----
> COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY.
> COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY.
> up and electrotyped. Published April, 1915.
> NortoonU tyrezs J. 8. Gushing Co.     Berwick&  Smith C. Norwood, Maas., 
> U.S.A.
> ----
>
> This seems to me to be pretty poor. The publisher information is barely 
> recognizable. An entire word ("Set") has been ignored. Blackletter type has 
> totally confused the OCR. Line breaks are missed.
>
> Is there any standard practice for measuring the quality of an OCR 
> transcription? Or any other transcription? For example, a random full page of 
> text could be proofread and given a score, which could be tagged onto the 
> digital text. OCR engine makers would have a handy library of problematic 
> texts.
>
> It would at least be good to be able to mark those texts that have been 
> thoroughly checked - something that any important edition surely deserves.
>
> And it would be good to mark those which have failed, such as this:
>
> http://books.google.com/books?id=IrY9AAAAcAAJ&pg=PT41#v=onepage&q&f=false
>
> Note that Google doesn't seem to understand the long s - ſ - transcribing it 
> as f. Search that book above for ipſe, ipse and ipfe.
>
> Any thoughts?
>
> - L
>
> _______________________________________________
> Ol-discuss mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> To unsubscribe from this mailing list, send email to 
> [email protected]

