Re: [tesseract-ocr] Re: Using tesseract_best (or other models?) for 18th-century English printed text

Graham Toal Mon, 21 Apr 2025 14:34:34 -0700

On Mon, Apr 21, 2025 at 2:02 PM RuePat07 <[email protected]> wrote:


> Try preprocessing your documents. Create a black-and-white image first and
> crop the images to the text area. Try to enhance the text by thresholding.
> In my experience, tesseract does not do well when there are stray lines or
> boxes. You can also experiment with different psm modes; I found changing
> them useful in my application. You could also fine-tune the eng/latin
> model for that font if all the documents are in a similar font.
>

Actually, that document looks like one prepared by a tool that stores three
image layers for every page. One of those layers is the text alone, in
greyscale, with the background already removed (it is inverted, white on
black, but that is easily fixed).  You can extract the images from the file
and keep every third one, which will be the text layer.  I don't know which
tool creates PDFs in this format, but the approach is similar to the one
DjVu originally pioneered: separating out the background and replacing it
with a more compact version.  I've seen it in files from both Google Books
and archive.org.  In my current project, this was all the processing I
found necessary for those extracted layers, basically just removing a
little noise:
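    # Pass 1: remove a little noise from the extracted layer ($1 = image file).
    convert \
        "$1" \
        -write MPR:source \
        -morphology close rectangle:3x4 \
        -clip-mask MPR:source \
        -morphology erode:8 square \
        +clip-mask \
        scan_intermediate.jpg
    # Pass 2: shave 150px off every edge, trim what border remains, and save.
    convert scan_intermediate.jpg -shave 150x150 -fuzz 20% -trim +repage \
        "../images/$1"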
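
A minimal sketch of the extraction step described above (pull the images
out of the PDF, keep every third one, undo the inversion). It assumes
poppler's pdfimages and ImageMagick's convert are installed. The function
names, the file names and the choice of which of the three layers holds
the text are all assumptions to be checked against your own file.

```shell
#!/bin/sh
# Sketch only: helper names, paths and the layer offset are hypothetical.

# Print every third line of stdin, starting at line $1 (1, 2 or 3).
every_third() {
    awk -v start="$1" 'NR % 3 == start % 3'
}

# Usage: extract_text_layers book.pdf 3 layers
# Extracts all embedded images as PNG, keeps one layer per page, and
# negates it so the text ends up black on white.
# Caveat: the glob sorts names as text, so the order is only reliable
# while pdfimages' zero-padded counter stays within three digits.
extract_text_layers() {
    pdf=$1 start=$2 outdir=$3
    mkdir -p "$outdir/raw" || return 1
    pdfimages -png "$pdf" "$outdir/raw/img" || return 1
    printf '%s\n' "$outdir"/raw/img-*.png | every_third "$start" |
    while IFS= read -r f; do
        convert "$f" -negate "$outdir/$(basename "$f")"
    done
}
```

Running "pdfimages -list" on the PDF first shows the size and type of each
embedded image, which should help in working out which offset is the text
layer.
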
btw, while I'm posting: here are some 'gotchas' to look out for, which I've
come across myself recently when OCRing and proofreading similar 18th- and
19th-century documents. Some of them arose because the typesetter
substituted whatever was available for a less common character.

- The actual letter 'f' substituted for the long medial s.
- 'y' substituted for thorn: the old-style thorn that looks like a y or a
  gamma, not the Unicode character 'þ', which looks somewhat like a p, b
  or beta. (Example: "for the using þe way of witchcraft of moudiwart's
  feet upon him in his purse given to him þe Satan for the cause that sa
  lang as he had them upon him he sould never want siller.") A 'the'
  printed with the y-shaped thorn is frequently and erroneously rendered
  (and mispronounced) as 'ye'.
- An apostrophe used in Scottish names such as M`Donald in place of a
  superscript 'c'.
- Ligatures that you don't see much nowadays (e.g. ct), and more frequent
  use of ligatures in general (e.g. Æneas).
- Much more common use of superscripts where in modern times we'd use an
  apostrophe to denote missing letters before the word-final cluster of
  letters.
- u for v and vice versa.
- Qu for W.
- Thin spaces before some punctuation, caused by mechanical issues with
  the type: ' ;' should be OCR'd as just ';'.
- The old-style '&', which looks more like the letters "Et".
- Accents that you might not be expecting and might dismiss as bad OCR,
  e.g. "We hairtlie thank thé Hevinlie Father".
- Vulgar fractions set with a horizontal bar, which cannot be represented
  as such in Unicode: its fraction characters are normally drawn with a
  diagonal bar.
- The old letter yogh, which is written with a descender and often
  rendered as (and similarly mispronounced as) 'z', as in the surname
  'Menzies', which is pronounced 'meengis'. The name of the American jazz
  musician Charlie Mingus actually preserves the pronunciation, but not
  the spelling, of the original name Menzies.

Few of these are caught by tesseract, so they will require manual
proofreading.
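
A couple of these can at least be tidied up or flagged mechanically before
proofreading. The following is only a sketch: the function names, the
punctuation set and the patterns are illustrative assumptions, it handles
only ASCII spaces and tabs, and it will produce false positives (a genuine
'ye', for instance).

```shell
#!/bin/sh
# Sketch only: names and patterns are hypothetical, not part of tesseract.

# Remove spaces or tabs that precede ; : ! ?  (so ' ;' becomes ';').
strip_space_before_punct() {
    sed -E 's/[[:blank:]]+([;:!?])/\1/g'
}

# Print, with line numbers, lines worth a manual look: a standalone
# 'ye' or 'Ye' (possibly a thorn-spelt 'the'), or a name like M`Donald.
flag_for_review() {
    grep -nE -e '(^|[^[:alpha:]])[Yy]e([^[:alpha:]]|$)' -e 'M`[A-Z]'
}
```

For example, "tesseract page.png stdout | strip_space_before_punct >
page.txt" followed by "flag_for_review < page.txt" gives a cleaned text
file and a list of lines to check by eye (page.png and page.txt are
placeholder names).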

Good luck with your project.

Graham
