Re: [tesseract-ocr] Re: Using tesseract_best (or other models?) for 18th-century English printed text

Mahmoud Mohamed Sat, 24 May 2025 16:31:10 -0700

Have you solved it yet? I would suggest a combination of Tesseract and AI.
Normally I try Tesseract first: I write some Python scripts to enhance or
prepare the documents and use pytesseract; if that does not work, I use an
AI model to correct the mistakes.
If you cannot get it working, and the documents contain no private
information, send me the ones you need to extract and I will help in my
free time, or I will try a few pages and tell you which script to use and
which model can aid you in the process. Best of luck.
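
A minimal sketch of the "Tesseract first" step, assuming the `tesseract`
command-line program (version 4 or later, which spells the flag `--psm`) is
installed and on PATH. The helper names and the particular page
segmentation modes tried are illustrative assumptions, not something taken
from this thread, and the sketch calls the CLI through the standard library
rather than through pytesseract:

```python
import subprocess


def tesseract_command(image_path, psm=3, lang="eng"):
    """Build the argument list for one Tesseract CLI run that prints text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_with_psm_modes(image_path, psm_modes=(3, 4, 6), lang="eng"):
    """Run Tesseract once per page segmentation mode and return {psm: text}.

    Assumes the `tesseract` executable is on PATH; raises
    subprocess.CalledProcessError if any run fails.
    """
    results = {}
    for psm in psm_modes:
        completed = subprocess.run(
            tesseract_command(image_path, psm, lang),
            capture_output=True,
            text=True,
            encoding="utf-8",
            check=True,
        )
        results[psm] = completed.stdout
    return results
```

Comparing the returned texts for one or two representative pages is usually
enough to pick a mode before running the whole book.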


On Mon, Apr 21, 2025, 11:34 PM Graham Toal <[email protected]> wrote:

> On Mon, Apr 21, 2025 at 2:02 PM RuePat07 <[email protected]>
> wrote:
>
>> Try preprocessing your documents. Create a black-and-white image first
>> and crop the images to the text area. Try to enhance the text by
>> thresholding. In my experience, Tesseract does not do well when there are
>> stray lines or boxes. You can also experiment with different psm modes; I
>> found changing them useful in my application. You could also fine-tune the
>> eng/latin model for that font if all the documents are in a similar font.
>>
>
> Actually that document looked like one of the ones that has been prepared
> with whatever tool it is that creates 3 layers for every page, and one of
> those layers is the text only layer in grey scale, with the background
> already removed (although it is inverted white on black which is easily
> fixed).  You can extract those images from the file and keep every third
> one, which will be the text.  I don't know which tool is creating PDFs in
> this format, but it's similar to the way that DjVu originally pioneered
> separating the background and replacing it with a more compact version.
> I've seen it in files from both Google Books and archive.org.  In my
> current project, this was all I found necessary to add to those extracted
> layers - basically just removing a little noise:
>     convert \
>         "$1" \
>         -write MPR:source \
>         -morphology close rectangle:3x4 \
>         -clip-mask MPR:source \
>         -morphology erode:8 square \
>         +clip-mask \
>         scan_intermediate.jpg
>     convert scan_intermediate.jpg -shave 150x150 -fuzz 20% -trim +repage \
>         "../images/$1"
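
The "keep every third one" step described above could be sketched in Python
as follows, assuming the images have already been extracted from the PDF
and are passed in extraction order, exactly three per page. Which of the
three is the text layer is an assumption here (`text_layer_index=2`, the
last of each group); check a few pages by eye and adjust. The function name
and its parameters are hypothetical:

```python
def select_text_layers(image_paths, layers_per_page=3, text_layer_index=2):
    """Return every `layers_per_page`-th path, starting at `text_layer_index`.

    `image_paths` must already be in extraction order (page 1 layers first,
    then page 2, and so on); this function does not sort them.
    """
    if not 0 <= text_layer_index < layers_per_page:
        raise ValueError("text_layer_index must be in range(layers_per_page)")
    return list(image_paths)[text_layer_index::layers_per_page]
```

For example, with nine extracted images named `img-000.png` to
`img-008.png`, the defaults keep `img-002.png`, `img-005.png` and
`img-008.png`.
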
> By the way, while I'm posting, here are some 'gotchas' to look out for
> which I've come across myself recently when OCRing and proofreading
> similar 18th- and 19th-century documents.  Some of them were due to the
> typesetter substituting what was available for a less common character.
>
> The actual letter 'f' substituted for the long medial s.  'y' substituted
> for thorn: the old-style thorn that looks like a y or a gamma, not the
> form encoded in Unicode, which looks somewhat like a p or b or beta
> (example: for the using þe way of witchcraft of moudiwart's feet upon him
> in his purse given to him þe Satan for the cause that sa lang as he had
> them upon him he sould never want siller.), which is frequently and
> erroneously rendered (and mispronounced) as 'ye'.  An apostrophe used in
> Scottish names like M`Donald in place of a superscript 'c'.
>
> Various ligatures that you don't see much nowadays (e.g. ct), and more
> common use of the ones you do (e.g. Æneas).  Much more common use of
> superscripts where in modern times we'd use an apostrophe to denote
> missing letters before the word-final cluster of letters.  u for v and
> vice versa.  Qu for W.  Thin spaces before some punctuation, caused by
> mechanical issues with the type (e.g. ' ;', which should be OCR'd as just
> ';').  Use of the old-style '&', which looks more like the letters "Et".
> Use of accents that you might not be expecting and might dismiss as bad
> OCR, e.g. "We hairtlie thank thé Hevinlie Father".  Use of vulgar
> fractions with a horizontal bar, which cannot be represented distinctly
> in Unicode, whose fraction characters are normally drawn with a diagonal
> bar.
>
> The old letter yogh, which is written with a descender and often rendered
> as (and similarly mispronounced as) 'z', as in the surname 'Menzies',
> which is pronounced 'meengis'.  The name of American jazz musician
> Charlie Mingus actually preserves the pronunciation but not the spelling
> of the original name of Menzies.
>
> Few of these are caught by Tesseract, and they will require manual
> proofreading.
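
A small post-OCR helper along the lines of the list above, as a sketch
only. It closes up the stray space before punctuation and flags words that
contain characters worth a second look. The set of punctuation marks, the
list of suspect characters, and the function names are all illustrative
assumptions; it does not replace manual proofreading and cannot detect an
'f' that was printed in place of a long s:

```python
import re

# Ordinary, thin (U+2009) or narrow no-break (U+202F) spaces directly
# before punctuation. Treating : ! ? like ; is an assumption.
_SPACE_BEFORE_PUNCT = re.compile(r"[ \u2009\u202f]+([;:!?])")

# Characters a human proofreader may want to check (not exhaustive).
SUSPECT_CHARS = {
    "ſ": "long s",
    "þ": "thorn",
    "ȝ": "yogh",
    "`": "backtick, possibly standing in for a superscript c",
    "Æ": "ligature",
    "æ": "ligature",
}


def close_up_punctuation(text):
    """Remove spaces that sit immediately before ; : ! or ?."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


def flag_for_proofreading(text):
    """Return (line_number, word, reason) for each suspect character found."""
    flags = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            for char, reason in SUSPECT_CHARS.items():
                if char in word:
                    flags.append((line_number, word, reason))
    return flags
```

Running `flag_for_proofreading` over the OCR output gives a worklist of
places to compare against the page image rather than an automatic fix.
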
>
> Good luck with your project.
>
> Graham
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/CABwQhLmKesS8PJa%2BM7o75oV%3DW9tm4L-9P62kGOMj8MZLDiLBnw%40mail.gmail.com
> <https://groups.google.com/d/msgid/tesseract-ocr/CABwQhLmKesS8PJa%2BM7o75oV%3DW9tm4L-9P62kGOMj8MZLDiLBnw%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>

