Have you solved it yet? I would suggest a combination of Tesseract and AI. Normally I try Tesseract first: I write some Python scripts to enhance or prepare the documents and run them through pytesseract, and if that does not work I use an AI model to correct the mistakes. If you cannot get it working and the documents contain no private information, send me the one you need to extract and I will help in my free time, or I will try a few pages and tell you which script to use and which model can aid you in the process. Best of luck.
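A rough sketch of the "Tesseract first, AI only if needed" decision, in case it helps. It assumes pytesseract and Pillow are installed; the helper names and the 80-point confidence cutoff are made up for illustration, so tune them on your own pages:

```python
from statistics import mean


def mean_word_confidence(confidences):
    """Average Tesseract word confidences, skipping the -1 entries used for non-word boxes."""
    words = [float(c) for c in confidences if float(c) >= 0]
    return mean(words) if words else 0.0


def needs_ai_review(confidences, min_mean_conf=80.0):
    """Return True when the page looks poor enough to pass on to a correction model.

    min_mean_conf is a hypothetical cutoff, not a recommended value.
    """
    return mean_word_confidence(confidences) < min_mean_conf


def ocr_page(image_path, psm=6):
    """OCR one page and report whether it should go to the second (AI) pass."""
    # Imported lazily so the helpers above can be used without the OCR stack.
    from PIL import Image, ImageOps
    import pytesseract

    # Light preparation only: greyscale plus contrast stretch.
    gray = ImageOps.autocontrast(Image.open(image_path).convert("L"))
    data = pytesseract.image_to_data(
        gray, config=f"--psm {psm}", output_type=pytesseract.Output.DICT
    )
    text = " ".join(word for word in data["text"] if word.strip())
    return text, needs_ai_review(data["conf"])
```

Pages for which `ocr_page` returns `True` as the second value are the ones I would hand to a model for correction; the rest can go straight to proofreading.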
On Mon, Apr 21, 2025, 11:34 PM Graham Toal <[email protected]> wrote:

> On Mon, Apr 21, 2025 at 2:02 PM RuePat07 <[email protected]> wrote:
>
>> Try preprocessing your documents. Create a black-and-white image first and crop the images to the text area. Try to enhance the text by thresholding. In my experience, Tesseract does not do well when there are stray lines or boxes. You can also experiment with different psm modes; I found changing them useful in my application. You could also fine-tune the eng/latin model for that font if all the documents are in a similar font.
>
> Actually, that document looked like one of those prepared with whatever tool it is that creates 3 layers for every page. One of those layers is the text-only layer in greyscale, with the background already removed (although it is inverted, white on black, which is easily fixed). You can extract those images from the file and keep every third one, which will be the text. I don't know which tool is creating PDFs in this format, but it's similar to the way that DjVu originally pioneered separating the background and replacing it with a more compact version. I've seen it in files from both Google Books and archive.org.
>
> In my current project, this was all I found necessary to apply to those extracted layers, basically just removing a little noise:
>
>     convert \
>       "$1" \
>       -write MPR:source \
>       -morphology close rectangle:3x4 \
>       -clip-mask MPR:source \
>       -morphology erode:8 square \
>       +clip-mask \
>       scan_intermediate.jpg
>     convert scan_intermediate.jpg -shave 150x150 -fuzz 20% -trim +repage "../images/$1"
>
> btw while I'm posting...
> some 'gotchas' to look out for, which I've come across myself recently when OCRing and proofreading similar 18th and 19th C documents, some of which were due to the typesetter substituting what was available for a less common character:
>
> - The actual letter 'f' substituted for the long medial s.
> - 'y' substituted for thorn: the old-style thorn that looks like a y or a gamma, not the representation used by Unicode, which looks somewhat like a p or b or beta. Example: "for the using þe way of witchcraft of moudiwart's feet upon him in his purse given to him þe Satan for the cause that sa lang as he had them upon him he sould never want siller." This is frequently erroneously rendered (and mispronounced) as 'ye'.
> - An apostrophe being used in Scottish names like M`Donald in place of a superscript 'c'.
> - Various ligatures that you don't see much nowadays (e.g. ct).
> - Much more common use of superscripts where in modern times we'd use an apostrophe to denote missing letters before the word-final cluster of letters.
> - u for v and vice versa.
> - Qu for W.
> - Thin spaces before some punctuation, caused by mechanical issues with the type (e.g. ' ;', which should be OCR'd as just ';').
> - More common use of ligatures (e.g. Æneas).
> - Use of the old-style '&', which looks more like the letters "Et".
> - Use of accents that you might not be expecting and might dismiss as bad OCR, e.g. "We hairtlie thank thé Hevinlie Father".
> - Use of vulgar fractions with a horizontal bar, which cannot be represented in Unicode, which only supports a diagonal bar.
> - The old letter yogh, which is written with a descender and often rendered as (and similarly mispronounced as) 'z', as in the surname 'Menzies', which is pronounced 'meengis'. The name of American jazz musician Charlie Mingus actually preserves the pronunciation but not the spelling of the original name Menzies.
>
> Few of these are caught by Tesseract, so they will require manual proofreading.
>
> Good luck with your project.
>
> Graham
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/CABwQhLmKesS8PJa%2BM7o75oV%3DW9tm4L-9P62kGOMj8MZLDiLBnw%40mail.gmail.com.
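On the gotchas in Graham's message: a couple of them can at least be flagged by a small post-processing script before the manual proofreading pass. This is only a sketch with made-up names and two deliberately narrow patterns. It closes up stray spaces before punctuation and lists candidate words for a human to check; it does not try to correct thorn, yogh or long s on its own:

```python
import re

# Ordinary, no-break, thin, hair and narrow no-break spaces sitting before punctuation.
_SPACE_BEFORE_PUNCT = re.compile(r"[ \u00a0\u2009\u200a\u202f]+([;:!?])")

# Hypothetical patterns: they only flag words for a proofreader, they fix nothing.
SUSPECT_PATTERNS = {
    "possible thorn printed as 'y'": re.compile(r"\b[Yy]e\b"),
    "backtick in place of superscript c": re.compile(r"\bM`\w+"),
}


def close_up_punctuation(text):
    """Turn ' ;' (and similar) into ';' so the OCR text matches modern spacing."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


def flag_suspects(text):
    """Return (reason, matched_text) pairs that deserve a look during proofreading."""
    hits = []
    for reason, pattern in SUSPECT_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((reason, match.group(0)))
    return hits
```

For example, `close_up_punctuation("want siller ; and so")` gives `"want siller; and so"`, and `flag_suspects` on a line containing "ye" and "M`Donald" reports both. Closing up spaces before ':' or '?' may be wrong for some languages, so check the punctuation set against your own documents.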

