I used the main branch as of Oct 21 (last commit Oct 13) on a scan of a
book from 1830. Then I created 32 cropped PNG files with corresponding
ground-truth text in .gt.txt files and used "make training" with
START_MODEL=eng to create a new model.
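For reference, the tesstrain invocation was roughly like this (the model
name, paths, and iteration count are illustrative, not my exact values):

    # Line images and their transcriptions paired up in the
    # ground-truth directory, e.g.:
    #   data/book1830-ground-truth/page001-line01.png
    #   data/book1830-ground-truth/page001-line01.gt.txt
    make training MODEL_NAME=book1830 START_MODEL=eng \
        TESSDATA=/usr/local/share/tessdata MAX_ITERATIONS=10000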
The book has over 200 unique misspellings or archaic spellings and
hundreds of uncommon proper names, so I ran with -c load_system_dawg=F
-c load_freq_dawg=F, but I noticed no difference with or without those
settings. I didn't want any dictionary-based decisions because I want to
keep all the original spellings.
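The full command was along these lines (the file names and custom model
name are illustrative):

    # Recognize one page with both dictionaries disabled
    tesseract page-042.png page-042 -l book1830 \
        -c load_system_dawg=F -c load_freq_dawg=F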
I ended up creating a word diff between the eng model's output and my
model's output, then manually reviewed the entire book visually. There
were over 6000 differences, including a lot of noise -- an average of
more than 10 changes to compare on every page, each of which I reviewed
and edited by hand. For example:

    cause of the greatness of the [-multitude; therefore,-]
    {+mutltitude; thercfore,+} he [-caused-] {+cansed+}
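Those are GNU wdiff's default [-old-] {+new+} markers; the diff came
from something like this (file names illustrative):

    # Word-level diff of the two OCR outputs
    wdiff book-eng.txt book-custom.txt > book.wdiff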
As I manually reviewed those, I found other problems that neither model
detected, which I fixed in my transcription. Then I compared with a
third-party transcription of the same book -- that was difficult because
I found over 100 mistakes in that transcription, and because books from
the 1800s may have in-press changes, where things are altered during the
printings, so copies of the same book may carry various corrections (I
found over 60) and damage (like type that fell out, or possibly ink
spots; I found many). I spent maybe a hundred hours on this.
No page was recognized perfectly, but on maybe 1% of the pages the eng
model and my custom model produced identical output (and even those
pages turned out to be wrong on manual review or comparison with the
other transcription).
Is there a simple way to do training page by page instead of cropping
out line by line? Now that I have transcriptions with images for nearly
600 pages, I'd like to train on all of that. (Then I may attempt to
transcribe some other 1800s books.)
I also used a tessedit_char_whitelist that contains only ASCII
characters plus an em dash (see the command sketch after the question
below). Still, I had many outputs like "Detected 523 diacritics", along
with much noise that took me hours to clean out manually. How can I get
tesseract to not output the content related to the "Detected ...
diacritics" message? It looks like the following:
i la p AV EU
E o a r a -
t
S rmi P V CF jim E
ar
DD pi-ay pa . f
If tesseract thinks something is a diacritic, is there a way to tell it
not to output it?
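For reference, the whitelist was set roughly like this (the exact
character set shown is illustrative, not my precise list):

    # Restrict recognizable characters to ASCII plus an em dash
    tesseract page-042.png page-042 -l book1830 \
        -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:'\"?!()-—"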
Another odd behaviour I saw is that it repeated many characters like:
"m" output as "mn" or "nm"
"h" as "hu"
"had" as "bhad"
"wn" as "whn"
I think tesseract decides a single character looks like two of the
candidate choices, which makes sense, but then it outputs both.
Anyway, thank you very much for your software. It has been quite
interesting to learn and use.
Jeremy