I used the main branch as of Oct 21 (last commit Oct 13) on a scan of a
book from 1830. Then I created 32 cropped PNG files with corresponding
ground-truth text in .gt.txt files and used "make training" with
START_MODEL=eng to create a new model.
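For reference, the tesstrain invocation was roughly like this (the model
name, paths, and iteration count are illustrative, not my exact values):

    # Line images and their transcriptions paired up in the
    # ground-truth directory, e.g.:
    #   data/book1830-ground-truth/page001-line01.png
    #   data/book1830-ground-truth/page001-line01.gt.txt
    make training MODEL_NAME=book1830 START_MODEL=eng \
        TESSDATA=/usr/local/share/tessdata MAX_ITERATIONS=10000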
The book has over 200 unique misspellings or archaic spellings and
hundreds of uncommon proper names, so I ran with -c load_system_dawg=F
-c load_freq_dawg=F, but I noticed no difference with or without those
settings. I didn't want any dictionary-based decisions because I want to
keep all the original spellings.
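The full command was along these lines (the file names and custom model
name are illustrative):

    # Recognize one page with both dictionaries disabled
    tesseract page-042.png page-042 -l book1830 \
        -c load_system_dawg=F -c load_freq_dawg=F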
I ended up creating a word diff between the eng model's output and my
model's output, then manually reviewed the entire book visually. There
were over 6000 differences, including a lot of noise -- an average of
more than 10 changes to compare on every page, each of which I reviewed
and edited by hand. For example:

    cause of the greatness of the [-multitude; therefore,-]
    {+mutltitude; thercfore,+} he [-caused-] {+cansed+}
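Those are GNU wdiff's default [-old-] {+new+} markers; the diff came
from something like this (file names illustrative):

    # Word-level diff of the two OCR outputs
    wdiff book-eng.txt book-custom.txt > book.wdiff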
As I manually reviewed those, I found other problems that neither model
detected, which I fixed in my transcription. Then I compared with a
third-party transcription of the same book -- that was difficult because
I found over 100 mistakes in that transcription, and because books from
the 1800s may have in-press changes, where things are altered during the
printings, so copies of the same book may carry various corrections (I
found over 60) and damage (like type that fell out, or possibly ink
spots; I found many). I spent maybe a hundred hours on this.
No page was recognized perfectly, but on maybe 1% of the pages the eng
model and my custom model produced identical output (and even those
pages turned out to be wrong on manual review or comparison with the
other transcription).
Is there a simple way to do training page by page instead of cropping
out line by line? Now that I have transcriptions with images for nearly
600 pages, I'd like to train on all of that. (Then I may attempt to
transcribe some other 1800s books.)
I also used a tessedit_char_whitelist that contains only ASCII
characters plus an em dash (see the command sketch after the question
below). Still, I had many outputs like "Detected 523 diacritics", along
with much noise that took me hours to clean out manually. How can I get
tesseract to not output the content related to the "Detected ...
diacritics" message? It looks like the following:
i la p AV EU
E o a r a -
t
S rmi P V CF jim E
ar
DD pi-ay pa . f
If tesseract thinks something is a diacritic, is there a way to tell it
not to output it?
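For reference, the whitelist was set roughly like this (the exact
character set shown is illustrative, not my precise list):

    # Restrict recognizable characters to ASCII plus an em dash
    tesseract page-042.png page-042 -l book1830 \
        -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:'\"?!()-—"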
Another odd behaviour I saw is that it repeated many characters like:
"m" output as "mn" or "nm"
"h" as "hu"
"had" as "bhad"
"wn" as "whn"
I think tesseract decides a single character looks like two of the
candidate choices, which makes sense, but then it outputs both.
Anyway, thank you very much for your software. It has been quite
interesting to learn and use.
Jeremy