*Background*
I'm trying to use tesseract 5.3.3 on scanned old books written in Amharic 
(which uses Ethiopic script).

*Major Shortcomings of amh.traineddata from tesseract*

*Difference in the set of Ethiopic characters:* old Amharic texts contain 
Ethiopic script characters that are not present in the unicharset of 
amh.traineddata.

*Difference in punctuation styles:* the old texts use some punctuation marks 
that are not used in modern Amharic, and even for the marks that are still in 
use, the spacing pattern is different: the old texts always put a space 
between a punctuation mark and both the preceding and the following word, 
whereas in modern Amharic there is no space between a punctuation mark and 
the preceding word.

*Very narrow training_text & wordlist (based on tesseract/langdata_lstm)*
The amh.training_text & amh.wordlist files used by tesseract (the ones from 
langdata_lstm) are very small. To give you an idea: for tir.traineddata 
(another language that uses the Ethiopic script), the tir.training_text in 
langdata_lstm has more than 400,000 lines, while amh.training_text has only 
around 400 lines.

*Other challenges*

   - The old Amharic books use a font that is no longer in use (or no longer 
   available).
   - The old Amharic books contain many Ge'ez words (Ge'ez is a liturgical 
   language, comparable to Latin, that uses the Ethiopic script).
   - The old Amharic books mostly use Ge'ez numerals, while modern Amharic 
   texts use Arabic numerals.

*WHAT I'VE DONE SO FAR*
As an experiment I've tried to fine-tune amh.traineddata (the tessdata_best 
variant) using `make training`, with close to 300 line images and matching 
transcriptions (from sample pages of some old Amharic books) and with files 
from langdata_lstm, for 10,000 iterations.
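
For context, the kind of invocation I mean is roughly the following (a sketch 
using the standard tesstrain Makefile variables; the model name, paths and 
ground-truth directory are placeholders, not my exact setup):

  # Fine-tune the stock Amharic model on pairs of line images (*.tif/*.png)
  # and *.gt.txt transcriptions placed in GROUND_TRUTH_DIR.
  make training \
    MODEL_NAME=amh_old \
    START_MODEL=amh \
    TESSDATA=/path/to/tessdata_best \
    GROUND_TRUTH_DIR=data/amh_old-ground-truth \
    MAX_ITERATIONS=10000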

The resulting traineddata shows a very satisfactory improvement on some of 
the challenges mentioned above, especially the ones regarding punctuation.

But it still fails on the characters that are not present in the unicharset 
of amh.traineddata, and it fails for almost all Ge'ez numerals (even though 
the training sample pages contain many of them).

*WHAT I'M PLANNING TO DO*
First I want to train tesseract with large training_text and wordlist files 
plus a complete unicharset file, and then fine-tune the resulting traineddata 
on sample line images from the old books.
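
For the second (fine-tuning) step, my understanding from the tesseract 4/5 
training docs is that it comes down to extracting the LSTM network and 
continuing training with lstmtraining; a rough sketch, with all file names 
being placeholders:

  # Pull the LSTM network out of the traineddata produced in the first step.
  combine_tessdata -e output/amh_old.traineddata output/amh_old.lstm

  # Continue training on .lstmf files made from real old-book line images
  # (list.train is a placeholder for a text file listing those .lstmf files).
  # If the unicharset had changed relative to the model being continued,
  # --old_traineddata would also be needed to point at the old traineddata.
  lstmtraining \
    --continue_from output/amh_old.lstm \
    --traineddata output/amh_old.traineddata \
    --model_output finetune/amh_old \
    --train_listfile list.train \
    --max_iterations 3000

  # Package the best checkpoint back into a usable traineddata.
  lstmtraining --stop_training \
    --continue_from finetune/amh_old_checkpoint \
    --traineddata output/amh_old.traineddata \
    --model_output amh_old_finetuned.traineddata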

*QUESTIONS (for now. I'll definitely add more questions later)*
Is there another path I should take that would get me to where I want to go?

*Regarding training tesseract with large training_text & wordlist files, 
and also a complete unicharset file:*

   - How should I prepare the training_text & wordlist files? (What should 
   these text files contain?)
   - How should I prepare the unicharset file, and how do I pass it to the 
   `make training` command? (A sketch of one possible way to generate it 
   follows below.)
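
To make the unicharset part of the question concrete, my current 
understanding from the training docs is that a unicharset can be extracted 
from the training text itself and then packed, together with the wordlist, 
into a starter traineddata with combine_lang_model. A sketch with placeholder 
file names (and I'm not sure yet how this is best wired into `make training`):

  # Build a unicharset covering every character in the expanded training text.
  # The right --norm_mode value for Ethiopic should be checked against what
  # langdata/tesstrain uses for amh.
  unicharset_extractor \
    --output_unicharset amh_old.unicharset \
    --norm_mode 2 \
    amh_old.training_text

  # Combine the unicharset and wordlist into a starter traineddata;
  # --script_dir points at a langdata_lstm checkout (for the script-level
  # unicharsets).
  combine_lang_model \
    --input_unicharset amh_old.unicharset \
    --script_dir /path/to/langdata_lstm \
    --words amh_old.wordlist \
    --output_dir output \
    --lang amh_old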


*Regarding generating text, image (tif) and box files from the training_text:*

I've looked at python scripts for this job, but I have questions about the 
proper values for these text2image parameters:
--font (what criteria should I use to select the list of fonts?), 
--leading, --xsize, --ysize, --char_spacing, --exposure, --unicharset_file, 
and --margin.
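
For concreteness, the call I have in mind looks roughly like this (the font, 
paths and all numeric values are placeholders I'd still need to tune, not 
recommendations; Abyssinica SIL is just an example of an Ethiopic-capable 
font):

  # Render synthetic pages (tif + box) from the training text in one font.
  text2image \
    --text amh_old.training_text \
    --outputbase amh_old.AbyssinicaSIL.exp0 \
    --font "Abyssinica SIL" \
    --fonts_dir /usr/share/fonts \
    --unicharset_file amh_old.unicharset \
    --xsize 3600 --ysize 4800 \
    --leading 32 \
    --char_spacing 0.0 \
    --exposure 0 \
    --margin 100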

I've noticed from the tesstrain repo for tesseract 5 that the line images are 
tightly cropped (with minimal margin around the text line). Is the same 
property (minimal margins) required/desired for the line images generated 
with text2image from the training_text?

*THANKS FOR YOUR TIME !!!*
