*Background* I'm trying to use tesseract 5.3.3 on scanned old books written in Amharic (which uses Ethiopic script).
*Major shortcomings of amh.traineddata from Tesseract*

*Difference in type of Ethiopic script:* Old Amharic texts contain Ethiopic characters that are missing from the unicharset of amh.traineddata.

*Difference in punctuation styles:* The old texts use some punctuation marks that modern Amharic does not use. For the marks that are still in use, the spacing pattern differs: the old texts always put a space between a punctuation character and both the preceding and the following word, while modern texts put no space between the punctuation character and the preceding word.

*Very narrow training_text & wordlist (based on tesseract/langdata_lstm):* The amh.training_text and amh.wordlist files from langdata_lstm are very small. To give you an idea: for tir.traineddata (another language that uses Ethiopic script), the tir.training_text from langdata_lstm has more than 400,000 lines, while amh.training_text has only around 400 lines.

*Other challenges*
- The old Amharic books use a font that is no longer in use (or available).
- The old Amharic books contain many Ge'ez words (Ge'ez is a liturgical language, like Latin, that uses Ethiopic script).
- The old Amharic books mostly use Ge'ez numerals, while modern Amharic texts use Arabic numerals.

*WHAT I'VE DONE SO FAR*

As an experiment, I fine-tuned amh.traineddata_best (using `make training`) for 10,000 iterations with close to 300 line images and their transcriptions (from sample pages of some old Amharic books), together with the files from langdata_lstm. The resulting traineddata is a very satisfactory improvement on some of the challenges above, especially the punctuation characters. But it still fails on the characters that are not present in the unicharset of amh.traineddata, and it fails on almost all Ge'ez numerals (even though the training sample pages contain many of them).
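To make the unicharset gap concrete, here is a minimal Python sketch that lists which characters in the ground-truth transcriptions are absent from a unicharset. It assumes the usual unicharset text layout (first line is the entry count, every later line starts with the symbol) and treats every symbol as a single code point, which is a simplification. The function names are my own and not part of Tesseract or tesstrain.

```python
from collections import Counter


def unicharset_symbols(unicharset_text):
    """Collect the symbols listed in the text of a unicharset file.

    Assumption: line 1 holds the entry count, and each following line
    begins with the symbol, followed by whitespace-separated properties.
    """
    symbols = set()
    for line in unicharset_text.splitlines()[1:]:
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols


def missing_characters(ground_truth_lines, symbols):
    """Count ground-truth characters that the unicharset does not list.

    Whitespace is skipped. Each character is checked on its own, so
    multi-code-point unicharset entries are not handled (simplification).
    """
    counts = Counter()
    for line in ground_truth_lines:
        for ch in line:
            if not ch.isspace() and ch not in symbols:
                counts[ch] += 1
    return counts
```

In practice the unicharset text could come from unpacking the model (for example with `combine_tessdata -u amh.traineddata amh.`, which should write an `amh.lstm-unicharset` file), and the ground-truth lines from the `.gt.txt` files used for fine-tuning. Sorting the result with `missing_characters(...).most_common()` would show whether the Ge'ez numerals and the older characters are indeed the ones missing.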
*WHAT I'M PLANNING TO DO*

First I want to train Tesseract with large training_text and wordlist files and a complete unicharset file. Then I want to fine-tune the resulting traineddata on sample line images from the old books.

*QUESTIONS (for now; I'll definitely add more later)*

Is there another path I should take that would get me to where I want to be?

*Regarding training Tesseract with large training_text & wordlist files, and also a complete unicharset file:*
- How should I prepare the training_text and wordlist files? (What should the text files contain?)
- How should I prepare the unicharset file, and how do I pass it to the `make training` command?

*Regarding generating text, image (tif) and box files from training_text:*

I've looked at Python scripts that do this job, but I have questions about the proper values for these text2image parameters: --font (what criteria should I use to select the list of fonts?), --leading, --xsize, --ysize, --char_spacing, --exposure, --unicharset_file and --margin.

I've noticed in the tesstrain repo for Tesseract 5 that the line images are tightly cropped (with a minimal margin around the text line). Is the same property (minimal margins) required or desired for line images generated from the training_text with text2image?

*THANKS FOR YOUR TIME!*