I spent some time trying to improve the default Amharic model. The default model is missing a couple of characters. As I have noted in many posts in this forum, training by removing the top layer is the best method to introduce new characters.
But I really struggled because the training deteriorates the base (default) model. I am also short of processing power. Tesseract 5.3 also has some flaws that make it hard to use in third-world countries (electric blackouts).

Dear Menilik, we might need to put our hands together on this.

On Sat, Jan 13, 2024, 11:21 AM Menelik Berhan <menelikber...@gmail.com> wrote:

> *Background*
> I'm trying to use tesseract 5.3.3 on scanned old books written in Amharic
> (which uses Ethiopic script).
>
> *Major Shortcomings of amh.traineddata from tesseract*
>
> *Difference in type of Ethiopic script:* there are Ethiopic script
> characters in old Amharic texts that are not in the unicharset of
> amh.traineddata.
>
> *Difference in punctuation styles:* the old texts use some punctuation
> marks not used in modern Amharic, and for some that are used in modern
> Amharic, the old texts follow a different pattern (mostly the space between
> word and punctuation character: the old texts always put a space between
> punctuation chars and both the preceding and following words, while in
> modern texts there is no space between these punctuation chars and the
> preceding word).
>
> *Very narrow training_text & wordlist (based on tesseract/langdata_lstm)*
> The amh.training_text & amh.wordlist text files used by tesseract (the ones
> from langdata_lstm) are very small. (To give you an idea: for
> tir.traineddata (another language which uses Ethiopic script) the
> tir.training_text from langdata_lstm has more than 400,000 lines, while the
> amh.training_text has only around 400 lines.)
>
> *Other challenges*
>
> - The old Amharic books use a font that's not in use (or available).
> - The old Amharic books contain many Ge'ez words (a liturgical
>   language, like Latin, which uses Ethiopic script).
> - The old Amharic books mostly use Ge'ez numbers, while modern Amharic
>   texts use Arabic numbers.
>
> *WHAT I'VE DONE SO FAR*
> As an experiment I've tried to fine tune amh.traineddata_best (using `make
> training`) with close to 300 line images & texts (from sample pages of some
> old Amharic books) and using files from langdata_lstm (for 10,000
> iterations).
>
> The resulting traineddata has a very satisfactory improvement in
> addressing some of the challenges mentioned above, especially those
> regarding punctuation chars.
>
> But it still fails to solve the problems I have with some characters (the
> ones not present in the unicharset of amh.traineddata) and fails for almost
> all Ge'ez numbers (even though the training sample pages have many Ge'ez
> nums).
>
> *WHAT I'M PLANNING TO DO*
> First I want to train tesseract with large training_text & wordlist
> files, and also a complete unicharset file.
> Then fine tune the resulting traineddata based on sample line images from
> the old books.
>
> *QUESTIONS (for now. I'll definitely add more questions later)*
> Is there another path I should take that would get me to where I want?
>
> *Regarding training tesseract with large training_text & wordlist files,
> and also a complete unicharset file:*
>
> - How to prepare the training_text & wordlist files? (What the text
>   files should contain)
> - How to prepare the unicharset file, and also how to pass it to the
>   `make training` command?
>
> *Regarding generating a text, image (tif) and box file from training_text:*
>
> I've looked up python scripts to do this job, but have questions about the
> proper values for these params in text2image:
> --font (what criteria should I use to select the list of fonts),
> --leading, --xsize, --ysize, --char_spacing, --exposure, --unicharset_file
> and --margin.
>
> I've noticed from the tesstrain repo for tesseract 5 that the line images are
> tightly cropped (with minimal margin around the text line). Is the same
> property (minimal margins) required/desired of the line images generated
> using text2image from the training_text?
>
> *THANKS FOR YOUR TIME !!!*
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com
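
Regarding the characters that are not present in the unicharset of amh.traineddata: before retraining, it may help to list exactly which characters of the old texts are missing. Below is a minimal Python sketch, not an official tool. It assumes the unicharset has already been extracted to a plain text file (for example by unpacking the traineddata with `combine_tessdata -u`), that its first line is the entry count, and that every following line starts with the unichar followed by a space. The function names `parse_unicharset` and `missing_chars` and the abbreviated sample data are made up for illustration. It also assumes one entry per code point, which should hold for Ethiopic syllables and Ge'ez numerals but not for every script.

```python
def parse_unicharset(contents):
    """Return the set of unichars listed in the text of a unicharset file.

    Assumption: line 1 holds the number of entries, and each later line
    begins with the unichar itself, followed by a space and its properties.
    """
    unichars = set()
    for line in contents.splitlines()[1:]:
        if line.strip():
            # Keep only the first space-separated field (the unichar).
            unichars.add(line.split(" ", 1)[0])
    return unichars


def missing_chars(text, unichars):
    """Return the sorted characters of `text` that have no unicharset entry.

    Whitespace is skipped because the unicharset does not list it as a
    literal character.
    """
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})


# Abbreviated, made-up sample; a real unicharset line carries more properties.
sample_unicharset = "3\nNULL 0 Common 0\nሀ 1\nለ 1\n"
sample_text = "ሀለ ሐ ፩"

print(missing_chars(sample_text, parse_unicharset(sample_unicharset)))
```

With real data, the `contents` argument would be the text of the extracted unicharset file and `text` would be the ground-truth transcriptions of the old book pages. The resulting list shows which characters the training text and the new unicharset still have to cover.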