Re: [tesseract-ocr] Advice on training for Old Amharic texts

Dellu Bw Sat, 13 Jan 2024 01:49:33 -0800

I spent some time trying to improve the default Amharic model. The default
model is missing a couple of characters. As I have noted in many posts on
this forum, training by removing the top layer is the best method for
introducing new characters.
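
For reference, here is a rough Python sketch of what I mean by replacing
the top layer. It only assembles an `lstmtraining` command line. The flags
are the ones shown in the Tesseract LSTM training documentation; the helper
name, the file paths, the append index of 5, the layer width of 256 and the
class count are placeholders I am assuming for illustration. They are not
values verified for the Amharic model.

```python
import shlex
from typing import List


def replace_top_layer_cmd(
    extracted_lstm: str,
    new_traineddata: str,
    list_file: str,
    model_output: str,
    num_classes: int,
    append_index: int = 5,       # assumed cut point, check it for your model
    max_iterations: int = 10000,
) -> List[str]:
    """Build an lstmtraining command that keeps the layers below
    append_index and trains a new top sized for the new unicharset."""
    # Assumed replacement spec: one LSTM layer plus a softmax output layer.
    net_spec = f"[Lfx256 O1c{num_classes}]"
    return [
        "lstmtraining",
        "--continue_from", extracted_lstm,
        "--traineddata", new_traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", model_output,
        "--train_listfile", list_file,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # All paths below are hypothetical.
    cmd = replace_top_layer_cmd(
        "amh.lstm", "amh_old/amh_old.traineddata",
        "amh_old.training_files.txt", "output/amh_old", num_classes=400,
    )
    print(shlex.join(cmd))
```

The `.lstm` file would come from unpacking the existing model first, and
`num_classes` has to match the size of the new unicharset.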


But I really struggled, because the training was degrading the base
(default) model. I am also short of processing power.
Tesseract 5.3 also has some flaws that make it hard to use in
third-world countries (electric blackouts).

Dear Menelik, we might need to put our hands together on this.

On Sat, Jan 13, 2024, 11:21 AM Menelik Berhan <menelikber...@gmail.com>
wrote:

> *Background*
> I'm trying to use tesseract 5.3.3 on scanned old books written in Amharic
> (which uses Ethiopic script).
>
> *Major Shortcomings of amh.traineddata from tesseract*
>
> *Difference in type of Ethiopic script:* there are Ethiopic script
> characters in old Amharic texts that are not used in the unicharset of
> amh.traineddata.
>
> *Difference in punctuation styles:* the old texts use some punctuation
> marks not used in modern Amharic, and for some that are also used in
> modern Amharic, the old texts follow a different pattern (mostly in the
> spacing between word and punctuation character: the old texts always put
> a space between punctuation characters and both the preceding and
> following words, whereas in modern texts there is no space between these
> punctuation characters and the preceding word).
>
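
One way to make a modern training text follow the old spacing pattern
described above is to pad the punctuation with spaces before rendering it.
This is only a sketch under my own assumptions: it treats U+1362 to U+1368
as the punctuation set and leaves the Ethiopic wordspace (U+1361) alone.
The function name is made up.

```python
import re

# Ethiopic punctuation from U+1362 (full stop) to U+1368 (paragraph
# separator). U+1361 (wordspace) is left out on purpose; that is an
# assumption and may not suit every book.
_ETHIOPIC_PUNCT = re.compile(r"[ \t]*([\u1362-\u1368])[ \t]*")


def old_style_spacing(line: str) -> str:
    """Put exactly one space before and after each punctuation mark."""
    spaced = _ETHIOPIC_PUNCT.sub(r" \1 ", line)
    # Collapse the double spaces produced by adjacent punctuation marks.
    return re.sub(r" {2,}", " ", spaced).strip()
```

Apply it line by line to the training text, so line breaks are preserved.
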
> *Very narrow training_text & wordlist (based on tesseract/langdata_lstm)*
> The amh.training_text & amh.wordlist text files used by tesseract (the ones
> from langdata_lstm) are very small. (To give you an idea: for
> tir.traineddata (another language that uses Ethiopic script) the
> tir.training_text from langdata_lstm has more than 400,000 lines, while
> amh.training_text has only around 400 lines.)
>
> *Other challenges*
>
>    - The old Amharic books use a font that's not in use (or available).
>    - The old Amharic books contain many Ge'ez words (a liturgical
>    language, like Latin, which uses Ethiopic script).
>    - The old Amharic books mostly use Ge'ez numbers, while modern Amharic
>    texts use Arabic numbers.
>
> *WHAT I'VE DONE SO FAR*
> As an experiment I've tried to fine tune amh.traineddata_best (using `make
> training`) with close to 300 line images & texts (from sample pages of some
> old Amharic books) and using files from langdata_lstm (for 10,000
> iterations).
>
> The resulting traineddata has a very satisfactory improvement in
> addressing some of the challenges mentioned above, especially those
> regarding punctuation chars.
>
> But it still fails to solve the problems I have with some characters (the
> ones not present in the unicharset of amh.traineddata) and fails for almost
> all Ge'ez numbers (even though the training sample pages have many Ge'ez
> numbers).
>
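
To see exactly which characters are affected, it may help to compare the
ground-truth text against the model's unicharset. The sketch below assumes
the unicharset has already been unpacked from amh.traineddata (for example
with `combine_tessdata -u`) and that each line after the first starts with
the symbol followed by a space. The function names are my own, and the
Ge'ez numeral range U+1369 to U+137C is taken from the Unicode Ethiopic
block.

```python
from typing import Set

# Ge'ez numerals: U+1369 (digit one) through U+137C (ten thousand).
GEEZ_NUMERALS = {chr(cp) for cp in range(0x1369, 0x137D)}


def unicharset_symbols(unicharset_text: str) -> Set[str]:
    """Return the first field of every entry line (the symbol itself).
    The first line of the file is the entry count, so it is skipped."""
    lines = unicharset_text.splitlines()
    return {ln.split(" ", 1)[0] for ln in lines[1:] if ln.strip()}


def missing_chars(ground_truth: str, unicharset_text: str) -> Set[str]:
    """Characters used in the ground truth that the unicharset lacks.
    Works per code point, which is fine for Ethiopic syllables but
    ignores multi-character unicharset entries."""
    known = unicharset_symbols(unicharset_text)
    return {ch for ch in ground_truth if not ch.isspace() and ch not in known}
```

Intersecting the result with `GEEZ_NUMERALS` shows how many of the missing
symbols are numbers.
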
> *WHAT I'M PLANNING TO DO*
> First I want to train tesseract with large training_text & wordlist
> files and a complete unicharset file.
> Then I want to fine-tune the resulting traineddata on sample line images
> from the old books.
>
> *QUESTIONS (for now. I'll definitely add more questions later)*
> Is there another path I should take that would get me to where I want?
>
> *Regarding training tesseract with large training_text & wordlist files,
> and also a complete unicharset file:*
>
>    - How should I prepare the training_text & wordlist files? (What
>    should the text files contain?)
>    - How should I prepare the unicharset file, and how do I pass it to
>    the `make training` command?
>
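
On the wordlist part of the question: as far as I know the langdata
wordlists are plain text with one word per line. A minimal sketch for
deriving one from a larger corpus is below. Ordering by frequency, the
`min_count` threshold, and stripping Ethiopic punctuation (U+1361 to
U+1368) from token edges are my own assumptions, not a documented
requirement.

```python
from collections import Counter
from typing import Iterable, List

# Ethiopic wordspace and punctuation, stripped from the edges of tokens.
_EDGE_PUNCT = "".join(chr(cp) for cp in range(0x1361, 0x1369))


def build_wordlist(lines: Iterable[str], min_count: int = 1) -> List[str]:
    """Return unique words, most frequent first, one entry per word."""
    counts: Counter = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(_EDGE_PUNCT)
            if word:
                counts[word] += 1
    return [w for w, c in counts.most_common() if c >= min_count]
```

Writing the result with `"\n".join(...)` gives a file in the same shape as
amh.wordlist.
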
>
> *Regarding generating a text, image(tif) and box file from training_text:*
>
> I've looked up Python scripts to do this job, but I have questions about
> the proper values for these params in text2image:
> --font (what criteria should I use to select the list of fonts),
> --leading, --xsize, --ysize, --char_spacing, --exposure, --unicharset_file
> and --margin.
>
> I've noticed from tesstrain repo for tesseract 5 that the line images are
> tightly cropped (with minimal margin around text line). Is the same
> property (minimal margins) required/desired of the line images generated
> using text2image from the training_text?
>
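
To make the parameters above concrete, here is a small sketch that builds
a text2image command from them. The flag names are the ones listed in the
question plus `--text`, `--outputbase`, `--fonts_dir` and `--ptsize`. Every
default value in the signature is a placeholder I picked for illustration,
not a tuned recommendation, and the helper name is made up.

```python
import shlex
from typing import List, Optional


def text2image_cmd(
    text_file: str,
    outputbase: str,
    font: str,
    fonts_dir: str,
    unicharset_file: Optional[str] = None,
    *,
    ptsize: int = 12,           # placeholder values from here on
    leading: int = 32,
    xsize: int = 3600,
    ysize: int = 480,
    char_spacing: float = 0.0,
    exposure: int = 0,
    margin: int = 100,
) -> List[str]:
    """Assemble one text2image call for a single font."""
    cmd = [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--ptsize={ptsize}",
        f"--leading={leading}",
        f"--xsize={xsize}",
        f"--ysize={ysize}",
        f"--char_spacing={char_spacing}",
        f"--exposure={exposure}",
        f"--margin={margin}",
    ]
    if unicharset_file is not None:
        cmd.append(f"--unicharset_file={unicharset_file}")
    return cmd


if __name__ == "__main__":
    # Font name and paths are examples only.
    print(shlex.join(text2image_cmd(
        "amh.training_text", "out/amh.exp0", "Abyssinica SIL",
        "/usr/share/fonts")))
```

Looping this over several fonts and a few exposure values gives one
tif/box pair per combination.
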
> *THANKS FOR YOUR TIME !!!*
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com
> .
>

