Thanks for your swift reply. It would be my pleasure to collaborate with you.
I've noticed that there are extensive guides and tutorials on training tesseract 4.x, so I'm considering switching to 4.x. What would be the trade-offs of using tesseract 4.x instead of 5.x? Thanks for your time!

On Saturday, January 13, 2024 at 12:49:36 PM UTC+3 elvi...@gmail.com wrote:

> I spent some time trying to improve the default model for Amharic. The
> default model has a couple of characters missing. As I have noted in many
> posts in this forum, training by removing the top layer is the best method
> to introduce new characters.
>
> But I really struggled because the training kept deteriorating the base
> (default) model. I also have a shortage of processing power, and
> Tesseract 5.3 has some flaws that make it hard to use in third-world
> countries (electric blackouts).
>
> Dear Menelik, we might need to put our hands together on this.
>
> On Sat, Jan 13, 2024, 11:21 AM Menelik Berhan <meneli...@gmail.com> wrote:
>
>> *Background*
>> I'm trying to use tesseract 5.3.3 on scanned old books written in Amharic
>> (which uses the Ethiopic script).
>>
>> *Major shortcomings of amh.traineddata from tesseract*
>>
>> *Difference in type of Ethiopic script:* there are Ethiopic script
>> characters in the old Amharic texts that are not in the unicharset of
>> amh.traineddata.
>>
>> *Difference in punctuation styles:* the old texts use some punctuation
>> not found in modern Amharic, and even for punctuation that is still in
>> use, the old texts follow a different spacing pattern: they always put a
>> space between a punctuation character and both the preceding and the
>> following word, while modern texts put no space between a punctuation
>> character and the preceding word.
>>
>> *Very narrow training_text & wordlist (based on tesseract/langdata_lstm)*
>> The amh.training_text and amh.wordlist files used by tesseract (the ones
>> from langdata_lstm) are very small. (To give you an idea: for
>> tir.traineddata, another language that uses the Ethiopic script, the
>> tir.training_text from langdata_lstm has more than 400,000 lines, while
>> amh.training_text has only around 400 lines.)
>>
>> *Other challenges*
>>
>> - The old Amharic books use a font that is no longer in use (or
>> available).
>> - The old Amharic books contain many Ge'ez words (Ge'ez is a liturgical
>> language, like Latin, that uses the Ethiopic script).
>> - The old Amharic books mostly use Ge'ez numbers, while modern Amharic
>> texts use Arabic numbers.
>>
>> *WHAT I'VE DONE SO FAR*
>> As an experiment I've tried to fine-tune amh.traineddata_best (using
>> `make training`) with close to 300 line images and transcriptions (from
>> sample pages of some old Amharic books), using files from langdata_lstm,
>> for 10,000 iterations.
>>
>> The resulting traineddata shows a very satisfactory improvement on some
>> of the challenges mentioned above, especially those regarding punctuation
>> characters.
>>
>> But it still fails on the characters that are not present in the
>> unicharset of amh.traineddata, and it fails for almost all Ge'ez numbers
>> (even though the training sample pages contain many Ge'ez numbers).
>>
>> *WHAT I'M PLANNING TO DO*
>> First, train tesseract with large training_text and wordlist files and a
>> complete unicharset file; then fine-tune the resulting traineddata on
>> sample line images from the old books.
>>
>> *QUESTIONS (for now; I'll definitely add more later)*
>> Is there another path I should take that would get me to where I want?
>>
>> *Regarding training tesseract with large training_text & wordlist files
>> and a complete unicharset file:*
>>
>> - How do I prepare the training_text and wordlist files? (What should
>> these text files contain?)
>> - How do I prepare the unicharset file, and how do I pass it to the
>> `make training` command?
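
On the wordlist and unicharset preparation questions, a minimal sketch of one possible workflow (the file names follow the langdata_lstm convention; the `unicharset_extractor` flags and the tesstrain variable names are taken from their docs and should be verified against your checkout):

```shell
# Toy stand-in for a real amh.training_text (the real one would come from a
# large Amharic corpus plus lines covering the old books' characters).
printf 'ሰላም ለዓለም ሰላም\nለዓለም ቤት\n' > amh.training_text

# A wordlist is simply one unique word per line, so a first cut can be
# derived from the training text itself:
tr -s '[:space:]' '\n' < amh.training_text | grep -v '^$' | sort -u > amh.wordlist

# unicharset_extractor (shipped with tesseract's training tools) can read
# plain text files and emit every character it finds. --norm_mode 2 is an
# assumption here; check what langdata uses for Ethiopic:
# unicharset_extractor --output_unicharset amh.unicharset --norm_mode 2 amh.training_text

# tesstrain's makefile then picks these files up by naming convention;
# variable names below are from the tesstrain README:
# make training MODEL_NAME=amh START_MODEL=amh MAX_ITERATIONS=10000
```

The commented-out lines are left as commands to run once the training tools are installed; only the wordlist step runs as-is.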
>>
>> *Regarding generating text, image (tif) and box files from
>> training_text:*
>>
>> I've looked at python scripts that do this job, but I have questions
>> about the proper values for these text2image parameters:
>> --font (what criteria should I use to select the list of fonts?),
>> --leading, --xsize, --ysize, --char_spacing, --exposure,
>> --unicharset_file and --margin.
>>
>> I've noticed from the tesstrain repo for tesseract 5 that the line
>> images are tightly cropped (with minimal margin around the text line).
>> Is the same property (minimal margins) required/desired of the line
>> images generated with text2image from the training_text?
>>
>> *THANKS FOR YOUR TIME !!!*
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/9bda9bc4-b07a-491b-b8fc-fbb25b54c368n%40googlegroups.com
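
For reference, the text2image parameters asked about above might be combined along these lines. This is only a sketch: the font name and every numeric value are assumptions to tune against the source books, not recommended settings.

```shell
# Render each line of amh.training_text into a tiff/box pair.
# "Abyssinica SIL" is an assumed Ethiopic-capable font placed under ./fonts;
# leading/xsize/ysize/char_spacing/exposure/margin values are starting guesses.
text2image \
  --text=amh.training_text \
  --outputbase=amh.AbyssinicaSIL.exp0 \
  --font='Abyssinica SIL' \
  --fonts_dir=./fonts \
  --unicharset_file=amh.unicharset \
  --leading=32 \
  --xsize=3600 --ysize=480 \
  --char_spacing=0.0 \
  --exposure=0 \
  --margin=12
```

A small --margin gets closer to the tightly cropped line images tesstrain expects, but the right value is best found by eyeballing a few rendered lines against real scans.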