[tesseract-ocr] building tir.traineddata from scratch

Biniam Tue, 04 Aug 2020 20:20:11 -0700

The existing LSTM tir.traineddata covers only 272 characters, although the
language tir has over 350. I have a file that includes all the missing
characters, and I have a training text. I want to recreate tir.traineddata,
but I could not find the exact commands and parameters used to build it.
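
Before retraining, it may help to confirm that the training text really
contains every character in the full character list. The following is a
minimal sketch under stated assumptions, not part of the Tesseract tooling:
the function name and the sample strings are hypothetical, and the character
list is assumed to be UTF-8 text with entries separated by whitespace.

```python
from typing import List


def missing_characters(required: str, training_text: str) -> List[str]:
    """Return the required characters that never occur in the training text."""
    seen = set(training_text)
    # Ignore whitespace separators in the required list; dict.fromkeys keeps
    # first-seen order and drops duplicates.
    wanted = dict.fromkeys(ch for ch in required if not ch.isspace())
    return [ch for ch in wanted if ch not in seen]


if __name__ == "__main__":
    # Hypothetical sample data; in practice both strings would be read from
    # the character-list file and the training-text file.
    required = "ሀ ሁ ሂ ሃ"
    training_text = "ሀሁ ሀ"
    print(missing_characters(required, training_text))
```

Any character this reports as missing has no training sample at all, so it
would need to be added to the training text first.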

Basically, I want to know how to compile
https://github.com/tesseract-ocr/langdata_lstm/tree/master/tir to get
the same output as
https://github.com/tesseract-ocr/tessdata_best/blob/master/tir.traineddata

I followed the documentation at
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html to
train from scratch and came up with the set of commands shown here:
https://github.com/TigrinyaNLP/Tigrinya-tasseract-ocr/blob/master/bin/train_from_scrach.sh

But the final result is not that good. For example, I used
--max_iterations 50000 and --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48
Lfx96 Lrx96 Lfx256 O1c352]', but these parameters are copied from the eng
example and may not be a good fit for 'tir'. I would appreciate it if someone
could tell me which commands were used to build the tir.traineddata in
tessdata_best.
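
The last element of the net_spec above, O1c352, is the output layer, and its
trailing number is the declared class count. The following is a minimal
sketch for reading that number out of a spec string, so it can be compared
with the number of characters the model is expected to cover. The helper
name is hypothetical, and the sketch makes no assumption about whether
Tesseract uses this number or derives the count from the unicharset.

```python
import re


def output_classes(net_spec: str) -> int:
    """Return n from the final O<dims><type><n> element of a VGSL spec."""
    # Drop the surrounding brackets and split the spec into layer tokens.
    layers = net_spec.strip().strip("[]").split()
    if not layers:
        raise ValueError("empty net_spec")
    # The output layer looks like O1c352: dimensionality, type, class count.
    match = re.fullmatch(r"O[012][lsc](\d+)", layers[-1])
    if match is None:
        raise ValueError(f"no output layer found in {net_spec!r}")
    return int(match.group(1))


if __name__ == "__main__":
    spec = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c352]"
    print(output_classes(spec))
```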

I know I could fine-tune or just add the missing characters instead of
building from scratch, but I have more things to change (such as adding a
wordlist, fonts, and other improvements) that will improve the quality of
'tir' a lot. This language is not that big, so it should not be as big a
task as rebuilding 'eng'.

Thanks,
Biniam

