[tesseract-ocr] Training tesseract for a new language from scratch that does not exist in Tesseract

haruo195k Thu, 28 Mar 2019 11:33:03 -0700

The steps described for [Tesseract 3.0-3.02][
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
] are not clear to me, and I could not find any clearer documentation.

The page says that the following data files are required:

tessdata/eng.config
tessdata/eng.unicharset
tessdata/eng.unicharambigs
tessdata/eng.inttemp
tessdata/eng.pffmtable
tessdata/eng.normproto
tessdata/eng.punc-dawg
tessdata/eng.word-dawg
tessdata/eng.number-dawg
tessdata/eng.freq-dawg

But it does not explain what their formats are or what they actually contain.

The language I am working on is not included in UTF-8, but is in UTF-16,
though it has its own official Unicode code-point range.
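
I have not verified that, though. As far as I know, UTF-8 and UTF-16 are both encodings of the full Unicode range, so any character with an assigned code point should round-trip through either one. A minimal sketch in Python of how I could check this and convert my text; the function name is my own, and the sample code point U+11A00 is only a placeholder from a supplementary plane, not my actual script:

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes (with a byte-order mark) as UTF-8."""
    return data.decode("utf-16").encode("utf-8")


# Placeholder character outside the Basic Multilingual Plane (assumption).
sample = chr(0x11A00)

# Python's "utf-16" codec writes a BOM when encoding and reads it when decoding.
utf16_bytes = sample.encode("utf-16")
utf8_bytes = utf16_to_utf8(utf16_bytes)

# True if the character survives the UTF-16 -> UTF-8 conversion unchanged.
print(utf8_bytes.decode("utf-8") == sample)
```

If this prints True for characters from my script, the training text could be saved as UTF-8 before going any further.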

From what I have understood so far:

*eng.word-dawg* : I need to create a text file *mylang.txt* with one word
per line. The words, and the letters in them, will be in the language I am
working on. The file is then converted to a *dawg* file. I assume the
command for that is

wordlist2dawg mylang.txt mylang.word-dawg
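
To prepare the input for that command, a rough sketch in Python of how I would write the word list (one unique word per line, saved as UTF-8). The helper names are my own. The wiki's usage line, as far as I can tell, also passes the unicharset file as a third argument, which my command above leaves out, so `dawg_command` includes it as an assumption:

```python
import os
import tempfile
from pathlib import Path


def write_wordlist(words, path):
    """Write unique, non-empty words to `path`, one per line, as UTF-8."""
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    Path(path).write_text("".join(w + "\n" for w in unique), encoding="utf-8")
    return unique


def dawg_command(wordlist, dawg, unicharset):
    """Build the argument list only; nothing is executed here.

    Assumption: wordlist2dawg takes the unicharset as its third argument.
    """
    return ["wordlist2dawg", str(wordlist), str(dawg), str(unicharset)]


# Small demonstration in a temporary directory (file names are examples).
with tempfile.TemporaryDirectory() as tmp:
    wordlist_path = os.path.join(tmp, "mylang.txt")
    written = write_wordlist(["alpha", "beta ", "alpha", ""], wordlist_path)
    file_text = Path(wordlist_path).read_text(encoding="utf-8")

print(written)
print(dawg_command("mylang.txt", "mylang.word-dawg", "mylang.unicharset"))
```

The returned list could be passed to `subprocess.run` once the unicharset file exists.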

*eng.number-dawg* : Create a text file *mylangnum.txt* with the numeric
characters (0 to 9), one per line. Then convert it to
*mylang.number-dawg*

*eng.freq-dawg* : Same steps as for the *eng.word-dawg* file, but with only
the most frequent words (these could be extracted, for example, by
processing a corpus such as a newspaper dataset). The most frequent word
goes on the first line, the next most frequent on the second line, and so
on. The frequencies themselves are not needed.
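
A minimal sketch in Python of how I imagine building that ordered list from a corpus. It assumes words are separated by whitespace, which may not hold for every script, and the function name is my own:

```python
from collections import Counter


def frequent_words(text, limit=None):
    """Return words ordered from most to least frequent, counts dropped."""
    counts = Counter(text.split())
    return [word for word, _ in counts.most_common(limit)]


# Tiny stand-in for a real corpus such as a newspaper dataset.
corpus = "b a b c a b"
top_words = frequent_words(corpus, limit=2)
print("\n".join(top_words))
```

The resulting lines could then be written to a file and converted in the same way as the full word list.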

I don't know anything about the remaining 7 files.

Could someone please direct me to a better tutorial on adding a new
language to Tesseract?

Or verify my assumptions above and tell me about the remaining 7 files,
*and how to proceed once I have all 10 files.*

*The steps* listed in [Tesseract 3.0-3.02][
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
]:

Generate Training Images
Make Box Files
Bootstrapping a new character set
Tif/Box pairs provided

are still a bit confusing to me.
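
For the "Make Box Files" step, my understanding is that each line of a box file holds one character followed by its bounding box (left, bottom, right, top, measured from the bottom-left corner of the image) and a page number. A sketch in Python of reading such a line, assuming that layout; the `Box` type and the function name are my own:

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line of the assumed form
    '<symbol> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(parts), line))
    symbol, *numbers = parts
    return Box(symbol, *(int(n) for n in numbers))


box = parse_box_line("s 734 494 751 519 0")
print(box.symbol, box.right - box.left, box.top - box.bottom)
```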

I am working with Python on Ubuntu 16.04 LTS, with Tesseract version
3.04.01 (installed with sudo apt install tesseract-ocr, and working
perfectly for English).
I am new to this field, so sorry if I have made any mistakes.

If I am required to upgrade Tesseract to version 4 first, do I need to
uninstall the previous version, or can I override it with some update
command? (Will the PPA of alex-tesseract 4 work for overriding the
version?)
*Thank you.*
