The steps described in the [Tesseract 3.00-3.02 training guide](https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02) are not clear to me, and I could not find any clearer documentation.

The guide says that the following data files are required:

    tessdata/eng.config
    tessdata/eng.unicharset
    tessdata/eng.unicharambigs
    tessdata/eng.inttemp
    tessdata/eng.pffmtable
    tessdata/eng.normproto
    tessdata/eng.punc-dawg
    tessdata/eng.word-dawg
    tessdata/eng.number-dawg
    tessdata/eng.freq-dawg


But it does not explain what these files actually are or what format each one uses.

The language I am working on has its own official Unicode code-point range. 
My understanding is that it is covered by UTF-16 but not by UTF-8, though I 
may be wrong about that.
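
The encoding question can be checked directly in Python. As far as I know, UTF-8 and UTF-16 can both encode every Unicode code point, so a character from the script should round-trip through either one. This is only a minimal sketch: the helper name `roundtrips` is my own, and the sample character is a placeholder, not a character from my actual script.

```python
def roundtrips(ch):
    """Return, per encoding, whether `ch` survives an encode/decode cycle."""
    result = {}
    for encoding in ("utf-8", "utf-16"):
        encoded = ch.encode(encoding)
        result[encoding] = encoded.decode(encoding) == ch
    return result


# Placeholder code point (U+0985); substitute one from the target script.
sample = "\u0985"
print(hex(ord(sample)), roundtrips(sample))
```

If both entries come back `True`, the text files can be saved as UTF-8 without losing anything.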

From what I have understood so far:

*eng.word-dawg* : I need to create a text file *mylang.txt* with one word 
per line. The words will be in the language I am working on, written in its 
own letters. The file is then converted to a *dawg* file. I assume the 
command for that is 

    wordlist2dawg mylang.txt mylang.word-dawg
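
To make that assumption concrete, this is roughly how I would produce *mylang.txt* in Python before running the command above. It is only a sketch: the helper name `write_wordlist`, the choice to drop duplicates, and the UTF-8 output encoding are my own assumptions and do not come from the guide. The sample words are placeholders.

```python
from pathlib import Path


def write_wordlist(words, path):
    """Write unique, non-empty words to `path`, one per line, as UTF-8."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        # Skip blanks and words already written, keeping first-seen order.
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Placeholder words; the real list would come from text in the target language.
write_wordlist(["alpha", "beta", "alpha", " gamma "], "mylang.txt")
```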

*eng.number-dawg* : Create a text file *mylangnum.txt* with the numerical 
characters (0 to 9), one per line. Then convert it to 
*mylang.number-dawg*.


*eng.freq-dawg* : Same steps as for the *eng.word-dawg* file, but using only 
the most frequent words (these could be obtained, for example, by 
processing a dataset such as a newspaper corpus). The most frequent word 
goes on the first line (the frequency itself is not needed), the next most 
frequent word on the second line, and so on.
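
The frequency ordering I have in mind could be sketched like this in Python. The function name `most_frequent_words`, the whitespace tokenization, and the toy corpus are all my own assumptions; a real script would probably need tokenization suited to the language.

```python
from collections import Counter


def most_frequent_words(text, limit=None):
    """Return words ordered from most to least frequent (counts dropped)."""
    counts = Counter(text.split())
    return [word for word, _ in counts.most_common(limit)]


# Toy corpus standing in for a newspaper dataset.
corpus = "the cat saw the dog and the cat ran"
print(most_frequent_words(corpus, limit=2))  # ['the', 'cat']
```

The resulting list could then be written out one word per line and passed to the same conversion step as the word list.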

I don't know anything about the remaining 7 files.

Could someone please direct me to a better tutorial on adding a new language 
to Tesseract?

Or, verify my assumptions above and tell me about the remaining 7 files, 
*and how to proceed once I have all 10 files.*

*The steps* in the [Tesseract 3.00-3.02 training guide](https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02):

Generate Training Images
Make Box Files
Bootstrapping a new character set
Tif/Box pairs provided

are still a bit confusing to me.

I am working with Python on Ubuntu 16.04 LTS and Tesseract version 3.04.01 
(installed with sudo apt install tesseract-ocr, and it works perfectly 
for English).
I am new to this field, so sorry if I have made any mistakes. 

If I need to upgrade Tesseract to version 4 first, do I need to uninstall 
the previous version, or can I override it with some update command? (Will 
the PPA of alex-tesseract 4 work for overriding the version?)
*Thank you.*


