The steps described for [Tesseract 3.00-3.02][
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
] are not clear to me, and I could not find any clear documentation about them.
That page says the following data files are required:
tessdata/eng.config
tessdata/eng.unicharset
tessdata/eng.unicharambigs
tessdata/eng.inttemp
tessdata/eng.pffmtable
tessdata/eng.normproto
tessdata/eng.punc-dawg
tessdata/eng.word-dawg
tessdata/eng.number-dawg
tessdata/eng.freq-dawg
But it does not explain what these files actually are or what format each one uses.
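To keep track of which of these ten files I already have for my language, I could use something like the sketch below. It only checks file names, not contents. The directory name "tessdata" and the prefix "mylang" are placeholders of mine, not something taken from the wiki.

```python
import os

# Component suffixes copied from the list of required files above.
COMPONENTS = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]


def missing_components(tessdata_dir, lang):
    """Return the names of the component files that do not exist yet."""
    missing = []
    for suffix in COMPONENTS:
        name = "{}.{}".format(lang, suffix)
        if not os.path.isfile(os.path.join(tessdata_dir, name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "tessdata" and "mylang" are placeholder names (assumptions).
    for name in missing_components("tessdata", "mylang"):
        print("missing:", name)
```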
The language I am working on is not included in UTF-8, but is in UTF-16,
though it has its own official Unicode code point range.
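I may be wrong about UTF-8 here. As far as I know, UTF-8 and UTF-16 can both represent every Unicode code point (surrogates excepted), so a quick check like the sketch below should tell me whether my characters encode. The code point U+0985 is only a stand-in for illustration, not my actual language.

```python
def utf8_round_trip(code_point):
    """Encode one code point as UTF-8 and check that it decodes back."""
    ch = chr(code_point)
    encoded = ch.encode("utf-8")  # raises UnicodeEncodeError for surrogates
    return encoded, encoded.decode("utf-8") == ch


if __name__ == "__main__":
    # 0x0985 is an illustrative code point only (assumption).
    encoded, ok = utf8_round_trip(0x0985)
    print(encoded, ok)
```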
From what I have understood so far:
*eng.word-dawg* : I need to create a text file *mylang.txt* with one word
on each line. The words, and their letters, will be in the language I am
working on. Then I convert it to a *dawg* file. I assume the command for that
is
wordlist2dawg mylang.txt mylang.word-dawg
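A minimal sketch of how I would prepare *mylang.txt* in Python before running that command, assuming the tool expects a UTF-8 text file with one word per line (I have not confirmed the expected encoding). The function name and the duplicate removal are my own choices.

```python
import io


def write_wordlist(words, path):
    """Write unique, non-empty words to path, one per line, as UTF-8."""
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    with io.open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in unique:
            handle.write(word + "\n")
    return unique
```

For example, write_wordlist(["alpha", "beta", "alpha"], "mylang.txt") would write two lines and keep the first-seen order.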
*eng.number-dawg* : Create a text file *mylangnum.txt* with the numerical
characters (0 to 9), one on each line. Then convert it to
*mylang.number-dawg*.
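If my language uses its own digit characters instead of 0 to 9, I assume the ten lines could be generated from the code point of its zero digit, as in the sketch below. It relies on the script's digits being ten consecutive code points, which is an assumption, and U+09E6 in the example is only illustrative.

```python
import unicodedata


def digit_lines(zero_code_point):
    """Return the ten digit characters starting at zero_code_point."""
    chars = [chr(zero_code_point + i) for i in range(10)]
    for value, ch in enumerate(chars):
        # unicodedata.digit returns the default (None) for non-digits.
        if unicodedata.digit(ch, None) != value:
            raise ValueError("U+{:04X} is not digit {}".format(ord(ch), value))
    return chars


if __name__ == "__main__":
    # 0x30 is ASCII "0"; 0x09E6 is an illustrative non-ASCII zero digit.
    print("\n".join(digit_lines(0x30)))
```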
*eng.freq-dawg* : Same steps as for the *eng.word-dawg* file, but with the most
frequent words only (these could be obtained, for example, by processing a
dataset such as a newspaper corpus). The most frequent word goes on the first
line (the frequency itself is not needed), the next most frequent word on the
second line, and so on.
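A rough sketch of how I imagine getting that ordering from a corpus with Python. Splitting on whitespace is an assumption that may not suit my language, and punctuation is not removed here. The function name and the limit parameter are my own.

```python
from collections import Counter


def frequent_words(text, limit):
    """Return up to `limit` words, most frequent first, without counts."""
    counts = Counter(text.split())
    return [word for word, _count in counts.most_common(limit)]


if __name__ == "__main__":
    sample = "a b a c a b"
    # One word per line, most frequent first.
    print("\n".join(frequent_words(sample, 2)))
```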
I don't know anything about the remaining 7 files.
Could someone please direct me to a better tutorial on adding a new language to
Tesseract, or verify my assumptions above and tell me about the remaining 7
files?
*And how should I proceed once I have all 10 files?*
*The steps* in [Tesseract 3.00-3.02][
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
]:
Generate Training Images
Make Box Files
Bootstrapping a new character set
Tif/Box pairs provided
are still a bit confusing to me.
I am working with Python on Ubuntu 16.04 LTS and Tesseract version 3.04.01
(installed with sudo apt install tesseract-ocr; it works perfectly for
English).
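In case it helps anyone reproduce my setup, this is roughly how I would read the installed version from Python. The sample output format in the comment is what I expect to see, not something I have verified across builds, and the sketch merges stdout and stderr because I am not sure which stream my build prints to.

```python
import re
import shutil
import subprocess


def parse_version(output):
    """Extract a version such as '3.04.01' from `tesseract --version` text."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", output)
    return match.group(1) if match else None


def installed_version():
    """Return the installed Tesseract version, or None if not found."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams into stdout
        universal_newlines=True,
    )
    return parse_version(result.stdout)


if __name__ == "__main__":
    # Expected first line on my machine (assumed): "tesseract 3.04.01"
    print(installed_version())
```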
I am new to this field, so sorry if I have made any mistakes.
If I first need to upgrade Tesseract to version 4, do I need to uninstall
the previous version, or can I override it with some update command? (Will
the PPA of alex-tesseract 4 work for overriding the version?)
*Thank you.*