Re: [tesseract-ocr] Trainning tesseract for a new language from scratch that does not exist in Tesseract

Shree Devi Kumar Thu, 28 Mar 2019 21:52:44 -0700

For Tesseract 3, and for training a language similar to Vietnamese (vie),
take a look at VietOCR and jTessBoxEditor.


On Fri, 29 Mar 2019, 00:02 , <[email protected]> wrote:

> The steps described here for [Tesseract 3.00-3.02][
> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
> ] are not clear to me, and I could not find any clear documentation on them.
>
> The page says that the following data files are required:
>
>     tessdata/eng.config
>     tessdata/eng.unicharset
>     tessdata/eng.unicharambigs
>     tessdata/eng.inttemp
>     tessdata/eng.pffmtable
>     tessdata/eng.normproto
>     tessdata/eng.punc-dawg
>     tessdata/eng.word-dawg
>     tessdata/eng.number-dawg
>     tessdata/eng.freq-dawg
>
>
> But it does not explain what their formats are or what the files actually
> contain.
>
> The language I am working on is not included in UTF-8, but is in UTF-16,
> though it has its own official Unicode code-point range.
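
On the encoding point: UTF-8 and UTF-16 are both encodings of the whole Unicode range, so a script that has an assigned code-point range can be written in either one. A minimal Python 3 sketch of that, using U+ABC0 purely as an arbitrary stand-in character (the message does not name the script, so that choice is an assumption):

```python
# Assumption: U+ABC0 is only a stand-in; substitute a code point from
# the script you are actually working with.
sample = "\uabc0"


def roundtrip(text, encoding):
    """Encode text, decode it again, and return (raw bytes, decoded text)."""
    raw = text.encode(encoding)
    return raw, raw.decode(encoding)


utf8_bytes, utf8_text = roundtrip(sample, "utf-8")
utf16_bytes, utf16_text = roundtrip(sample, "utf-16-le")

print("code point  : U+%04X" % ord(sample))
print("UTF-8 bytes :", utf8_bytes.hex())
print("UTF-16 bytes:", utf16_bytes.hex())
print("round trips :", utf8_text == sample and utf16_text == sample)
```

The Tesseract training text files (word lists, box files) are expected to be UTF-8, so a corpus stored as UTF-16 can be converted once up front with the same `decode`/`encode` calls.
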
>
> From what I have understood so far:
>
> *eng.word-dawg* : I need to create a text file *mylang.txt* with one word
> on each line. The words, and their letters, will be in the language I am
> working on. Then I convert it to a *dawg* file. I assume the command for
> that is
>
>     wordlist2dawg mylang.txt mylang.word-dawg
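
A rough Python 3 sketch of preparing such a word list, assuming a plain UTF-8 corpus file. The helper name `build_word_list` and the file names in the usage note are made up for illustration. Also, as far as I remember, the wordlist2dawg examples in the 3.0x wiki pass the unicharset as a third argument (`wordlist2dawg words_list lang.word-dawg lang.unicharset`), so please check the usage message of your installed version.

```python
import string


def build_word_list(corpus_path, out_path):
    """Write each distinct word of the corpus on its own line, as UTF-8."""
    words = set()
    with open(corpus_path, encoding="utf-8") as src:
        for line in src:
            # Split on whitespace only, so combining marks stay attached
            # to their base letters.
            for token in line.split():
                # Only ASCII punctuation is trimmed here; script-specific
                # punctuation would need to be added by hand.
                word = token.strip(string.punctuation)
                if word:
                    words.add(word)
    ordered = sorted(words)
    with open(out_path, "w", encoding="utf-8", newline="\n") as dst:
        for word in ordered:
            dst.write(word + "\n")
    return ordered
```

For example, `build_word_list("corpus.txt", "mylang.txt")` would produce the one-word-per-line file that is then handed to wordlist2dawg.
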
>
> *eng.number-dawg* : Create a text file *mylangnum.txt* with the numerical
> characters, one on each line (0 to 9). Then convert it to
> *mylang.number-dawg*
>
>
> *eng.freq-dawg* : Same steps as for the *eng.word-dawg* file, but with the
> most frequent words (these could be extracted, for example, by processing
> a dataset such as a newspaper corpus). The most frequent word goes on the
> first line (no frequency value needed), the next most frequent word on
> the second line, and so on.
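
The frequency ordering described here can be sketched in Python 3 with `collections.Counter`. This is only an illustration under assumptions: the function names, the default cutoff of 1000 words, and the whitespace-only tokenization are my choices, not something taken from the Tesseract documentation.

```python
from collections import Counter


def frequent_words(corpus_path, top_n=1000):
    """Return the top_n most common tokens, most frequent first."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as src:
        for line in src:
            # Tokens are split on whitespace only; no punctuation handling.
            counts.update(line.split())
    return [word for word, _ in counts.most_common(top_n)]


def write_word_lines(words, out_path):
    """Write one word per line as UTF-8, keeping the given order."""
    with open(out_path, "w", encoding="utf-8", newline="\n") as dst:
        for word in words:
            dst.write(word + "\n")
```

The resulting file has no frequency column, only the words in descending order of frequency, which matches the layout described in the quoted paragraph.
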
>
> I don't know anything about the remaining 7 files.
>
> Could someone please direct me to a better tutorial on adding a new
> language to Tesseract?
>
> Or verify my assumptions above and tell me about the remaining 7 files,
> *and how to proceed once I have all 10 files.*
>
> *The steps* in [Tesseract 3.00-3.02][
> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
> ]:
>
>     Generate Training Images
>     Make Box Files
>     Bootstrapping a new character set
>     Tif/Box pairs provided
>
> are still a bit confusing to me.
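
For the "Make Box Files" step, it may help to see what a box file line looks like. As I understand the 3.0x format, each line holds a symbol followed by the left, bottom, right and top pixel coordinates of its bounding box (origin at the bottom-left of the image) and a page number. A small Python 3 sketch for reading such a line; the `Box` type, the `parse_box_line` helper and the sample values are only illustrative:

```python
from collections import namedtuple

Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' line."""
    # Split from the right so the five trailing integers are always separated
    # from the symbol, which may be longer than one character.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(parts), line))
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(symbol, left, bottom, right, top, page)


# Illustrative values only, not taken from a real training image.
box = parse_box_line("s 734 494 751 519 0\n")
print(box.symbol, box.right - box.left, box.top - box.bottom, box.page)
```

A box editor such as jTessBoxEditor edits exactly these lines, so checking a few of them by hand against the image is a quick way to confirm the boxes are sane.
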
>
> I am working with Python on Ubuntu 16.04 LTS and Tesseract version 3.04.01
> (installed with sudo apt install tesseract-ocr; it works perfectly for
> English).
> I am new to this field, so sorry if I have made any mistakes.
>
> If I am required to upgrade Tesseract to version 4 first, do I need to
> uninstall the previous version, or can I override it with some update
> command? (Will the alex-tesseract 4 PPA work for overriding the
> version?)
> *Thank you.*
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/6c54f502-0c92-424f-87ca-77fe58694d53%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/6c54f502-0c92-424f-87ca-77fe58694d53%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

