For Tesseract 3, and for training a language similar to vie (Vietnamese), take a look at VietOCR and jTessBoxEditor.
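
Since you mention working in Python, here is a minimal sketch of how the two word-list text files described in your message could be prepared from a plain-text corpus: one word per line, with the frequent-word list ordered most frequent first. The function names, the file names mylang.txt and mylang_freq.txt, and the top_n cutoff are assumptions made for illustration and are not part of Tesseract. The files are written as UTF-8, which can encode any Unicode code point, so a language with its own code-point range should be fine. The resulting files would then be converted with wordlist2dawg as described in the training wiki.

```python
import unicodedata
from collections import Counter


def strip_punct(token):
    """Trim leading/trailing punctuation and symbols (Unicode categories P*, S*)."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start])[0] in "PS":
        start += 1
    while end > start and unicodedata.category(token[end - 1])[0] in "PS":
        end -= 1
    return token[start:end]


def build_wordlists(text, top_n=1000):
    """Return (all distinct words sorted, the top_n most frequent words)."""
    # Split on whitespace rather than a \w regex, so that combining marks
    # inside a word are kept with it.
    tokens = (strip_punct(t) for t in text.split())
    # Skip empty tokens and pure numbers; numbers belong in the number list.
    counts = Counter(t for t in tokens if t and not t.isdigit())
    all_words = sorted(counts)
    # Most frequent first; ties broken alphabetically for a stable order.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return all_words, ranked[:top_n]


def write_wordlist(path, words):
    """Write one word per line, UTF-8 encoded, with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for w in words:
            f.write(w + "\n")


if __name__ == "__main__":
    # Hypothetical sample text; replace with your own corpus.
    sample = "the cat sat. the dog sat, the end! 42"
    words, frequent = build_wordlists(sample, top_n=2)
    write_wordlist("mylang.txt", words)          # input for the word dawg
    write_wordlist("mylang_freq.txt", frequent)  # input for the freq dawg
```

Whitespace splitting only suits languages that separate words with spaces; a script without word spacing would need a proper word segmenter instead.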
On Fri, 29 Mar 2019, 00:02, <[email protected]> wrote:

> The steps mentioned here for [Tesseract 3.0-3.02][
> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
> ] are not clear, and I could not find any clear documentation about them.
>
> It is mentioned that the following data files are required:
>
> tessdata/eng.config
> tessdata/eng.unicharset
> tessdata/eng.unicharambigs
> tessdata/eng.inttemp
> tessdata/eng.pffmtable
> tessdata/eng.normproto
> tessdata/eng.punc-dawg
> tessdata/eng.word-dawg
> tessdata/eng.number-dawg
> tessdata/eng.freq-dawg
>
> But it does not explain what their formats are or what they actually are.
>
> The language I am working on is not included in utf-8, but is in utf-16,
> though it has its official Unicode code-point range.
>
> From what I have understood so far:
>
> *eng.word-dawg*: I need to create a text file *mylang.txt* with one word
> on each line. The words, and their letters, will be in the language I am
> working on. Then I convert it to a *dawg* file. I assume the command for
> that is
>
> wordlist2dawg mylang.txt mylang.word-dawg
>
> *eng.number-dawg*: Create a text file *mylangnum.txt* with the numerical
> characters, one on each line (0 to 9). Then convert it to
> *mylang.number-dawg*.
>
> *eng.freq-dawg*: Same steps as for the *eng.word-dawg* file, but with the
> most frequent words (frequent words could be retrieved, for example, by
> processing a dataset such as a newspaper corpus), starting with the most
> frequent word on the first line (no need for the frequency itself),
> followed by the next most frequent word on the second line, and so on.
>
> I don't know about the remaining 7 files.
>
> Could someone please direct me to a better tutorial on adding a new
> language to Tesseract?
>
> Or verify my assumptions above and tell me about the remaining 7 files.
> *And how to proceed further after having all 10 files.*
>
> *The steps:* [Tesseract 3.0-3.02][
> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02
> ]
> Generate Training Images
> Make Box Files
> Bootstrapping a new character set
> Tif/Box pairs provided
>
> are still a bit confusing to me.
>
> I am working with Python on Ubuntu 16.04 LTS, Tesseract version 3.04.01
> (installed with sudo apt install tesseract-ocr, and working perfectly
> for the English language).
> I am new to this field, sorry if I made any mistakes.
>
> If the requirement is to upgrade Tesseract to version 4 first, then do I
> need to uninstall the previous version, or override it with some update
> command? (Will the PPA of alex-tesseract 4 work for overriding the
> version?)
> *Thank you.*

