Thank you for the response. I moved bazaar to the end of the command, and it now runs without any error. However, tesseract still does not recognize the extra letters that I provided in *tessedit_char_whitelist*; the output is the same as before. The words in the image are already present in the *vie.user-words* file.

1. Is there anything wrong with the way I created that file?
2. How should I approach this issue? Do I need to provide any other files?
3. Or do I need to retrain the language separately from scratch?

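For reference, here is a minimal sketch of how I am now building the command from Python. The helper name build_tesseract_command is only illustrative and not part of any library; the argument order follows the usage line printed by tesseract --help, where options such as -l and -c come before config file names. It only fixes the ordering and is not a claim that the whitelist will add new characters.

```python
import shlex


def build_tesseract_command(image, lang, whitelist, config_name):
    """Return the argv list with all options first and the config file name last."""
    return [
        "tesseract", image, "stdout",
        "-l", lang,
        "-c", "tessedit_char_whitelist=" + whitelist,
        config_name,  # config file names must follow every option
    ]


if __name__ == "__main__":
    whitelist = (
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789àâêî"
    )
    cmd = build_tesseract_command("procssed_image.png", "vie", whitelist, "bazaar")
    # Print the shell-quoted command so it can be checked before running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems with the accented characters.
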
Thanks.

On Friday, March 29, 2019 at 10:20:19 AM UTC+5:30, shree wrote:
>
> tesseract procssed_image.png stdout -l vie bazaar -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî
>
> Bazaar should be listed last - see tesseract --help
>
> Check your command syntax
>
> On Fri, 29 Mar 2019, 00:02, <[email protected]> wrote:
>
>> I am trying to train a language currently not present in Tesseract.
>>
>> Working with python on Ubuntu 16.04 LTS, tesseract version 3.04.01 (installed with sudo apt install tesseract-ocr , and is working perfectly for english language)
>>
>> I have tested with the following command :
>>
>> tesseract procssed_image.png stdout -l vie
>>
>> The output is 90% correct except for some characters that are not in the vietnam language.
>>
>> Then, I have created the *bazaar* file (/usr/share/tesseract-ocr/tessdata/configs/):
>>
>> *load_system_dawg F
>> load_freq_dawg F
>> user_words_suffix user-words*
>>
>> created a text file with my custom list of words (around 150 words, one word in each line) and named it as *vie.user-words*
>>
>> And then ran the following command:
>>
>> tesseract procssed_image.png stdout -l vie bazaar
>>
>> The result was same.
>>
>> Then when I tried with :
>>
>> tesseract procssed_image.png stdout -l vie bazaar -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî
>>
>> tessedit_char_whitelist <- Here, I am trying to put all the list of characters that is present in my language and other symbols present in the image file.
>>
>> It shows the following errors and also prints the output ( result is same as before )
>>
>> *read_params_file: Can't open c
>> read_params_file: Can't open tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî*
>>
>> Please tell me how to fix this issue? Thank you for your time.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/377503b8-7a6d-4cdc-82d7-964a9b955824%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

