Try first with best/Latin.traineddata
that should handle text with diacritics.
-----------
>>Pango suggested font Gandhari Unicode.

Use "Gandhari Unicode" within quotes as the font name.

>>ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text

Give script_dir a link to the langdata folder where you have your training text.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Aug 29, 2017 at 11:58 AM, Anand Akella <anand.ake...@gmail.com> wrote:

> Hi,
> I'm new to tesseract and have a pdf file with diacritical marks. I tried to
> run tesseract 4.0.0 with language eng. I see that it is not able to
> recognize the text with diacritical marks. I found a font that can detect
> diacritical marks.
>
> Gandhari Unicode 5.1
> <http://andrewglass.org/download.php?fname=gu5-110_ttf&extn=zip>
>
> I extracted the font files and copied them to
> /home/tesseract/Downloads/fonts
>
> Whenever I try to run tesstrain.sh it gives me an error "could not find
> font named gandhariunicode"
>
> ./tesstrain.sh --fontlist 'gandhariunicode' --fonts_dir /home/tesseract/Downloads/fonts/ --lang eng --langdata_dir /usr/local/share/tessdata/ --overwrite
>
> === Starting training for language 'eng'
> [Mon Aug 28 23:18:12 PDT 2017] /usr/local/bin/text2image --fonts_dir=/home/tesseract/Downloads/fonts/ --font=gandhariunicode --outputbase=/tmp/font_tmp.C9vSySTfge/sample_text.txt --text=/tmp/font_tmp.C9vSySTfge/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.C9vSySTfge
> Could not find font named gandhariunicode.
> Pango suggested font Gandhari Unicode.
> Please correct --font arg.
>
> === Phase I: Generating training images ===
> ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text
>
> What could the issue be? Please let me know. Thanks in advance.
>
> Thanks,
> Anand
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> <https://groups.google.com/d/msgid/tesseract-ocr/ca874bc1-1458-49da-bf07-005aacd7d582%40googlegroups.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.
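
The two fixes suggested in the reply (use the font name Pango reported, and point the langdata directory at a folder that actually contains eng/eng.training_text) could be combined roughly as in the sketch below. The flags are the ones from the original command; the langdata location /home/tesseract/langdata is only an assumed, hypothetical path and has to be replaced with wherever the langdata files really live. The script prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only. The font name and fonts directory come from the thread;
# LANGDATA_DIR is a hypothetical path to a langdata checkout (assumption).
FONT_NAME="Gandhari Unicode"
FONTS_DIR="/home/tesseract/Downloads/fonts/"
LANGDATA_DIR="/home/tesseract/langdata"
LANG_CODE="eng"

# Location of the training text, following the <dir>/<lang>/<lang>.training_text
# layout seen in the quoted error message.
training_text_path() {
    printf '%s/%s/%s.training_text\n' "${LANGDATA_DIR%/}" "$LANG_CODE" "$LANG_CODE"
}

# Print the corrected invocation so it can be reviewed before running it.
print_tesstrain_cmd() {
    printf "./tesstrain.sh --fontlist '%s' --fonts_dir %s --lang %s --langdata_dir %s --overwrite\n" \
        "$FONT_NAME" "$FONTS_DIR" "$LANG_CODE" "$LANGDATA_DIR"
}

# Warn early if the training text is not where tesstrain.sh would look for it.
if [ ! -f "$(training_text_path)" ]; then
    echo "warning: $(training_text_path) not found; check LANGDATA_DIR" >&2
fi

print_tesstrain_cmd
```

If the warning appears, the "Could not find training text file" error from the original run would most likely show up again, so the directory should be fixed before starting training.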