Hi, I'm new to Tesseract and have a PDF file containing text with diacritical marks. I ran Tesseract 4.0.0 with the language set to eng, but it does not recognize the text with diacritical marks. I found a font that supports these diacritical marks:
Gandhari Unicode 5.1 <http://andrewglass.org/download.php?fname=gu5-110_ttf&extn=zip>

I extracted the font files and copied them to /home/tesseract/Downloads/fonts. Whenever I run tesstrain.sh, it gives me the error "Could not find font named gandhariunicode".

./tesstrain.sh --fontlist 'gandhariunicode' --fonts_dir /home/tesseract/Downloads/fonts/ --lang eng --langdata_dir /usr/local/share/tessdata/ --overwrite

=== Starting training for language 'eng'
[Mon Aug 28 23:18:12 PDT 2017] /usr/local/bin/text2image --fonts_dir=/home/tesseract/Downloads/fonts/ --font=gandhariunicode --outputbase=/tmp/font_tmp.C9vSySTfge/sample_text.txt --text=/tmp/font_tmp.C9vSySTfge/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.C9vSySTfge
Could not find font named gandhariunicode. Pango suggested font Gandhari Unicode.
Please correct --font arg.

=== Phase I: Generating training images ===
ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text

What could the issue be? Please let me know. Thanks in advance.

Thanks,
Anand
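The first error in the log above already names a candidate fix: Pango suggests the font name "Gandhari Unicode" in place of "gandhariunicode". As a minimal sketch, the snippet below pulls that suggested name out of the error text and prints (without running) the same tesstrain.sh command with it. The variable names `log` and `suggested` are illustrative only, the sed pattern assumes the suggested name contains no period, and whether the command then succeeds is not confirmed here.

```shell
#!/bin/sh
# Error text copied from the log in this post.
log='Could not find font named gandhariunicode. Pango suggested font Gandhari Unicode. Please correct --font arg.'

# Extract the name Pango suggested: the text between
# "Pango suggested font " and the next period.
# Assumption: the font name itself contains no period.
suggested=$(printf '%s\n' "$log" | sed -n 's/.*Pango suggested font \([^.]*\)\..*/\1/p')

# Print, but do not run, the original command with the suggested name.
# All flags and paths are the ones from the original command line.
printf "./tesstrain.sh --fontlist '%s' --fonts_dir /home/tesseract/Downloads/fonts/ --lang eng --langdata_dir /usr/local/share/tessdata/ --overwrite\n" "$suggested"
```

The second error looks like a separate problem: the log shows tesstrain.sh looking for eng/eng.training_text under the directory given to --langdata_dir, so that directory would need to contain that file.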