[tesseract-ocr] Indigenous Language OCR w/ Tesseract 4.0 in Mac OSX - unicharset_extractor "command not found"

2020-10-08 Thread Josh Holden
Dear All, I’m looking for advice because I am stuck. I’m training Tesseract to do optical character recognition of texts in Lushootseed, an Indigenous language of Washington State with no living speakers. The language has some special characters and many diacritics, and I do not know what the

Re: [tesseract-ocr] Diacriticals Training

2020-10-08 Thread Shree Devi Kumar
I have uploaded the results of various trainings for IAST (with diacritics) and Devanagari for Sanskrit at https://github.com/Shreeshrii/tess5training-sanskrit-iast/tree/main/tessdata/best . The traineddata files and the corresponding lstm-unicharset have been uploaded there. The training has been

[tesseract-ocr] Training Tesseract 4 on real images

2020-10-08 Thread Sim Tov
Hello, I would like to train *Tesseract 4* to recognize certain scripts/languages based on real images rather than synthetic ones. Here are my questions: 1. Is there a tool, preferably a cross-platform (Windows/Linux) GUI, that assists in creating a .box file based on scanned images? How to get co