Hi Wincent,

Thank you for sharing these tools. I find create-dictdata to be very useful.
I wanted to know whether you have modified any of the OCR evaluation tools to handle the high Unicode range, as needed for languages such as Akkadian. I was trying to test the Modi script (*Range*: U+11600..U+1165F; 96 code points) and found that `ocrevalutf8 accuracy` does not work well for it.

Any suggestions?

Shree

On Sunday, January 5, 2020 at 2:22:50 AM UTC+5:30, Wincent Balin wrote:
> Hi all,
>
> I would like to announce pytesstrain, a collection of Tesseract training
> tools, as well as the underlying library. The tools were created while
> training Tesseract to recognise Akkadian language (stay tuned for more
> posts!), to solve the problems that emerged in the process.
>
> You can install it with pip install pytesstrain.
>
> The PyPI page for the package is https://pypi.org/project/pytesstrain/.
> The GitHub project page is https://github.com/wincentbalin/pytesstrain.
>
> This package contains the tools to create dictionary data (wordlist, bi-
> and unigram lists, etc.), rewrap lines in text files to the specified
> length, collect most frequent recognition errors and dump them into
> unicharambigs file, and to perform recognition metrics (WER and CER). It
> also contains the run_test() function, which creates an image file from
> the given string and performs OCR on it afterwards, as well as its
> parallelised version, run_tests(), which can be used in future tools.
>
> Feedback, suggestions, etc would be most welcome.
>
> Yours truly,
>
> Wincent
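
On the question of code points beyond the Basic Multilingual Plane: the sketch below is one way to cross-check a character error rate (CER) for Modi text by counting Unicode code points. It is an illustration only. It is not how `ocrevalutf8` or pytesstrain compute their metrics, and it does not diagnose why `ocrevalutf8 accuracy` misbehaves. The function names `levenshtein` and `cer` and the sample strings are hypothetical. The strings are simply the first three code points of the Modi block, with one substitution in the hypothesis.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, counted in Unicode code points."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical sample data: the first three code points of the Modi block.
reference = "\U00011600\U00011601\U00011602"
hypothesis = "\U00011600\U00011602\U00011602"  # second character misrecognised

# A Python str is indexed by code point, so each Modi character counts once,
# although it takes 4 bytes in UTF-8 and 2 code units in UTF-16.
code_points = len(reference)                          # 3
utf8_bytes = len(reference.encode("utf-8"))           # 12
utf16_units = len(reference.encode("utf-16-le")) // 2  # 6

cer_code_points = cer(reference, hypothesis)  # 1 error / 3 characters

# The same comparison done on raw UTF-8 bytes gives a different rate,
# because one wrong character is only one wrong byte out of twelve here.
byte_errors = levenshtein(reference.encode("utf-8"), hypothesis.encode("utf-8"))
cer_bytes = byte_errors / utf8_bytes

print(f"code points={code_points}, UTF-8 bytes={utf8_bytes}, UTF-16 units={utf16_units}")
print(f"CER by code point: {cer_code_points:.3f}")
print(f"CER by UTF-8 byte: {cer_bytes:.3f}")
```

If an evaluation tool reports character counts for a Modi ground-truth file that look closer to the byte or UTF-16 figures than to the code-point figure, that would suggest it is not treating supplementary-plane characters as single units. That is only a guess to check, not a confirmed cause.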