Thanks, Wincent. I will try out the tools you added. I found a Unicode version of the ISRI evaluation tools at https://github.com/eddieantonio/ocreval, which also handles high-range Unicode code points. See https://github.com/Shreeshrii/tesstrain-modi/blob/master/reports/modi-eval-modiLayer_1.017_157724_324000/report_modiLayer_1.017_157724_324000-modi-ALL.txt for an example.

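In case it helps others testing scripts outside the Basic Multilingual Plane: the sketch below shows why the high range is easy to get wrong. It is not how ocreval or pytesstrain compute their metrics; `edit_distance` and `cer` are hypothetical helpers written only for this illustration. The point is that a Modi character is one code point but two UTF-16 code units, so the character error rate has to be computed over code points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: strings of code points)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (hypothetical helper)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Two code points from the Modi block (U+11600..U+1165F).
# A Python str is indexed by code point, so each counts as one character,
# although each needs two UTF-16 code units (a surrogate pair).
reference = "\U00011600\U00011601"
hypothesis = "\U00011600\U00011602"  # second character misrecognised

code_points = len(reference)
utf16_units = len(reference.encode("utf-16-le")) // 2
cer_value = cer(reference, hypothesis)

print(code_points, utf16_units, cer_value)
```

A tool that counts UTF-16 code units or bytes instead of code points would see a different string length here, and its accuracy figures for such scripts would be skewed accordingly.
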
Do you have a workflow for Tesseract training using your tools? If so, I would like to add it, or refer to it, in the Tesseract documentation.

On Tue, Feb 4, 2020 at 2:06 AM Wincent Balin <wincent.ba...@gmail.com> wrote:

> Hi Shree,
>
> I am glad you find the package already useful :-) .
>
> As to your question: I did not use the ocr-evaluation tools, only the
> language_metrics utility. So, regrettably, I cannot help you here. But
> maybe you could try the same utility too?
>
> By the way, I added a create_ground_truth utility, which creates .gt.txt
> files as well as the associated .tif files for every specified font, to
> the package. I think it could be useful for anyone who does not have a
> ground truth collection yet.
>
> Kind regards,
>
> Wincent
>
>
> On Wednesday, 29 January 2020 06:47:01 UTC+1, shree wrote:
>>
>> Hi Wincent,
>>
>> Thank you for sharing these tools. I find create-dictdata to be very
>> useful.
>>
>> I wanted to know if you have modified any ocr-evaluation tools to handle
>> the high unicode range such as for Akkadian language.
>>
>> I was trying to test regarding Modi script (*Range*: U+11600..U+1165F;
>> (96 code points)) and found that `ocrevalutf8 accuracy` does not work
>> well for it. Any suggestions ...
>>
>> Shree
>>
>> On Sunday, January 5, 2020 at 2:22:50 AM UTC+5:30, Wincent Balin wrote:
>>>
>>> Hi all,
>>>
>>> I would like to announce pytesstrain, a collection of Tesseract
>>> training tools, as well as the underlying library. The tools were created
>>> while training Tesseract to recognise Akkadian language (stay tuned for
>>> more posts!), to solve the problems that emerged in the process.
>>>
>>> You can install it with pip install pytesstrain.
>>>
>>> The PyPI page for the package is https://pypi.org/project/pytesstrain/.
>>> The GitHub project page is https://github.com/wincentbalin/pytesstrain.
>>>
>>> This package contains the tools to create dictionary data (wordlist, bi-
>>> and unigram lists, etc.), rewrap lines in text files to the specified
>>> length, collect most frequent recognition errors and dump them into
>>> unicharambigs file, and to perform recognition metrics (WER and CER). It
>>> also contains the run_test() function, which creates an image file from
>>> the given string and performs OCR on it afterwards, as well as its
>>> parallelised version, run_tests(), which can be used in future tools.
>>>
>>> Feedback, suggestions, etc would be most welcome.
>>>
>>> Yours truly,
>>>
>>> Wincent
>>>
>>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/3df5801b-7119-4451-9bb5-5fabc3e66bb1%40googlegroups.com
> .
>

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com