On Tuesday, November 14, 2023 at 12:55:07 AM UTC-5 desal...@gmail.com wrote:
> It looks like every one is having issues with tesseract.

That's not true. It just looks like that because this list is dominated by newcomers to the field of OCR and image processing.

> I am not able to find any one who has a great success with this software.

With all due respect, you must not have looked very hard.

> It would be really encouraging to hear any success story from any language.

As Merlijn already mentioned, the Internet Archive has used Tesseract to OCR over 10 million *documents* (so 100s of millions of pages?) in hundreds of languages
https://archive.org/search?query=ocr%3Atesseract*

TAMU's eMOP project used Tesseract with custom training to OCR 45 million old crufty page images from the dawn of the printing press
https://emop.tamu.edu/software

State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines
https://arxiv.org/abs/1810.03436

German Parliamentary Corpus (GerParCor)
https://arxiv.org/abs/2204.10422

Additional arXiv papers using this search
<https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=tesseract&terms-0-field=all&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=exclude&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first>.
Following the citation graphs of any of the papers will turn up additional potentially interesting papers.

> Has anybody a successful training of tesseract?

Yes, many. Nick White trained Ancient Greek. Shree has posted copiously about his efforts training Tesseract.
See the list archives as well as his repos:
https://github.com/Shreeshrii?tab=repositories&q=tessdata_

Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR
https://www.mdpi.com/2076-3417/11/20/9752

Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
https://arxiv.org/abs/2109.05952

There's a contrib repository with Akkadian, polytonic Greek, and other user-trained languages
https://github.com/tesseract-ocr/tessdata_contrib

> (like, a model that can detect with higher accuracy: 98% or more ?)

An accuracy figure without context is meaningless. What language? What domain? What image source? What resolution? Word or character accuracy? etc., etc.

If you read some of the papers and descriptions of the large scale projects, you'll see that OCR model training is a non-trivial problem which people spend months/years on.

Tom
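
On the word-versus-character question: here is a minimal sketch of how the two numbers can diverge on the same OCR output. It is not taken from Tesseract or from any of the projects cited above. The function names and the sample strings are made up for illustration, and it assumes accuracy is defined as 1 minus the edit distance divided by the reference length, which is only one of several definitions in use.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed definition: 1 - distance / len(ref), clamped at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ref_text, hyp_text):
    # Compare character by character (spaces count as characters).
    return accuracy(ref_text, hyp_text)


def word_accuracy(ref_text, hyp_text):
    # Compare whitespace-separated tokens; one wrong letter spoils the whole word.
    return accuracy(ref_text.split(), hyp_text.split())


# Hypothetical ground truth and OCR output with two single-letter errors.
reference = "the quick brown fox jumps over the lazy dog"
ocr_output = "the quick hrown fox jumps ovcr the lazy dog"

print(f"character accuracy: {char_accuracy(reference, ocr_output):.1%}")
print(f"word accuracy:      {word_accuracy(reference, ocr_output):.1%}")
```

With two wrong letters out of 43 characters, the character accuracy is about 95%, but since those letters fall in two of the nine words, the word accuracy is only about 78%. The same output can honestly be reported either way, which is why a bare "98%" says little.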