[tesseract-ocr] Re: Any success story?

On Tuesday, November 14, 2023 at 12:55:07 AM UTC-5 desal...@gmail.com wrote:

It looks like everyone is having issues with tesseract.

That's not true. It just looks like that because this list is dominated by
newcomers to the field of OCR and image processing.

I am not able to find anyone who has had great success with this software.

With all due respect, you must not have looked very hard.

It would be really encouraging to hear any success story from any language.

As Merlijn already mentioned, the Internet Archive has used Tesseract to OCR
over 10 million *documents* (so 100s of millions of pages?) in hundreds of
languages
https://archive.org/search?query=ocr%3Atesseract*

TAMU's eMOP project used Tesseract with custom training to OCR 45 million
old crufty page images from the dawn of the printing press
https://emop.tamu.edu/software

State of the Art Optical Character Recognition of 19th Century Fraktur
Scripts using Open Source Engines
https://arxiv.org/abs/1810.03436

German Parliamentary Corpus (GerParCor)
https://arxiv.org/abs/2204.10422

Additional arXiv papers using this search
<https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=tesseract&terms-0-field=all&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=exclude&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first>.

Following the citation graphs of any of the papers will turn up additional
potentially interesting papers.

Has anybody successfully trained tesseract?

Yes, many.

Nick White trained Ancient Greek.
Shree has posted copiously about his efforts training Tesseract. See the
list archives as well as his repos:
https://github.com/Shreeshrii?tab=repositories&q=tessdata_

Exploiting Script Similarities to Compensate for the Large Amount of Data
in Training Tesseract LSTM: Towards Kurdish OCR
https://www.mdpi.com/2076-3417/11/20/9752

Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy
Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
https://arxiv.org/abs/2109.05952

There's a contrib repository with Akkadian, polytonic Greek, and other
user-trained languages
https://github.com/tesseract-ocr/tessdata_contrib

(like, a model that can detect with higher accuracy: 98% or more ?)

An accuracy figure without context is meaningless. What language? What
domain? What image source? What resolution? Word or character accuracy?
And so on.
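
To illustrate why "word or character accuracy?" matters, here is a minimal
sketch that scores the same OCR output both ways. It is not how Tesseract or
any particular evaluation tool computes accuracy; the function names and the
sample strings are made up for this example, and accuracy is assumed to mean
1 minus (edit distance / reference length).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_accuracy(ref, hyp):
    """Character-level accuracy (assumed definition: 1 - CER)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


def word_accuracy(ref, hyp):
    """Word-level accuracy (assumed definition: 1 - WER), split on whitespace."""
    ref_words, hyp_words = ref.split(), hyp.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical ground truth and OCR output: 4 wrong characters, 4 wrong words.
truth = "the quick brown fox jumps over the lazy dog"
ocr = "the qu1ck brown f0x jumps ovcr the lazy d0g"

print(f"character accuracy: {char_accuracy(truth, ocr):.3f}")
print(f"word accuracy:      {word_accuracy(truth, ocr):.3f}")
```

With these made-up strings, four single-character errors out of 43
characters give a character accuracy of about 0.91, but since each error
lands in a different word, word accuracy is only about 0.56. The same output
can be reported as "91%" or "56%" depending on the metric, which is why the
bare number says little.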

If you read some of the papers and descriptions of the large scale projects,
you'll see that OCR model training is a non-trivial problem which people
spend months/years on.

Tom
