> The Internet Archive has switched to using Tesseract for all our OCR,

I am so happy to hear this. It will be great to have the Indic languages that have so far been marked as non-OCRable converted to text correctly on the Internet Archive.
Is there a page with instructions for doing this? Can a language be specified when OCRing? For example, script/Devanagari often gives better results than san for Sanskrit.

Regarding your question about tessdata: there have only been minor changes to the tessdata files, but adding a tag is a good idea. I suggest you post this as a feature request in the repo.

On Wed, Jan 27, 2021, 15:58 Merlijn B.W. Wajer <merl...@archive.org> wrote:

> Hi,
>
> With Tesseract now switching to regular (alpha) releases of 5.0.0; does
> it make sense to consider some versioning for language files as well?
>
> The Internet Archive has switched to using Tesseract for all our OCR,
> and I'm hoping that we can record exactly what version of language files
> was used for a specific OCR job. Currently, the answer is simple, since
> we're using the default packages from Ubuntu focal, but I am working on
> switching to Tesseract release/tag 5.0.0-20201231.
>
> But the tessdata_fast (or tessdata_best, for that matter) do not seem to
> have any recent 5.x releases:
> https://github.com/tesseract-ocr/tessdata_fast/releases
>
> Are there plans to create a release/tag for the tessdata_* repositories?
>
> Cheers,
> Merlijn
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/10c2e872-f9e2-d637-2c16-84a46f800e0a%40archive.org
> .
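
To make the two points above concrete (choosing script/Devanagari instead of san, and recording which language file an OCR job used while the tessdata_* repositories have no release tags), here is a rough Python sketch. It is only an illustration, not how the Internet Archive does it: the helper names `tesseract_command` and `traineddata_sha256`, the file names, and the idea of using a SHA-256 checksum as a stand-in for a version are all my own assumptions. The only Tesseract features relied on are the standard command-line options `-l` and `--tessdata-dir`.

```python
import hashlib
from pathlib import Path


def tesseract_command(image, output_base, lang, tessdata_dir=None):
    """Build the argument list for one run of the tesseract CLI.

    `lang` is passed to `-l`, e.g. "san" or "script/Devanagari";
    several models can be combined with "+", e.g. "san+eng".
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Point tesseract at a specific directory of traineddata files.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def traineddata_sha256(tessdata_dir, lang):
    """Return the SHA-256 of <tessdata_dir>/<lang>.traineddata.

    Hypothetical stand-in for a version number: store it next to the
    OCR output so the exact language file can be identified later.
    """
    path = Path(tessdata_dir) / f"{lang}.traineddata"
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

The list returned by `tesseract_command("page.png", "page", "script/Devanagari")` can be handed to `subprocess.run(cmd, check=True)`, which should write the recognized text to `page.txt`. The script models live in a `script/` subdirectory of the tessdata directory, which is why the same `lang` string also works as a relative path in `traineddata_sha256`.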