>The Internet Archive has switched to using Tesseract for all our OCR,

I am so happy to hear this. It will be great to see the Indic languages
that have so far been marked as non-OCRable converted to text correctly on
the Internet Archive.

Is there a page with instructions for doing this? Can a language be
specified when OCRing? For example, script/Devanagari often gives better
results than san for Sanskrit.
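
To illustrate what I mean by specifying a language, here is a minimal
sketch of how it looks with a local Tesseract install, using the -l option
of the command-line tool. The file names and the helper functions are
hypothetical, and this says nothing about how the Internet Archive pipeline
itself exposes such an option. It assumes the tessdata directory contains
script/Devanagari.traineddata.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="script/Devanagari"):
    """Return the argv list for a Tesseract run with an explicit -l value.

    lang is passed through unchanged, so both "san" and
    "script/Devanagari" work as long as the matching traineddata
    file is present in the tessdata directory.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="script/Devanagari"):
    # Needs the tesseract binary on PATH; raises CalledProcessError on failure.
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
```

Calling run_ocr("page.png", "page", lang="san") on the same image makes it
easy to compare the two models on the same page.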

Regarding your question about tessdata: there have only been minor changes
to the tessdata files, but adding a tag is a good idea. I suggest you post
this as a feature request in the repo.
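
Until the tessdata repositories carry tags, one possible stopgap for
recording which language files an OCR job used is to store a checksum of
each traineddata file alongside the job. This is only a sketch under that
assumption; the function name and the directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def traineddata_fingerprints(tessdata_dir):
    """Map each *.traineddata file (path relative to tessdata_dir) to its SHA-256."""
    root = Path(tessdata_dir)
    fingerprints = {}
    # rglob also picks up files in subdirectories such as script/
    for path in sorted(root.rglob("*.traineddata")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        fingerprints[path.relative_to(root).as_posix()] = digest
    return fingerprints
```

The resulting dictionary can be saved with the OCR output, so a later run
can check whether the same language files were used.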

On Wed, Jan 27, 2021, 15:58 Merlijn B.W. Wajer <merl...@archive.org> wrote:

> Hi,
>
> With Tesseract now switching to regular (alpha) releases of 5.0.0; does
> it make sense to consider some versioning for language files as well?
>
> The Internet Archive has switched to using Tesseract for all our OCR,
> and I'm hoping that we can record exactly what version of language files
> was used for a specific OCR job. Currently, the answer is simple, since
> we're using the default packages from Ubuntu focal, but I am working on
> switching to Tesseract release/tag 5.0.0-20201231.
>
> But the tessdata_fast (or tessdata_best, for that matter) do not seem to
> have any recent 5.x releases:
> https://github.com/tesseract-ocr/tessdata_fast/releases
>
> Are there plans to create a release/tag for the tessdata_* repositories?
>
> Cheers,
> Merlijn
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/10c2e872-f9e2-d637-2c16-84a46f800e0a%40archive.org
> .
>
