[tesseract-ocr] Re: Links to wiki for training new language in tesseract

Tom Morris Fri, 17 Nov 2023 09:58:54 -0800

Hi and welcome to the group.

On Thursday, November 16, 2023 at 10:25:40 AM UTC-5 israel...@gmail.com 
wrote:

> I want to create an entirely new language from handwritten texts.

I think the "handwritten" aspect is probably at least as important as the
"new language" part. Tesseract was designed for optical character
recognition of mechanically printed text, and handwriting is very different.
There have been some attempts at handwriting in the past, but only with
block-printed characters, and even then recognition rates were under 90%,
which isn't adequate for most uses. If you search the archives here or google
"tesseract handwriting" (without the quotes), you'll find lots of reading
material.

> The language in question is Innu-aimun. The alphabet is quite simple,
> consisting of some of the Latin alphabets with the addition of a
> superscript u character that always appears after a consonant.

There is a Latin script model which has been trained in a
language-independent fashion, so you could give that a try to see how well
it does (modulo your superscript u).
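
To put a number on "how well it does", one common approach is to compare the
OCR output for a line against a hand-typed transcription of the same line and
compute a character accuracy. Here is a minimal sketch, assuming you already
have both strings in hand. The function names are mine, not part of
Tesseract, and the metric (1 minus edit distance divided by ground-truth
length) is just one reasonable definition:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: inserts, deletes and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy = 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    cer = edit_distance(ground_truth, ocr_output) / len(ground_truth)
    return max(0.0, 1.0 - cer)
```

For example, char_accuracy("abcd", "abxd") gives 0.75 (one wrong character
out of four). Python strings are compared code point by code point, so a
superscript u that comes out as a plain u counts as one substitution.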

For training with natural images (standard training uses synthesized
images), look at some of the examples in the tesstrain wiki
<https://github.com/tesseract-ocr/tesstrain/wiki>, particularly the
GT4HistOCR page <https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR>.

For any training you'll need ground truth text matched with your segmented
line images to train on.
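
As a rough illustration of what "matched" means in practice: tesstrain
expects each line image to sit next to a transcription file with the same
base name and a .gt.txt extension (e.g. line001.png and line001.gt.txt). The
sketch below checks a directory for unmatched files before you start a
training run. The function name is hypothetical, and it only handles plain
.png and .tif images, so adjust it if your files are named differently:

```python
from pathlib import Path

# Assumption: line images are plain .png or .tif files.
IMAGE_SUFFIXES = {".png", ".tif"}
GT_SUFFIX = ".gt.txt"


def pair_ground_truth(directory):
    """Return (pairs, images_without_text, texts_without_image).

    pairs is a list of (image_path, transcript_path) tuples; the other two
    are sorted lists of base names that have no counterpart.
    """
    root = Path(directory)
    images = {p.stem: p for p in root.iterdir()
              if p.suffix.lower() in IMAGE_SUFFIXES}
    texts = {p.name[:-len(GT_SUFFIX)]: p for p in root.glob("*" + GT_SUFFIX)}

    pairs = [(images[k], texts[k]) for k in sorted(images.keys() & texts.keys())]
    images_without_text = sorted(images.keys() - texts.keys())
    texts_without_image = sorted(texts.keys() - images.keys())
    return pairs, images_without_text, texts_without_image
```

Anything that shows up in the second or third list is a line that would be
missing its transcription or its image.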

Good luck! It sounds like an interesting (but non-trivial) project.

Tom
