Hi and welcome to the group.

On Thursday, November 16, 2023 at 10:25:40 AM UTC-5 israel...@gmail.com wrote:
> I want to create an entirely new language from handwritten texts.

I think the "handwritten" aspect is probably at least as important as the "new language" part. Tesseract was designed to do optical character recognition of mechanically printed text. Handwriting is very different. There have been some attempts to do this in the past, but only with block-printed characters, and even then recognition rates were under 90%, which isn't adequate for most uses. If you search the archives here or google "tesseract handwriting" (without the quotes), you'll find lots of reading material.

> The language in question is Innu-aimun. The alphabet is quite simple, consisting of some of the Latin alphabets with the addition of a superscript u character that always appears after a consonant.

There is a Latin script model which has been trained in a language-independent fashion, so you could give that a try to see how well it does (modulo your superscript u). For training with natural images (standard training uses synthesized images), look at some of the examples in the tesstrain wiki <https://github.com/tesseract-ocr/tesstrain/wiki>, particularly the GT4HistOCR page <https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR>. For any training, you'll need ground truth text matched with your segmented line images.

Good luck! It sounds like an interesting (but non-trivial) project.

Tom
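
As a rough illustration of the "ground truth text matched with your segmented line images" requirement, here is a small Python sketch that checks a folder for unpaired files. It assumes the naming convention I understand tesstrain to use (a line image such as `line001.png` or `line001.tif` next to a transcription `line001.gt.txt`); the extension list and the function name `check_pairs` are my own assumptions for illustration, so check them against the tesstrain README before relying on this.

```python
from pathlib import Path

# Image extensions assumed from the tesstrain README; adjust if your setup differs.
# Longer suffixes come first so "x.bin.png" is reduced to "x", not "x.bin".
IMAGE_EXTS = (".bin.png", ".nrm.png", ".png", ".tif")
GT_EXT = ".gt.txt"


def base_name(name, exts):
    """Return name with the first matching extension stripped, or None."""
    for ext in exts:
        if name.endswith(ext):
            return name[: -len(ext)]
    return None


def check_pairs(gt_dir):
    """Return (images without a transcription, transcriptions without an image)."""
    images, texts = set(), set()
    for path in Path(gt_dir).iterdir():
        if not path.is_file():
            continue
        stem = base_name(path.name, (GT_EXT,))
        if stem is not None:
            texts.add(stem)
            continue
        stem = base_name(path.name, IMAGE_EXTS)
        if stem is not None:
            images.add(stem)
    return sorted(images - texts), sorted(texts - images)
```

You would call it as `missing_text, missing_image = check_pairs("data/innu-ground-truth")`, where the folder name is only a placeholder. Both lists should be empty before you start a training run.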