Hi and welcome to the group.

On Thursday, November 16, 2023 at 10:25:40 AM UTC-5 israel...@gmail.com 
wrote:

 I want to create an entirely new language from handwritten texts. 


I think the "handwritten" aspect is probably at least as important as the 
"new language" part. Tesseract was designed to do optical character 
recognition of mechanically printed texts. Handwriting is very different. 
There have been some attempts to do this in the past, but only with block 
printed characters and, even then, recognition rates were under 90%, which 
isn't adequate for most uses. If you search the archives here or google 
"tesseract handwriting" (without the quotes), you'll find lots of reading 
material.
 

The language in question is Innu-aimun. The alphabet is quite simple, 
consisting of some of the Latin alphabets with the addition of a 
superscript u character that always appears after a consonant.


There is a Latin script model which has been trained in a language 
independent fashion, so you could give that a try to see how well it does 
(modulo your superscript u). 
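
As a rough sketch of that first experiment, the snippet below shells out to 
the tesseract command line on a single line image. It assumes the Latin 
script model is installed as script/Latin.traineddata under your tessdata 
directory (adjust the language name if yours is installed differently), and 
the helper names and the image file name are made up for illustration.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="script/Latin", psm=7):
    """Build the tesseract command for one line image.

    Assumptions: the script model is reachable as "script/Latin", and the
    image holds a single text line (page segmentation mode 7).
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def recognize_line(image_path, lang="script/Latin", psm=7):
    """Run tesseract and return the recognized text (requires tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # "line_001.png" is a hypothetical file name; point it at one of your scans.
    print(recognize_line("line_001.png"))
```

Comparing the output against your own transcription of a handful of lines 
should tell you quickly whether the stock model is a usable starting point.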

For training with natural images (standard training uses synthesized 
images), look at some of the examples in the tesstrain wiki 
<https://github.com/tesseract-ocr/tesstrain/wiki>, particularly the 
GT4HistOCR page <https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR>.

For any training, you'll need ground truth text matched with your segmented 
line images.
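
To make the "matched" part concrete, here is a minimal sketch that checks a 
folder of line images for matching transcriptions. It assumes the tesstrain 
convention of a .gt.txt file next to each line image with the same base 
name; the function name and the list of image suffixes are my own choices.

```python
from pathlib import Path

# Assumed set of image types; extend it if your scans use another format.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}


def pair_ground_truth(directory):
    """Return (pairs, missing) for the line images in `directory`.

    pairs   -- list of (image_path, transcription) tuples
    missing -- images with no non-empty <name>.gt.txt beside them
    """
    directory = Path(directory)
    pairs, missing = [], []
    for image in sorted(directory.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        gt_file = image.parent / (image.stem + ".gt.txt")
        text = gt_file.read_text(encoding="utf-8").strip() if gt_file.exists() else ""
        if text:
            pairs.append((image, text))
        else:
            missing.append(image)
    return pairs, missing
```

Running this before training and fixing everything reported in `missing` 
avoids feeding the trainer line images that have no transcription.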

Good luck! It sounds like an interesting (but non-trivial) project.

Tom
