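Hi Romain,

On Friday, February 9, 2024 at 6:03:02 AM UTC-5 Romain B. (Le Belge) wrote:

> I'm trying to fix this issue. From what I have read, I think I need to re-train the Russian language in Tesseract for it to support accents. I found this <https://github.com/tesseract-ocr/langdata/tree/main/rus_accent> folder in langdata, but can't find a way to use it to re-train the Russian language. How can I use the rus_accent folder and its files to easily re-train the Russian language?

Looking at the history [1] for that folder makes me think that it was an incomplete work in progress, and it's also for the previous OCR engine.

You want to look at langdata_lstm/rus [2] for your training text and then use the fine-tuning directions [3] with the rus model from tessdata_best/rus.traineddata [4]. This would involve going through the training text, adding accents to some proportion of the vowels, and then rerunning the training. For example, there are 10 occurrences of the string балкон, and you could change some or all of them to have your accent mark (I don't know if there's a standard convention for encoding them).

As a caveat, I don't know if adding accented variants of all 10 vowels would be considered "a few characters" for the purposes of the fine-tuning instructions.

Good luck!

Tom

[1] https://github.com/tesseract-ocr/langdata/commits/main/rus_accent
[2] https://github.com/tesseract-ocr/langdata_lstm/blob/main/rus/rus.training_text
[3] https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters
[4] https://github.com/tesseract-ocr/tessdata_best/blob/main/rus.traineddata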
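
As a rough illustration of the "add accents to some proportion of the vowels" step described above, here is a minimal Python sketch. It rests on several assumptions that are not from the thread: the stress mark is encoded as a combining acute accent (U+0301) placed after the vowel, the stressed vowel of each word comes from a hypothetical `stress_dict` that you would have to supply yourself (the script cannot work out stress on its own), and the function names `accent_word` and `accent_text` are made up for this example. It only edits the training text; the fine-tuning itself still follows the directions in the Tesseract documentation.

```python
import random
import re

# Assumption: stress is encoded as a combining acute accent after the vowel.
COMBINING_ACUTE = "\u0301"
# The 10 Russian vowel letters, lower and upper case.
VOWELS = "аеёиоуыэюяАЕЁИОУЫЭЮЯ"


def accent_word(word, stress_index):
    """Put the accent after the vowel number `stress_index` (0-based, counting vowels only)."""
    seen = -1
    for pos, ch in enumerate(word):
        if ch in VOWELS:
            seen += 1
            if seen == stress_index:
                return word[:pos + 1] + COMBINING_ACUTE + word[pos + 1:]
    return word  # fewer vowels than expected: leave the word unchanged


def accent_text(text, stress_dict, proportion=0.5, seed=0):
    """Accent roughly `proportion` of the occurrences of words listed in `stress_dict`.

    `stress_dict` is a hypothetical mapping from a lowercase word to the
    0-based index of its stressed vowel, e.g. {"балкон": 1}.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same text

    def replace(match):
        word = match.group(0)
        index = stress_dict.get(word.lower())
        if index is None or rng.random() >= proportion:
            return word
        return accent_word(word, index)

    # Match runs of Cyrillic letters used in Russian.
    return re.sub(r"[а-яёА-ЯЁ]+", replace, text)
```

With `proportion=1.0`, every occurrence of балкон in the text becomes балко́н; with a lower value, only some occurrences are changed, so the model still sees the unaccented form as well.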
I'm trying to fix this issue. By what i have read, i think i need to re-train the russian language in tesseract for it to support accents. I found this <https://github.com/tesseract-ocr/langdata/tree/main/rus_accent> folder in langdata, but can't find a way to use it to re-train the russian language. How can i use the rus_accent folder and its files to easily re-train the russian language ? Looking at the history [1] for that folder makes me think that it was an incomplete work-in-progress, but it's also for the previous OCR engine. You want to look at langdata_lstm/rus [2] for your training text and then using the fine tuning directions [3] with the rus model from tessdata_best/rus.traineddata [4]. This would involve going through and adding accents to some proportion of the vowels and then rerunning the training. For example, there are 10 occurrences of the string балкон and you could change some or all of them to have your accent mark (I don't know if there's a standard convention for encoding them). As a caveat, I don't know if adding accented variants of all 10 vowels would be considered "a few characters" for the purposes of the finetuning instructions. Good luck! Tom [1] https://github.com/tesseract-ocr/langdata/commits/main/rus_accent [2] https://github.com/tesseract-ocr/langdata_lstm/blob/main/rus/rus.training_text [3] https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters [4] https://github.com/tesseract-ocr/tessdata_best/blob/main/rus.traineddata -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/dcc37079-d5af-47f0-bb12-28bbaf7195a4n%40googlegroups.com.