Hi Romain,

On Friday, February 9, 2024 at 6:03:02 AM UTC-5 Romain B. (Le Belge) wrote:


I'm trying to fix this issue. From what I have read, I think I need to 
re-train the Russian language in Tesseract for it to support accents.
I found this 
<https://github.com/tesseract-ocr/langdata/tree/main/rus_accent> folder in 
langdata, but can't find a way to use it to re-train the Russian language.

How can I use the rus_accent folder and its files to easily re-train the 
Russian language?


Looking at the history [1] for that folder makes me think that it was an 
incomplete work in progress, and it was also made for the previous OCR 
engine. You want to look at langdata_lstm/rus [2] for your training text and 
then follow the fine-tuning directions [3], starting from the rus model in 
tessdata_best/rus.traineddata [4]. This would involve going through the 
training text, adding accents to some proportion of the vowels, and then 
rerunning the training. For example, there are 10 occurrences of the string 
балкон and you could change some or all of them to have your accent mark (I 
don't know if there's a standard convention for encoding them).
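
A rough sketch of that editing step in Python, in case it helps. It rests on 
assumptions that are not confirmed anywhere in this thread: that the stress 
mark is encoded as the Unicode combining acute accent (U+0301) placed after 
the vowel, and that you have some stress dictionary of your own. The names 
STRESS_INDEX, add_stress and accent_some are made up for illustration; the 
training text itself does not say where the stress falls.

```python
import random
import re

# Assumption: stress is written as a combining acute accent after the vowel.
COMBINING_ACUTE = "\u0301"
VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")

# Hypothetical stress dictionary: lowercase word -> 0-based index of the
# stressed vowel letter. You would have to supply real data here.
STRESS_INDEX = {"балкон": 4}  # балко́н

WORD_RE = re.compile(r"[а-яёА-ЯЁ]+")


def add_stress(word, index):
    """Insert the combining accent right after the letter at `index`."""
    if word[index] not in VOWELS:
        raise ValueError(f"letter {index} of {word!r} is not a vowel")
    return word[: index + 1] + COMBINING_ACUTE + word[index + 1 :]


def accent_some(text, stress_index, proportion=0.5, seed=0):
    """Accent roughly `proportion` of the known words; leave the rest as-is."""
    rng = random.Random(seed)  # fixed seed so a rerun gives the same text

    def replace(match):
        word = match.group(0)
        index = stress_index.get(word.lower())
        if index is not None and rng.random() < proportion:
            return add_stress(word, index)
        return word

    return WORD_RE.sub(replace, text)
```

Calling accent_some(text, STRESS_INDEX, proportion=0.5) on the contents of 
rus.training_text would leave about half of the known words unaccented, so 
the model still sees both forms.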

As a caveat, I don't know whether adding accented variants of all 10 vowels 
would still count as "a few characters" for the purposes of the fine-tuning 
instructions.
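
To get a feel for how many new symbols a modified training text actually 
introduces, something like the following could be used. It is only a sketch 
under the same unverified assumption as before (combining acute accent, 
U+0301, after the vowel), and accented_vowels is a made-up helper name. 
Whether Tesseract ends up treating each vowel-plus-accent pair as its own 
character is something I have not checked.

```python
# Assumption: stress is written as a combining acute accent after the vowel.
COMBINING_ACUTE = "\u0301"
VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def accented_vowels(text):
    """Return the distinct vowel + accent pairs that occur in `text`."""
    found = set()
    # Walk over adjacent character pairs and keep vowel/accent combinations.
    for previous, current in zip(text, text[1:]):
        if current == COMBINING_ACUTE and previous in VOWELS:
            found.add(previous + current)
    return found
```

len(accented_vowels(text)) then gives the number of distinct accented vowels 
present, which can be compared against whatever "a few" turns out to mean.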

Good luck!

Tom

[1] https://github.com/tesseract-ocr/langdata/commits/main/rus_accent 
[2] https://github.com/tesseract-ocr/langdata_lstm/blob/main/rus/rus.training_text
[3] https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters
[4] https://github.com/tesseract-ocr/tessdata_best/blob/main/rus.traineddata
