Re: [tesseract-ocr] Make russian_with_accent traineddata file

Hi Romain,

On Friday, February 9, 2024 at 6:03:02 AM UTC-5 Romain B. (Le Belge) wrote:

I'm trying to fix this issue. From what I have read, I think I need to
re-train the Russian language in Tesseract for it to support accents.
I found this
<https://github.com/tesseract-ocr/langdata/tree/main/rus_accent> folder in
langdata, but can't find a way to use it to re-train the Russian language.

How can I use the rus_accent folder and its files to easily re-train the
Russian language?

Looking at the history [1] for that folder makes me think that it was an
incomplete work in progress, and it's also for the previous OCR engine.
Instead, look at langdata_lstm/rus [2] for your training text and then
use the fine-tuning directions [3] with the rus model from
tessdata_best/rus.traineddata [4]. This would involve going through the
training text, adding accents to some proportion of the vowels, and then
rerunning the training. For example, there are 10 occurrences of the string
балкон, and you could change some or all of them to have your accent mark (I
don't know if there's a standard convention for encoding them).
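
A minimal sketch of that text-editing step in Python, under some assumptions
that are not from the thread: the stress mark is encoded as the combining
acute accent U+0301 placed after the vowel, and you supply your own mapping
from plain words to their accented forms. The function name and its
parameters are hypothetical.

```python
import random
import re

# Assumption: stress is written as a combining acute accent after the vowel.
COMBINING_ACUTE = "\u0301"


def accent_occurrences(text, stressed_forms, proportion=0.5, seed=0):
    """Replace roughly `proportion` of the occurrences of each key in
    `stressed_forms` with its accented form.

    Matching is on plain substrings, so "балкон" also matches inside
    longer forms such as "балконы".
    """
    if not stressed_forms:
        return text
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    # Try longer keys first so a short key cannot shadow a longer one.
    keys = sorted(stressed_forms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def replace(match):
        word = match.group(0)
        if rng.random() < proportion:
            return stressed_forms[word]
        return word

    return pattern.sub(replace, text)


sample = "балкон, балкон, балкон"
forms = {"балкон": "балко" + COMBINING_ACUTE + "н"}
print(accent_occurrences(sample, forms, proportion=0.5))
```

Running this over a copy of rus.training_text with a larger mapping would give
a mixed text in which some occurrences carry the accent and the rest stay
plain.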

As a caveat, I don't know if adding accented variants of all 10 vowels
would be considered "a few characters" for the purposes of the fine-tuning
instructions.
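
One way to gauge how many new symbols an edited training text introduces is
to compare it with the original. The sketch below is only an illustration:
it assumes the accents are combining marks and attaches each mark to the
letter before it. How Tesseract itself groups a base letter with a combining
mark is not something this sketch tries to answer. The function names are
hypothetical.

```python
import unicodedata


def symbols(text):
    """Split text into base characters, each with any combining marks
    that follow it attached."""
    units = []
    for ch in text:
        if unicodedata.combining(ch) and units:
            units[-1] += ch  # attach the mark to the preceding base character
        else:
            units.append(ch)
    return units


def new_symbols(original_text, modified_text):
    """Return the symbols that appear in modified_text but not in
    original_text, sorted."""
    return sorted(set(symbols(modified_text)) - set(symbols(original_text)))


original = "балкон молоко"
modified = "балко\u0301н молоко\u0301"
added = new_symbols(original, modified)
print(len(added), [[f"U+{ord(c):04X}" for c in s] for s in added])
```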

Good luck!

Tom

[1] https://github.com/tesseract-ocr/langdata/commits/main/rus_accent
[2] https://github.com/tesseract-ocr/langdata_lstm/blob/main/rus/rus.training_text
[3] https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters
[4] https://github.com/tesseract-ocr/tessdata_best/blob/main/rus.traineddata
