Hello! I am using a custom traineddata file, but the OCR performance is poor, so I would like to ask for advice.

I have a specific font that I need to train on. I set the base model to kor (Korean), generated ground truth for that font from the kor training_text file, and trained on that data. The command I ran was:

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=HDharmony START_MODEL=kor TESSDATA=../tesseract/tessdata MAX_ITERATION=1000

This gave me the custom .traineddata file. I then tried to OCR my test.pdf again, this time with

    lang: List[str] = ["custom", "chi_sim", "eng"]

but the results are clearly worse than when I use the default traineddata and run OCR with

    lang: List[str] = ["kor", "chi_sim", "eng"]

What is going wrong here? My guess is that the model's generality for Korean decreased while I was fine-tuning it on a single font. How can I solve this?

- Should I increase the number of iterations?
- Would it be better to also train chi_sim and eng on the specific font and run OCR with lang: List[str] = ["custom_kor", "custom_chi_sim", "custom_eng"]?
- Or can I train Korean, English, and Chinese characters at the same time and produce a single custom_total.traineddata?

I don't know which approach is right, and I would really appreciate a detailed explanation. I will wait for your answer. Thank you.
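For context, the language list above is ultimately handed to Tesseract as a single string with the codes joined by "+". Here is a minimal sketch of how I build that string for the two runs (the actual OCR call, e.g. via pytesseract, is my assumption and is shown only as a comment):

```python
# Minimal sketch: Tesseract's -l flag accepts multiple languages joined by '+'.
from typing import List

def tesseract_lang_string(langs: List[str]) -> str:
    """Join language codes the way Tesseract's -l flag expects."""
    return "+".join(langs)

# Baseline run with the stock Korean model:
print(tesseract_lang_string(["kor", "chi_sim", "eng"]))
# Run with my fine-tuned model:
print(tesseract_lang_string(["custom", "chi_sim", "eng"]))

# Assumed OCR call (pytesseract), shown for reference only:
# text = pytesseract.image_to_string(image, lang=tesseract_lang_string(langs))
```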

