Hello! I want to use a custom traineddata file, but its performance is poor, so I 
would like to ask for advice.

I have a specific font that I need to train on. I set the base model to kor 
(Korean), generated ground truth for that font from the kor training_text file, 
and trained on that data. The command I used for training is as follows:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=HDharmony 
START_MODEL=kor TESSDATA=../tesseract/tessdata MAX_ITERATIONS=1000
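For completeness, this is roughly how I made the resulting model visible to Tesseract afterwards (the output path is my assumption from tesstrain's default layout and may differ):

```shell
# tesstrain writes the fine-tuned model to data/<MODEL_NAME>/<MODEL_NAME>.traineddata
# (assumed default layout). Copy it into the tessdata directory Tesseract uses,
# renaming it to the language code that will be passed at OCR time:
cp data/HDharmony/HDharmony.traineddata ../tesseract/tessdata/custom.traineddata

# Sanity check: the new language should now appear in the list.
tesseract --list-langs --tessdata-dir ../tesseract/tessdata
```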

Training produced custom.traineddata, so I ran OCR on my test.pdf again. The 
languages I used at that point were lang: List[str] = ["custom", 
"chi_sim", "eng"].

But the performance is clearly worse than when I use the default traineddata and 
run OCR with lang: List[str] = ["kor", "chi_sim", "eng"].

What is the problem with this?

I suspect that fine-tuning on a single font has reduced the generality of the 
Korean traineddata. How can I solve this?

Should I increase the number of iterations? Or would it be better to also 
fine-tune chi_sim and eng on the specific font and run OCR with lang: List[str] = 
["custom_kor", "custom_chi_sim", "custom_eng"]?

Or can I train Korean, English, and Chinese at the same time and create a single 
custom_total.traineddata?


I don't know which method is right. 

I would really appreciate it if you could explain it in detail. I will wait 
for your answer.
Thank you.

