Hello! I am using a custom traineddata file, but the OCR performance is poor, so I would like to ask for advice.

I have a specific font that I need to train on. I set the base model to kor (Korean), generated ground truth for that font from the kor training_text file, and trained on that data. The command I ran was:

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=HDharmony START_MODEL=kor TESSDATA=../tesseract/tessdata MAX_ITERATION=1000

This gave me the custom .traineddata file. I then tried to OCR my test.pdf again, this time with

    lang: List[str] = ["custom", "chi_sim", "eng"]

but the results are clearly worse than when I use the default traineddata and run OCR with

    lang: List[str] = ["kor", "chi_sim", "eng"]

What is going wrong here? My guess is that the model's generality for Korean decreased while I was fine-tuning it on a single font. How can I solve this?

- Should I increase the number of iterations?
- Would it be better to also train chi_sim and eng on the specific font and run OCR with lang: List[str] = ["custom_kor", "custom_chi_sim", "custom_eng"]?
- Or can I train Korean, English, and Chinese characters at the same time and produce a single custom_total.traineddata?

I don't know which approach is right, and I would really appreciate a detailed explanation. I will wait for your answer. Thank you.
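For context, the language list above is ultimately handed to Tesseract as a single string with the codes joined by "+". Here is a minimal sketch of how I build that string for the two runs (the actual OCR call, e.g. via pytesseract, is my assumption and is shown only as a comment):

```python
# Minimal sketch: Tesseract's -l flag accepts multiple languages joined by '+'.
from typing import List

def tesseract_lang_string(langs: List[str]) -> str:
    """Join language codes the way Tesseract's -l flag expects."""
    return "+".join(langs)

# Baseline run with the stock Korean model:
print(tesseract_lang_string(["kor", "chi_sim", "eng"]))
# Run with my fine-tuned model:
print(tesseract_lang_string(["custom", "chi_sim", "eng"]))

# Assumed OCR call (pytesseract), shown for reference only:
# text = pytesseract.image_to_string(image, lang=tesseract_lang_string(langs))
```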

