I had many similar issues, especially with input set in Yuan (rounded) fonts. In the end I identified the exact font used and ran additional training with that font.
Even after retraining, some characters were still confused with others (as in your case). To strengthen those, I added many instances of those characters, in various combinations, to the training data and ran the training again, e.g.:

大*叔*中文
*叔*大中文
*叔*
*叔**叔*
大中文*叔*
*叔/**叔*
etc.

Recognition got much better, but I still have an issue when the text is followed by an ellipsis or three dots, in which case Tesseract outputs nothing at all! See the conversation here: <https://groups.google.com/g/tesseract-ocr/c/hwX_YFRUXf4>.

For example, the image below produces no output at all. No idea why!

[image: bad_sub_243.png]

On Friday, July 19, 2024 at 12:28:37 PM UTC+8 John wrote:

> Is version
>
> On Friday, July 19, 2024 at 12:32:25 AM UTC+7 [email protected] wrote:
>
>> Hello, may I ask which version you are using? Would you mind sharing
>> your chi_sim and chi_sim_vert files?
>>
>> On Sunday, March 17, 2024 at 00:41:13 UTC+8, <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am making a transcript of YouTube videos using Tesseract.
>>> The images I feed to Tesseract look like this:
>>> [image: aftercut29.0.jpg]
>>>
>>> The output is mostly correct, but sometimes the same character is
>>> recognized differently.
>>> Example:
>>> Input:
>>> [image: aftercut3.0.jpg]
>>> Output: 大*叔*中文 - CORRECT
>>>
>>> Input:
>>> [image: aftercut10.5.jpg]
>>> Output: 今天不是3位 大*档* - INCORRECT
>>>
>>> To prepare the images I use:
>>>
>>> - *dilation*,
>>> - *cropping* to the area of the image containing characters,
>>> - adding *borders*.
>>>
>>> For dilation I use a 2x2 kernel, and the border is 2 px thick.
>>> For the segmentation method I am currently experimenting with
>>> *--psm 7* and *--psm 13*; --psm 7 seems to give slightly better
>>> results. Of course the language is *lang='chi_sim'*.
>>>
>>> Could you give me any advice on how to improve the robustness of
>>> the output?
>>>
>>> Thank you in advance,
>>> Jan
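
For anyone who wants to try the "many instances in various combinations" idea from the reply above, here is a minimal sketch of how such training lines could be generated. The function name `make_training_lines`, the `repeats` parameter, and the choice to insert the character at every position of a context string are illustrative assumptions, not the exact procedure used in the thread.

```python
def make_training_lines(target, contexts, repeats=(1, 2)):
    """Build text lines that show `target` alone and at every position
    inside each context string (hypothetical helper, for illustration)."""
    lines = []

    # Standalone runs of the character, e.g. 叔 and 叔叔.
    for n in repeats:
        lines.append(target * n)

    # Insert the character at the start, every interior gap, and the end
    # of each context string.
    for ctx in contexts:
        for pos in range(len(ctx) + 1):
            lines.append(ctx[:pos] + target + ctx[pos:])

    # Drop duplicates while keeping the original order.
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


# Example: the confused character from the thread, with one context string.
lines = make_training_lines("叔", ["大中文"])
for line in lines:
    print(line)
```

The resulting lines would then be appended to whatever training text your training setup renders with the target font; how that text is consumed depends on the pipeline and is not shown here.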

