I had many similar issues, especially with input set in Yuan (rounded) 
fonts.  In the end I found the exact font used and ran additional training 
with that font.  

Even after retraining, some characters were still confused with others (as 
in your case).  To strengthen those, I included many instances of the 
affected characters in various combinations in the training data and ran 
the training again, e.g.:
大*叔*中文
*叔*大中文
*叔*
*叔**叔*
大中文*叔*
*叔/**叔*
etc
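
If it helps, lines like the ones above can be generated rather than typed by 
hand.  This is only a rough sketch, not the script I actually used: the 
function name make_training_lines and the choice of variations (every 
insertion position, plus the character alone and doubled) are assumptions 
for illustration.

```python
def make_training_lines(target, context):
    """Return text lines that place `target` at every position in `context`.

    `target` is the character that keeps being misrecognised and `context`
    is a short string of other characters to combine it with.
    """
    lines = []
    # Insert the target before, between, and after the context characters.
    for i in range(len(context) + 1):
        lines.append(context[:i] + target + context[i:])
    # Also include the character on its own and repeated.
    lines.append(target)
    lines.append(target * 2)
    return lines


if __name__ == "__main__":
    for line in make_training_lines("叔", "大中文"):
        print(line)
```

The resulting lines would then be added to the training text before 
rendering it with the matching font.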

Recognition got much, much better, but I still have an issue when there is 
an ellipsis or three dots after the text, in which case it doesn't output 
anything at all!  See the conversation here 
<https://groups.google.com/g/tesseract-ocr/c/hwX_YFRUXf4>.

E.g., the image below produces no output at all...  No idea why!
[image: bad_sub_243.png]

On Friday, July 19, 2024 at 12:28:37 PM UTC+8 John wrote:

> to tesseract-ocr  Is version
> On Friday, July 19, 2024 at 12:32:25 AM UTC+7 [email protected] wrote:
>
>> Hello, may I ask which version you are using? Would you mind sharing 
>> your chi_sim and chi_sim_vert files?
>>
>> On Sunday, March 17, 2024 at 00:41:13 UTC+8 <[email protected]> wrote:
>>
>>> Hello, 
>>>
>>> I am making a transcript of YouTube videos using Tesseract. 
>>> The images I input to Tesseract look like this:
>>> [image: aftercut29.0.jpg]
>>>
>>> The output is mostly correct, but sometimes the same character gives 
>>> different outputs.
>>> Example: 
>>> Input:
>>> [image: aftercut3.0.jpg]
>>> Output: 大*叔*中文 - CORRECT
>>>
>>> Input:
>>> [image: aftercut10.5.jpg] 
>>> Output: 今天不是3位 大*档* - INCORRECT
>>>
>>> To prepare the images I use:
>>>
>>>    - *dilation*,
>>>    - *cropping the area* of the image containing the characters,
>>>    - adding *borders*.
>>>
>>> For dilation I use a 2x2 kernel, and the border is 2px thick.
>>> For the segmentation method I am currently experimenting with *--psm 7* 
>>> and *--psm 13*. --psm 7 seems to give slightly better results. Of course 
>>> the language is: *lang='chi_sim'*
>>>
>>> Could you give me any advice on how to improve the robustness of the output?
>>>
>>> Thank you in advance,
>>> Jan
>>>
>>
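
For anyone wanting to try the settings from the quoted message, here is a 
minimal sketch of the Tesseract call.  It assumes the command-line 
tesseract binary is on PATH and the chi_sim traineddata is installed.  The 
helper names build_tesseract_cmd and run_ocr are made up for illustration, 
and the original poster appears to use a Python wrapper instead, so this is 
not their exact code.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="chi_sim", psm=7):
    """Build the argument list for a single-line OCR run.

    --psm 7 treats the image as a single text line; --psm 13 is the
    "raw line" mode mentioned in the thread.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, lang="chi_sim", psm=7):
    """Run Tesseract on an already preprocessed image and return the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()
```

Calling run_ocr once with psm=7 and once with psm=13 on the same cropped 
image is a quick way to compare the two modes on a given subtitle.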
