[tesseract-ocr] Tesseract OCR reads character wrongly, reading extra characters.

Jiansen Chan Sun, 04 May 2025 23:00:34 -0700

I custom-trained a model; the configuration is shown below:
custom_config = f'--oem 3 --psm 6 -l jpn22'


However, when I use a debugger to check what is actually being scanned, this 
is shown. "S" cannot be read because it is assumed to contain two different 
characters (hence the two bounding boxes in the picture with the "S"), and 
the "3L" picture is read as "3LL". 
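
For reference, the same check can be done without a debugger. The sketch below is only an illustration, not my actual script: it assumes the character boxes are available in Tesseract's box format (`<char> <left> <bottom> <right> <top> <page>`), and the helper names, the 0.5 threshold, and the sample coordinates are all made up. It flags neighbouring boxes that overlap heavily, which is what a glyph that was split or read twice looks like.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_boxes(box_text: str) -> List[CharBox]:
    """Parse box-format lines: '<char> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append(CharBox(parts[0], left, bottom, right, top))
    return boxes


def horizontal_overlap(a: CharBox, b: CharBox) -> float:
    """Fraction of the narrower box's width that both boxes share."""
    shared = min(a.right, b.right) - max(a.left, b.left)
    narrower = min(a.right - a.left, b.right - b.left)
    if shared <= 0 or narrower <= 0:
        return 0.0
    return shared / narrower


def suspicious_pairs(
    boxes: List[CharBox], threshold: float = 0.5
) -> List[Tuple[CharBox, CharBox]]:
    """Return neighbouring boxes that overlap by at least `threshold` (arbitrary cutoff)."""
    return [
        (a, b)
        for a, b in zip(boxes, boxes[1:])
        if horizontal_overlap(a, b) >= threshold
    ]


# Hypothetical box output resembling the "3L" -> "3LL" case; coordinates are invented.
sample = "3 10 5 30 40 0\nL 32 5 50 40 0\nL 34 5 50 40 0\n"
pairs = suspicious_pairs(parse_boxes(sample))
for first, second in pairs:
    print(f"possible duplicate read: {first.char!r} and {second.char!r}")
```

If pytesseract is the wrapper in use (an assumption), text in this format can be obtained from `pytesseract.image_to_boxes(image, config=custom_config)`.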

The language model I'm using is for Japanese Kanji, but it should be able 
to read these letters because the unicharset for the jpn model also includes 
Roman capital letters. I've tried reducing the amount of training data that 
contains repeated samples of these characters, so I don't think it is a 
matter of overfitting. 

Can I get some advice on this?



