[tesseract-ocr] Issue with Colon Recognition After Fine-Tuning Tesseract 5.5.3 on Russian Dataset

Sandeep G Mon, 03 Nov 2025 06:09:57 -0800


I’m currently fine-tuning the Tesseract OCR model (version 5.5.3) and have
encountered an issue with symbol and digit recognition.

With the original Tesseract weight file, the model failed to recognize the
colon (:) symbol. To address this, I fine-tuned the model using 500 ROIs.
After fine-tuning, the model recognized the colon; however, some digits
began to be misrecognized. For example, ‘5’ was sometimes recognized as ‘6’.

When I used a combination of the original Russian model and the fine-tuned
Russian model, the digits were recognized correctly, but the colon symbol
was again missing.
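
To put numbers on the two failure modes above (dropped colons versus digit substitutions such as 5 → 6), each OCR output could be compared against its ground-truth string character by character. The following is only a minimal sketch using Python's standard difflib; the function name char_confusions is hypothetical and not part of Tesseract or tesstrain.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(truth: str, ocr: str) -> Counter:
    """Count (expected, recognized) character pairs where OCR differs from truth.

    A dropped character is recorded as (expected, "") and a spurious one
    as ("", recognized).
    """
    confusions = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        expected, got = truth[i1:i2], ocr[j1:j2]
        if tag == "replace" and len(expected) == len(got):
            # Same-length replacement: pair the characters position by position.
            confusions.update(zip(expected, got))
        else:
            # Deletions, insertions, and uneven replacements.
            confusions.update((ch, "") for ch in expected)
            confusions.update(("", ch) for ch in got)
    return confusions
```

Summing these counters over a held-out set for the original model, the fine-tuned model, and the combination would show how often each one drops ":" and how often it confuses specific digit pairs.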

*Approaches Tried (but didn’t yield the desired results):*

- Converted the images to binary
- Performed noise removal
- Applied CLAHE
- Tried all PSM modes
- Enabled early stopping to avoid overfitting

*Training Command Used:*
make training MODEL_NAME=rusfinetune START_MODEL=rus MAX_ITERATIONS=4000 \
  STOP_TRAINING_CONVERGED=true TESSDATA=/usr/local/share/tessdata
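
As a possible diagnostic for a run like this, it may help to check how often the colon and each digit actually occur in the fine-tuning ground truth. The sketch below assumes the transcriptions are stored as *.gt.txt files in a single directory (for tesstrain this is usually data/rusfinetune-ground-truth, but that path is an assumption here); the helper name count_gt_chars is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_gt_chars(gt_dir: str, chars: str = ":0123456789") -> Counter:
    """Count how often each character of interest occurs in *.gt.txt files."""
    # Start every character at zero so absent ones still show up in the result.
    counts = Counter({ch: 0 for ch in chars})
    for path in sorted(Path(gt_dir).glob("*.gt.txt")):
        text = path.read_text(encoding="utf-8")
        for ch in text:
            if ch in counts:
                counts[ch] += 1
    return counts
```

For example, count_gt_chars("data/rusfinetune-ground-truth") returns a Counter keyed by character. A heavily skewed distribution (many colons, few samples of a given digit) would be one thing to rule out before looking elsewhere.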

What could be the root cause of this issue, and are there any suggestions
to resolve it?

For your reference, I’ve attached the sample images.

sample_Images
<https://acgworld-my.sharepoint.com/:f:/g/personal/sandeep_reddy_acg-world_com/EpMejgMNpZ1BmeETDsPCsHkBzP3c6dsr4ZlYfFKMc6PUyQ?e=vn0FGv>

Thank you for your time and support.

