I’m fine-tuning the Tesseract OCR model (version 5.5.3) and have run into
a problem with symbol and digit recognition.
With the original Tesseract weight file, the colon ( : ) symbol was not
recognized. To address this, I fine-tuned the model on 500 ROIs (cropped
regions of interest). After fine-tuning, the model recognizes the colon;
however, some digits are now misrecognized: for example, ‘5’ is sometimes
read as ‘6’.
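To put numbers on the digit errors, the confusions can be tallied from pairs of ground-truth text and OCR output. The following is only a minimal sketch using the Python standard library; the function name `count_confusions` and the sample strings are hypothetical and not part of my actual pipeline.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_confusions(pairs):
    """Tally character substitutions and deletions.

    `pairs` is an iterable of (ground_truth, ocr_text) strings.
    Substitutions are counted only for equal-length 'replace' spans,
    so a dropped character (such as a missing colon) is reported
    separately as a deletion.
    """
    substitutions = Counter()
    deletions = Counter()
    for truth, ocr in pairs:
        matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                # Pair up characters position by position.
                for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                    substitutions[(expected, got)] += 1
            elif tag == "delete":
                # Characters present in the ground truth but absent in OCR.
                for ch in truth[i1:i2]:
                    deletions[ch] += 1
    return substitutions, deletions


# Hypothetical sample data: one digit confusion and one dropped colon.
sample_pairs = [("12:45", "12:46"), ("05:30", "0530")]
subs, dels = count_confusions(sample_pairs)
```

Running this over the outputs of the original and the fine-tuned model would show whether the ‘5’ to ‘6’ confusion is the dominant error or one of several.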
When I used a combination of the original Russian model and the fine-tuned
Russian model, the digits were recognized correctly, but the colon symbol
was again missing.
*Approaches Tried (but didn’t yield the desired results):*
- Converted the images to binary
- Performed noise removal
- Applied CLAHE
- Tried all PSM modes
- Enabled early stopping to avoid overfitting
*Training Command Used:*
make training MODEL_NAME=rusfinetune START_MODEL=rus MAX_ITERATIONS=4000 \
  STOP_TRAINING_CONVERGED=true TESSDATA=/usr/local/share/tessdata
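One thing that may be worth checking is how often each character appears in the fine-tuning ground truth, since a set dominated by colons with few examples of some digits could plausibly shift digit recognition. This is a hedged sketch, not a confirmed diagnosis: it assumes the default tesstrain layout, where transcriptions are stored as `*.gt.txt` files under `data/rusfinetune-ground-truth`, and the helper name `char_frequencies` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def char_frequencies(ground_truth_dir, pattern="*.gt.txt"):
    """Count non-whitespace characters across ground-truth transcriptions.

    Assumes one UTF-8 transcription file per ROI matching `pattern`
    (the tesstrain default is `*.gt.txt`); adjust if the data is
    laid out differently.
    """
    counts = Counter()
    for path in sorted(Path(ground_truth_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        counts.update(ch for ch in text if not ch.isspace())
    return counts


if __name__ == "__main__":
    # Assumed directory name, derived from MODEL_NAME=rusfinetune.
    freqs = char_frequencies("data/rusfinetune-ground-truth")
    for ch, n in freqs.most_common():
        print(repr(ch), n)
```

If digits such as ‘5’ and ‘6’ turn out to be rare compared with the colon, adding more ROIs that contain them would be one thing to try.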
May I know what could be the root cause of this issue or any suggestions to
resolve it?
For your reference, I’ve attached the sample images.
sample_Images
<https://acgworld-my.sharepoint.com/:f:/g/personal/sandeep_reddy_acg-world_com/EpMejgMNpZ1BmeETDsPCsHkBzP3c6dsr4ZlYfFKMc6PUyQ?e=vn0FGv>
Thank you for your time and support.