[tesseract-ocr] OpenCV Python preprocessing strategies for OCR (pytesseract) character recognition

Jokūbas Žižiūnas Fri, 03 Jan 2025 15:18:04 -0800

I would like to ask which preprocessing techniques work best for the kind
of lettering I need to read. I am using pytesseract for character
recognition, but some characters are not recognized reliably.

I have attached a couple of sample images; my full set contains more.

The most common issues are:
- 5 gets recognized as S (but not vice versa)
- S gets recognized as O (but not vice versa)
- / gets recognized as I

I have tried multiple techniques, but whenever one technique fixes an
issue, another issue pops up. Recognition works most of the time, I would
say ~80%, but it is not consistent. I can take a picture, run the
processing, and recognition works; then I take a new picture under the same
conditions and recognition fails. It seems the result flips on differences
that are within the noise level of the image.

I believe that a large part of the issue is that the font is bold. For
example, I noticed that the wider the / is, the more likely it is to be
recognized as I. I have tried cv2.resize(fx=2, fy=2) + cv2.erode(), but
then I noticed the opposite for 5: the thicker the 5 is, the less likely it
is to be recognized as S. At the same time, if the characters are thicker,
or if I lower the binarization threshold, the hole in the 4 gets filled in,
which causes problems of its own.

I cannot change the font. I have tried taking pictures at various
exposures, but nothing seems to fix the core of the issue. This is the best
focus I am able to obtain. I cannot whitelist certain symbols, because both
letters and numbers are possible. I do not want to do .replace('SX', '5X'),
because the point of the check is to validate that the label has been
printed correctly.
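
To make the validation requirement concrete, here is a hedged sketch of a
strict comparison between the OCR output and the expected label, with no
character substitutions. The helper names read_label and label_mismatches
are hypothetical. The --psm 7 setting (treat the image as a single text
line) is an assumption about the label layout, not something stated above.

```python
def read_label(image, psm=7):
    """OCR one preprocessed image.

    Needs pytesseract and the Tesseract binary. psm=7 (single text
    line) is an assumed layout.
    """
    # Imported here so label_mismatches works without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_string(image, config=f"--psm {psm}")


def label_mismatches(ocr_text, expected):
    """Return (position, expected_char, ocr_char) for every difference.

    Only surrounding whitespace is stripped; nothing is substituted, so
    a 5 read as S is reported rather than silently repaired.
    """
    got = ocr_text.strip()
    want = expected.strip()
    diffs = [(i, w, g) for i, (w, g) in enumerate(zip(want, got)) if w != g]
    if len(got) != len(want):
        # Report the unmatched tail of whichever string is longer.
        cut = min(len(got), len(want))
        diffs.append((cut, want[cut:], got[cut:]))
    return diffs
```

For example, label_mismatches("SX/12\n", "5X/12") reports the first
character as a mismatch instead of rewriting it.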

Techniques I have tried:
- Regular binarization
- OTSU binarization
- Adaptive thresholding
- Resize + erode()
- Upscaling the image with cv2.dnn_superres: somewhat better, but too slow,
because I have a lot of images to process
- Histogram equalization before any of the above

NOTE: I am able to find a solution for the sample images, but I am unable
to get a consistent solution when the images vary slightly; I cannot get it
to work 100% of the time.

Can someone advise how you would go about cleaning up these images?

