What are the best pre-processing techniques for the labels I need to read?
I am using pytesseract for character recognition, but some characters are
not recognized correctly.

I have attached a couple of sample images; my full set is larger.


The most common issues are:
- 5 gets recognized as S (but not vice versa)
- S gets recognized as O (but not vice versa)
- / gets recognized as I

I have tried multiple techniques, but when one technique fixes an issue,
another issue pops up. Character recognition works most of the time, but
it is not consistent: I would say about 80% of reads are correct. I can
take a picture, process it, and the recognition works; then I take a new
picture under the same conditions and the recognition fails. It seems the
result changes with ordinary image noise.

I believe a large part of the issue is that the font is bold. For example,
I noticed that the wider the / is, the more likely it is to be recognized
as I. I have tried cv2.resize(fx=2, fy=2) + cv2.erode(), but then I
noticed that the thicker the 5 is, the less likely it is to be recognized
as S. At the same time, if the characters are thicker, or if I reduce the
binarization threshold, the hole in the 4 gets filled in, which causes
further problems.

I cannot change the font. I have tried taking pictures at various
exposures, but nothing seems to fix the core of the issue. This is the
best focus I am able to obtain. I cannot whitelist certain symbols,
because both letters and numbers are possible. I do not want to do
.replace('SX', '5X'), because the point of the check is to validate that
the label has been printed correctly.
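To illustrate why a blind .replace() is not acceptable, here is a sketch
of a comparison that never rewrites the OCR output and only classifies
it. It assumes the expected label text is known at validation time, which
is an assumption on top of what is described above. The names
KNOWN_CONFUSIONS and compare_label are hypothetical; the confusion pairs
are the three listed earlier, in the direction they occur.

```python
# (expected character, character the OCR tends to return instead)
KNOWN_CONFUSIONS = {("5", "S"), ("S", "O"), ("/", "I")}


def compare_label(expected, recognized):
    """Classify an OCR read against the expected label text.

    Returns (status, suspects) where status is:
      "pass"    - the strings match exactly
      "recheck" - every mismatch is a known OCR confusion, so the image
                  should be captured again rather than accepted or rejected
      "fail"    - at least one mismatch is not a known confusion
    """
    if len(expected) != len(recognized):
        return "fail", []
    suspects = []
    for index, (exp_ch, rec_ch) in enumerate(zip(expected, recognized)):
        if exp_ch == rec_ch:
            continue
        if (exp_ch, rec_ch) in KNOWN_CONFUSIONS:
            suspects.append((index, exp_ch, rec_ch))
        else:
            return "fail", []
    return ("recheck" if suspects else "pass"), suspects
```

A "recheck" result does not pass the label; it only separates reads that
are probably OCR mistakes from reads that clearly differ from the
expected text.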

Techniques I have tried:
- Regular binarization
- OTSU binarization
- Adaptive thresholding
- Resize + erode()
- Upscaling the image with cv2.dnn_superres: somewhat better, but too slow,
because I have a lot of images to process
- Histogram equalization before any of the above

NOTE: I am able to get a working solution for the sample images, but I
cannot get a consistent one when the images vary slightly; I cannot get it
to work 100% of the time.
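Since a pipeline tuned on a few samples does not hold up on new captures,
one way to compare techniques is to score each one over the whole set of
labelled images at once. The sketch below assumes a list of (image,
expected text) pairs is available; score_pipelines and its arguments are
hypothetical names, and the OCR call is passed in as a function so the
harness itself does not depend on pytesseract.

```python
from collections import Counter


def score_pipelines(samples, pipelines, ocr):
    """Score each pre-processing pipeline over all labelled samples.

    samples   - list of (image, expected_text) pairs
    pipelines - dict mapping a name to a function image -> image
    ocr       - function image -> recognized text
    """
    report = {}
    for name, preprocess in pipelines.items():
        correct = 0
        confusions = Counter()
        for image, expected in samples:
            text = ocr(preprocess(image)).strip()
            if text == expected:
                correct += 1
            elif len(text) == len(expected):
                # Count per-character substitutions such as ("5", "S").
                confusions.update(
                    (e, r) for e, r in zip(expected, text) if e != r
                )
        report[name] = {
            "accuracy": correct / len(samples),
            "confusions": confusions,
        }
    return report
```

With pytesseract, the ocr argument could be something like
lambda img: pytesseract.image_to_string(img, config="--psm 7"), where
--psm 7 tells Tesseract to treat the image as a single text line; that
setting is an assumption about the label layout.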

Can someone explain how you would go about cleaning up these images?
