As per my question on StackOverflow: PyTesseract not recognizing decimals <https://stackoverflow.com/questions/64203559/pytesseract-not-recognizing-decimals>
I'm using PyTesseract to recognise text in table cells. When it comes to drug doses with decimal points, the OCR fails to recognise the period character ( . ), though it is accurate for everything else. I'm using tesseract v5.0.0-alpha.20200328 on Windows 10.

My pre-processing consists of upscaling by 400% with cubic interpolation, conversion to black and white, dilation and erosion, morphology, and blurring. I've tried many combinations of these (as well as each on its own), and none of them has led to the period being recognised. I've also tried various --psm values and a character whitelist. I believe the font is Segoe UI.

Before processing: [image: S87rd.png] <https://i.stack.imgur.com/S87rd.png>

After processing: [image: OFjoL.png] <https://i.stack.imgur.com/OFjoL.png>

PyTesseract output: 25mg »p

Processing code attached.
import cv2
import numpy as np
import pytesseract

image = cv2.imread('01.png')

# Upscale by 400% with cubic interpolation
upscaled_image = cv2.resize(image, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# Convert to grayscale
bw_image = cv2.cvtColor(upscaled_image, cv2.COLOR_BGR2GRAY)

# Dilate, then erode, with a 2x2 kernel
kernel = np.ones((2, 2), np.uint8)
dilated_image = cv2.dilate(bw_image, kernel, iterations=1)
eroded_image = cv2.erode(dilated_image, kernel, iterations=1)

# Binarize with a fixed threshold
thresh = cv2.threshold(eroded_image, 205, 255, cv2.THRESH_BINARY)[1]

# Morphological closing with a 3x3 rectangular kernel
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
morph_image = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

# Bilateral filter, then Otsu threshold
blur_image = cv2.threshold(cv2.bilateralFilter(morph_image, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

final_image = blur_image

# Run OCR with page segmentation mode 10
text = pytesseract.image_to_string(final_image, lang='eng', config='--psm 10')
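The message mentions trying several --psm values and a character whitelist, but the attached script shows only one configuration. A minimal sketch of how such a sweep might be run is below. The helper names `build_config` and `sweep_psm`, the whitelist string, and the list of psm values are illustrative assumptions, not the exact settings that were tried.

```python
from typing import Callable, Dict, Iterable


def build_config(psm: int, whitelist: str = "") -> str:
    """Build a Tesseract config string for one page segmentation mode.

    The whitelist is passed through Tesseract's tessedit_char_whitelist
    variable; an empty whitelist leaves the character set unrestricted.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def sweep_psm(
    image,
    ocr: Callable[..., str],
    psm_values: Iterable[int] = (6, 7, 8, 10, 13),   # assumed values
    whitelist: str = "0123456789.mg",                # assumed whitelist
) -> Dict[int, str]:
    """Run `ocr` once per psm value and collect the stripped text.

    `ocr` is expected to accept (image, lang=..., config=...), which is
    the calling convention of pytesseract.image_to_string.
    """
    results = {}
    for psm in psm_values:
        config = build_config(psm, whitelist)
        results[psm] = ocr(image, lang="eng", config=config).strip()
    return results


# Hypothetical usage with the pre-processed image from the script above:
#   import pytesseract
#   for psm, text in sweep_psm(final_image, pytesseract.image_to_string).items():
#       print(psm, repr(text))
```

Passing the OCR function in as an argument keeps the sweep independent of any installed Tesseract binary, so the config strings can be checked on their own before running against real images.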