For numbers I used this, and it works fine with AEN (Eastern Arabic numeral) digits: https://github.com/ahmed-tea/tessdata_Arabic_Numbers
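If the recognized AEN digits then need to feed downstream fields, they can be normalized to Western digits in post-processing. A minimal sketch, assuming the standard U+0660–U+0669 codepoints:

```python
# Map Eastern Arabic (AEN) digits to Western digits for post-processing
# recognized text. Assumes the standard U+0660..U+0669 codepoints.
EASTERN = '٠١٢٣٤٥٦٧٨٩'
WESTERN = '0123456789'
TO_WESTERN = str.maketrans(EASTERN, WESTERN)

def normalize_digits(text):
    """Replace every Eastern Arabic digit with its Western equivalent."""
    return text.translate(TO_WESTERN)

print(normalize_digits('١٩٩٥/٠٣/١٢'))  # -> 1995/03/12
```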
On Thursday, 13 August 2020 13:41:12 UTC+2, Anuradha B wrote:
>
> I am trying to extract the Arabic dates and numbers from a national ID
> card. I am using the following code in an Anaconda Jupyter notebook. I
> have also attached the image I used and the outputs from the grayscale,
> threshold, Canny, etc. steps, but none of the extracted text contains
> the dates and numerals. [I have also installed the Tesseract 4.0
> alpha.] Please suggest.
>
> import cv2
> import numpy as np
> import pytesseract
> from PIL import Image
> from matplotlib import pyplot as plt
>
> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>
> img = cv2.imread('image2.jpg')
>
> # get grayscale image
> def get_grayscale(image):
>     return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>
> # noise removal
> def remove_noise(image):
>     return cv2.medianBlur(image, 5)
>
> # thresholding
> def thresholding(image):
>     return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>
> # dilation
> def dilate(image):
>     kernel = np.ones((5, 5), np.uint8)
>     return cv2.dilate(image, kernel, iterations=1)
>
> # erosion
> def erode(image):
>     kernel = np.ones((5, 5), np.uint8)
>     return cv2.erode(image, kernel, iterations=1)
>
> # opening - erosion followed by dilation
> def opening(image):
>     kernel = np.ones((5, 5), np.uint8)
>     return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
>
> # canny edge detection
> def canny(image):
>     return cv2.Canny(image, 100, 200)
>
> # skew correction
> def deskew(image):
>     coords = np.column_stack(np.where(image > 0))
>     angle = cv2.minAreaRect(coords)[-1]
>     if angle < -45:
>         angle = -(90 + angle)
>     else:
>         angle = -angle
>     (h, w) = image.shape[:2]
>     center = (w // 2, h // 2)
>     M = cv2.getRotationMatrix2D(center, angle, 1.0)
>     rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
>                              borderMode=cv2.BORDER_REPLICATE)
>     return rotated
> # template matching
> def match_template(image, template):
>     return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
>
> image = cv2.imread('image2.jpg')
>
> gray = get_grayscale(image)
> thresh = thresholding(gray)
> opening = opening(gray)
> canny = canny(gray)
>
> text = pytesseract.image_to_string(image, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(gray, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(thresh, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(opening, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(canny, lang='eng+ara')
> print(text)
>
> On Sunday, 12 July 2020 at 4:30:40 pm UTC+5:30 shree wrote:
>
>> What character are you trying to add?
>> Please share the training data to try and replicate the issue.
>>
>> On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My use case is Arabic documents. The pretrained ara.traineddata is
>>> good but not perfect, so I wish to fine-tune ara.traineddata; if the
>>> results are still not satisfying, I will train my own custom model.
>>>
>>> Please advise on the following:
>>>
>>> 1. For my Arabic text use case, the problem is one character that is
>>> always predicted wrong. Do I need to add the document's font
>>> (Traditional Arabic) and train? If so, please provide the procedure,
>>> or a link, for adding one font to the pretrained ara.traineddata.
>>> 2. Whether fine-tuning or training from scratch, how many gt.txt
>>> files do I need, and how many characters should be in each file? Any
>>> approximate iteration count, if you know?
>>> 3. For numbers, the prediction is totally wrong on Arabic numerals.
>>> Do I need to start from scratch, or fine-tune? Either way, how do I
>>> prepare datasets for this?
>>> 4. How do I decide max_iterations? Is there a ratio of dataset size
>>> to iterations?
>>>
>>> *Below are my trials:*
>>>
>>> *For Arabic numbers:*
>>>
>>> -> I tried to custom-train only Arabic numbers.
>>> -> I wrote a script to write 100,000 numbers into multiple gt.txt
>>> files, with hundreds of characters in each gt.txt file.
>>> -> Then another script to convert the text to images (text2image),
>>> which should look more like scanned images.
>>> -> Parameters used, in the order below:
>>>
>>> text2image --text test.gt.txt --outputbase /home/user/output
>>> --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial'
>>> --degrade_image false --rotate_image --exposure 2 --resolution 300
>>>
>>> 1. How much data do I need to prepare for Arabic numbers? As of now
>>> it is required only for 2 specific fonts, which I already have.
>>> 2. Will the dataset contain duplicates if I follow this procedure?
>>> If yes, is there a way to avoid it?
>>> 3. Is it better to create more gt.txt files with fewer characters in
>>> each (e.g. 50,000 gt files with 10 numbers per file) or fewer gt.txt
>>> files with more characters (e.g. 1,000 gt files with 500 numbers per
>>> file)?
>>>
>>> If possible, please guide me through the dataset-preparation
>>> procedure.
>>>
>>> For testing I tried 50,000 English numbers, with each number in its
>>> own gt.txt file (e.g. "2500" written to 2500.gt.txt), for 20,000
>>> iterations, but it fails.
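The gt.txt generation step being asked about (many small files versus fewer large ones) can be sketched in a few lines; the directory name, file count, and lines-per-file values below are illustrative assumptions, not tesstrain requirements:

```python
# Sketch: write random Eastern Arabic numbers into a small set of
# gt.txt files with many lines each. Counts and paths are illustrative.
import os
import random

EASTERN = '٠١٢٣٤٥٦٧٨٩'

def random_number(max_digits=6):
    """Return a random number rendered in Eastern Arabic digits."""
    n = random.randint(0, 10 ** max_digits - 1)
    return str(n).translate(str.maketrans('0123456789', EASTERN))

def write_gt_files(out_dir, n_files=100, lines_per_file=500):
    """Create n_files ground-truth files, one number per line."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(n_files):
        lines = [random_number() for _ in range(lines_per_file)]
        path = os.path.join(out_dir, f'{i:05d}.gt.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(lines) + '\n')

write_gt_files('numbers-ground-truth', n_files=10, lines_per_file=50)
```

Each generated file can then be fed to text2image as in the command above; duplicates across files are likely with random sampling, so deduplicating the generated lines first may be worthwhile.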
>>> *For Arabic text:*
>>>
>>> -> Prepared around 23k gt.txt files, each containing one sentence.
>>> -> Generated .box and small .tif files for all gt.txt files using
>>> one font (Traditional Arabic).
>>> -> Used the tesstrain repo and trained for 20,000 iterations.
>>> -> After training, generated foo.traineddata with a 0.03 error rate.
>>> -> Ran prediction on real data: it works perfectly for the
>>> particular character on which the pretrained ara.traineddata fails,
>>> but in overall accuracy the pretrained ara.traineddata performs
>>> better, except for that one character.
>>>
>>> *Summary:*
>>>
>>> - How can I fix one character in the pretrained ara.traineddata
>>> model? If that is not possible, how do I custom-train from scratch,
>>> or is there a way to annotate real images and prepare a dataset?
>>> Please suggest the best practice.
>>> - How do I prepare an Arabic number dataset and train on it? If
>>> custom training on numbers is not possible, can Arabic numbers be
>>> added to the pretrained model (ara.traineddata)?
>>>
>>> GitHub repo used for custom training of Arabic text and numbers:
>>> https://github.com/tesseract-ocr/tesstrain
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>>> send an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ae16a6eb-e697-40cf-b539-f33cbf876416o%40googlegroups.com.
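For the original question of pulling dates and numerals out of the recognized text, one option is to post-process the image_to_string output with a pattern that accepts both Western and Eastern Arabic digits. The date shape and minimum ID length below are illustrative guesses, not tuned to any specific national ID layout:

```python
import re

# Character class covering Western digits and Eastern Arabic digits
# (U+0660..U+0669); the surrounding patterns are illustrative only.
DIGIT = r'[0-9٠-٩]'
DATE_RE = re.compile(rf'{DIGIT}{{4}}\s*/\s*{DIGIT}{{1,2}}\s*/\s*{DIGIT}{{1,2}}')
ID_RE = re.compile(rf'{DIGIT}{{5,}}')  # runs of 5+ digits as candidate IDs

def extract_dates(text):
    """Return all year/month/day-style matches found in the text."""
    return DATE_RE.findall(text)

def extract_ids(text):
    """Return all long digit runs found in the text."""
    return ID_RE.findall(text)

print(extract_dates('issued ٢٠٢٠/٠٨/١٣'))  # -> ['٢٠٢٠/٠٨/١٣']
```

This kind of filtering also makes it easier to compare the five preprocessing variants above: run the extraction on each output and keep whichever variant yields the expected fields.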