For numbers I used this, and it works fine with AEN (Eastern Arabic numeral) digits: https://github.com/ahmed-tea/tessdata_Arabic_Numbers
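If the recognized AEN digits then need to feed downstream fields, they can be normalized to Western digits in post-processing. A minimal sketch, assuming the standard U+0660–U+0669 codepoints:

```python
# Map Eastern Arabic (AEN) digits to Western digits for post-processing
# recognized text. Assumes the standard U+0660..U+0669 codepoints.
EASTERN = '٠١٢٣٤٥٦٧٨٩'
WESTERN = '0123456789'
TO_WESTERN = str.maketrans(EASTERN, WESTERN)

def normalize_digits(text):
    """Replace every Eastern Arabic digit with its Western equivalent."""
    return text.translate(TO_WESTERN)

print(normalize_digits('١٩٩٥/٠٣/١٢'))  # -> 1995/03/12
```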
On Thursday, 13 August 2020 13:41:12 UTC+2, Anuradha B wrote:
>
> I am trying to extract the Arabic dates and numbers from a national ID
> card. I am using the following code in an Anaconda Jupyter notebook. I
> have also attached the image I used and the outputs from the grayscale,
> threshold, Canny, etc. steps, but none of the extracted text contains
> the dates and numerals. [I have also installed the Tesseract 4.0
> alpha.] Please suggest.
>
> import cv2
> import numpy as np
> import pytesseract
> from PIL import Image
> from matplotlib import pyplot as plt
>
> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>
> img = cv2.imread('image2.jpg')
>
> # get grayscale image
> def get_grayscale(image):
>     return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>
> # noise removal
> def remove_noise(image):
>     return cv2.medianBlur(image, 5)
>
> # thresholding
> def thresholding(image):
>     return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>
> # dilation
> def dilate(image):
>     kernel = np.ones((5, 5), np.uint8)
>     return cv2.dilate(image, kernel, iterations=1)
>
> # erosion
> def erode(image):
>     kernel = np.ones((5, 5), np.uint8)
>     return cv2.erode(image, kernel, iterations=1)
>
> # opening - erosion followed by dilation
> def opening(image):
>     kernel = np.ones((5, 5), np.uint8)
>     return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
>
> # canny edge detection
> def canny(image):
>     return cv2.Canny(image, 100, 200)
>
> # skew correction
> def deskew(image):
>     coords = np.column_stack(np.where(image > 0))
>     angle = cv2.minAreaRect(coords)[-1]
>     if angle < -45:
>         angle = -(90 + angle)
>     else:
>         angle = -angle
>     (h, w) = image.shape[:2]
>     center = (w // 2, h // 2)
>     M = cv2.getRotationMatrix2D(center, angle, 1.0)
>     rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
>                              borderMode=cv2.BORDER_REPLICATE)
>     return rotated
> # template matching
> def match_template(image, template):
>     return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
>
> image = cv2.imread('image2.jpg')
>
> gray = get_grayscale(image)
> thresh = thresholding(gray)
> opening = opening(gray)
> canny = canny(gray)
>
> text = pytesseract.image_to_string(image, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(gray, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(thresh, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(opening, lang='eng+ara')
> print(text)
> print('----------------------------------------------------------------')
> text = pytesseract.image_to_string(canny, lang='eng+ara')
> print(text)
>
> On Sunday, 12 July 2020 at 4:30:40 pm UTC+5:30 shree wrote:
>
>> What character are you trying to add?
>> Please share the training data to try and replicate the issue.
>>
>> On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My use case is Arabic documents. The pretrained ara.traineddata is
>>> good but not perfect, so I wish to fine-tune ara.traineddata; if the
>>> results are still not satisfying, I will train my own custom model.
>>>
>>> Please advise on the following:
>>>
>>> 1. For my Arabic text use case, the problem is one character that is
>>> always predicted wrong. Do I need to add the document's font
>>> (Traditional Arabic) and train? If so, please provide the procedure,
>>> or a link, for adding one font to the pretrained ara.traineddata.
>>> 2. Whether fine-tuning or training from scratch, how many gt.txt
>>> files do I need, and how many characters should be in each file? Any
>>> approximate iteration count, if you know?
>>> 3. For numbers, the prediction is totally wrong on Arabic numerals.
>>> Do I need to start from scratch, or fine-tune? Either way, how do I
>>> prepare datasets for this?
>>> 4. How do I decide max_iterations? Is there a ratio of dataset size
>>> to iterations?
>>>
>>> *Below are my trials:*
>>>
>>> *For Arabic numbers:*
>>>
>>> -> I tried to custom-train only Arabic numbers.
>>> -> I wrote a script to write 100,000 numbers into multiple gt.txt
>>> files, with hundreds of characters in each gt.txt file.
>>> -> Then another script to convert the text to images (text2image),
>>> which should look more like scanned images.
>>> -> Parameters used, in the order below:
>>>
>>> text2image --text test.gt.txt --outputbase /home/user/output
>>> --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial'
>>> --degrade_image false --rotate_image --exposure 2 --resolution 300
>>>
>>> 1. How much data do I need to prepare for Arabic numbers? As of now
>>> it is required only for 2 specific fonts, which I already have.
>>> 2. Will the dataset contain duplicates if I follow this procedure?
>>> If yes, is there a way to avoid it?
>>> 3. Is it better to create more gt.txt files with fewer characters in
>>> each (e.g. 50,000 gt files with 10 numbers per file) or fewer gt.txt
>>> files with more characters (e.g. 1,000 gt files with 500 numbers per
>>> file)?
>>>
>>> If possible, please guide me through the dataset-preparation
>>> procedure.
>>>
>>> For testing I tried 50,000 English numbers, with each number in its
>>> own gt.txt file (e.g. "2500" written to 2500.gt.txt), for 20,000
>>> iterations, but it fails.
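The gt.txt generation step being asked about (many small files versus fewer large ones) can be sketched in a few lines; the directory name, file count, and lines-per-file values below are illustrative assumptions, not tesstrain requirements:

```python
# Sketch: write random Eastern Arabic numbers into a small set of
# gt.txt files with many lines each. Counts and paths are illustrative.
import os
import random

EASTERN = '٠١٢٣٤٥٦٧٨٩'

def random_number(max_digits=6):
    """Return a random number rendered in Eastern Arabic digits."""
    n = random.randint(0, 10 ** max_digits - 1)
    return str(n).translate(str.maketrans('0123456789', EASTERN))

def write_gt_files(out_dir, n_files=100, lines_per_file=500):
    """Create n_files ground-truth files, one number per line."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(n_files):
        lines = [random_number() for _ in range(lines_per_file)]
        path = os.path.join(out_dir, f'{i:05d}.gt.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(lines) + '\n')

write_gt_files('numbers-ground-truth', n_files=10, lines_per_file=50)
```

Each generated file can then be fed to text2image as in the command above; duplicates across files are likely with random sampling, so deduplicating the generated lines first may be worthwhile.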
>>> *For Arabic text:*
>>>
>>> -> Prepared around 23k gt.txt files, each containing one sentence.
>>> -> Generated .box and small .tif files for all gt.txt files using
>>> one font (Traditional Arabic).
>>> -> Used the tesstrain repo and trained for 20,000 iterations.
>>> -> After training, generated foo.traineddata with a 0.03 error rate.
>>> -> Ran prediction on real data: it works perfectly for the
>>> particular character on which the pretrained ara.traineddata fails,
>>> but in overall accuracy the pretrained ara.traineddata performs
>>> better, except for that one character.
>>>
>>> *Summary:*
>>>
>>> - How can I fix one character in the pretrained ara.traineddata
>>> model? If that is not possible, how do I custom-train from scratch,
>>> or is there a way to annotate real images and prepare a dataset?
>>> Please suggest the best practice.
>>> - How do I prepare an Arabic number dataset and train on it? If
>>> custom training on numbers is not possible, can Arabic numbers be
>>> added to the pretrained model (ara.traineddata)?
>>>
>>> GitHub repo used for custom training of Arabic text and numbers:
>>> https://github.com/tesseract-ocr/tesstrain
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>>> send an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ae16a6eb-e697-40cf-b539-f33cbf876416o%40googlegroups.com.
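For the original question of pulling dates and numerals out of the recognized text, one option is to post-process the image_to_string output with a pattern that accepts both Western and Eastern Arabic digits. The date shape and minimum ID length below are illustrative guesses, not tuned to any specific national ID layout:

```python
import re

# Character class covering Western digits and Eastern Arabic digits
# (U+0660..U+0669); the surrounding patterns are illustrative only.
DIGIT = r'[0-9٠-٩]'
DATE_RE = re.compile(rf'{DIGIT}{{4}}\s*/\s*{DIGIT}{{1,2}}\s*/\s*{DIGIT}{{1,2}}')
ID_RE = re.compile(rf'{DIGIT}{{5,}}')  # runs of 5+ digits as candidate IDs

def extract_dates(text):
    """Return all year/month/day-style matches found in the text."""
    return DATE_RE.findall(text)

def extract_ids(text):
    """Return all long digit runs found in the text."""
    return ID_RE.findall(text)

print(extract_dates('issued ٢٠٢٠/٠٨/١٣'))  # -> ['٢٠٢٠/٠٨/١٣']
```

This kind of filtering also makes it easier to compare the five preprocessing variants above: run the extraction on each output and keep whichever variant yields the expected fields.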