Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

Eliyaz L Sun, 12 Jul 2020 05:32:14 -0700

The ligature "لا" (lam-alef) is always predicted as "ال" (alef-lam).
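
Since the two strings differ only in the order of lam and alef, one way to double-check what the ground-truth files actually contain is to look at the raw bytes. This is only a minimal sketch, assuming UTF-8 encoded .gt.txt files; the helper names hex_bytes and count_lines_with are made up for illustration and are not part of Tesseract or tesstrain.

```shell
# Hypothetical helper: print the UTF-8 bytes of a string as hex digits.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Hypothetical helper: count the lines that contain a literal sequence
# across the given files (prints 0 when nothing matches).
count_lines_with() {
  seq_to_find=$1
  shift
  cat -- "$@" | grep -cF -- "$seq_to_find" || true
}
```

For example, `hex_bytes 'لا'` prints d984d8a7 when the string is stored as lam (d9 84) followed by alef (d8 a7), and `count_lines_with 'لا' ./*.gt.txt` shows how many ground-truth lines contain that sequence.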

My training data is here:
<https://drive.google.com/drive/folders/18c1lIjObtBrG8DdtjFF4U5WHwZJeHN45?usp=sharing>
The documents I need to recognize use the Traditional Arabic font, available here:
<https://fontzone.net/font-details/traditional-arabic>


I used the shell loop below to generate the .tif and .box files from the .gt.txt files:

for i in $(seq -f "%06g" 006601 006798)
do
  echo "$i"
  # Render each ground-truth file with the Traditional Arabic font
  text2image --xsize 3600 --ysize 300 --text "$i.gt.txt" \
    --outputbase "/home/user/Desktop/$i" --font 'Traditional Arabic' \
    --fonts_dir /home/user/.local/share/fonts/
done
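
After a loop like the one above finishes, it may be worth checking that every ID really produced both a .tif and a .box file, for example in case text2image failed on some input. This is a minimal sketch only; missing_outputs is a hypothetical helper name, and the file naming (ID.tif and ID.box in the output directory) is assumed from the command above.

```shell
# Hypothetical helper: print every zero-padded ID between $2 and $3
# that lacks a .tif or a .box file in directory $1.
missing_outputs() {
  dir=$1
  for id in $(seq -f "%06g" "$2" "$3"); do
    if [ ! -f "$dir/$id.tif" ] || [ ! -f "$dir/$id.box" ]; then
      echo "$id"
    fi
  done
}
```

Running `missing_outputs /home/user/Desktop 006601 006798` prints nothing when all pairs exist; any ID it prints can be regenerated before training.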


Input Image:

[image: firstName.jpg]
 

On Sunday, July 12, 2020 at 2:00:40 PM UTC+3, shree wrote:
>
> What character are you trying to add?
> Please share the training data to try and replicate the issue.
>
>
> On Sun, Jul 12, 2020, 15:35 Eliyaz L wrote:
>
>> Hi,
>>
>>
>> My use case is Arabic documents. The pre-trained ara.traineddata is
>> good but not perfect, so I wish to fine-tune ara.traineddata; if the
>> results are not satisfactory, I will have to train my own custom model.
>>
>>
>> Please advise on the following:
>>
>>    1. For my Arabic text use case, the problem is one character that
>>    is always predicted wrongly. Do I need to add the document font
>>    (Traditional Arabic) and train? If so, please provide the procedure or
>>    a link for adding one font to the pre-trained ara.traineddata.
>>    2. Whether fine-tuning or training from scratch, how many gt.txt files
>>    do I need, and how many characters should each file contain? Also,
>>    approximately how many iterations, if you know?
>>    3. For numbers, the prediction is totally wrong on Arabic numbers. Do
>>    I need to start from scratch or fine-tune? In either case, how should
>>    I prepare the datasets?
>>    4. How do I decide max_iterations? Is there a ratio between dataset
>>    size and iterations?
>>
>>
>> *Below are my trials:*
>>
>>
>> *For Arabic Numbers:*
>>
>>
>> -> I tried custom training on Arabic numbers only.
>> -> I wrote a script to write 100,000 numbers into multiple gt.txt files,
>> with hundreds of characters in each gt.txt file.
>> -> Then I used a script to convert the text to images (text2image), which
>> should look more like scanned images.
>> -> The parameters were used in the order below.
>>
>> text2image --text test.gt.txt --outputbase /home/user/output --fonts_dir 
>> /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' --degrade_image 
>> false --rotate_image --exposure 2 --resolution 300
>>
>>    1. How large a dataset do I need to prepare for Arabic numbers? For now
>>    it is required only for 2 specific fonts, which I already have.
>>    2. Will the dataset contain duplicates if I follow this procedure? If
>>    so, is there any way to avoid that?
>>    3. Is it better to create more gt.txt files with fewer characters in
>>    each (e.g. 50,000 gt files with 10 numbers in each file) or fewer
>>    gt.txt files with more characters (e.g. 1000 gt files with 500 numbers
>>    in each file)?
>>
>> If possible, please guide me through the procedure for dataset preparation.
>>
>> For testing, I tried 50,000 English numbers, with each number in its own
>> gt.txt file (e.g. I wrote "2500" into 2500.gt.txt), with 20,000 iterations,
>> but it failed.
>>
>>
>> *For Arabic Text:*
>>
>>
>> -> Prepared around 23k gt.txt files, each containing one sentence.
>>
>> -> Generated .box and small .tif files for all gt.txt files using 1 font
>> (Traditional Arabic).
>>
>> -> Used the tesstrain repository and trained for 20,000 iterations.
>>
>> -> After training, generated foo.traineddata with a 0.03 error rate.
>>
>> -> Ran prediction on the real data. It works perfectly for the
>> particular character on which the pre-trained model (ara.traineddata)
>> fails, but in overall accuracy the pre-trained model (ara.traineddata)
>> performs better, except for that one character.
>>
>>
>>
>> *Summary:*
>>
>>
>>
>>    - How can I fix one character in the pre-trained model
>>    (ara.traineddata)? If that is not possible, how do I custom-train
>>    from scratch, or is there a way to annotate real images and prepare
>>    a dataset? Please suggest the best practice.
>>    - How do I prepare an Arabic number dataset and train on it? If custom
>>    training on numbers is not possible, can Arabic numbers be added to
>>    the pre-trained model (ara.traineddata)?
>>
>>  
>>
>> GitHub link used for custom training Arabic text and numbers: 
>> https://github.com/tesseract-ocr/tesstrain
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

