The letter "لا" is always predicted as "ال". My training data is here: <https://drive.google.com/drive/folders/18c1lIjObtBrG8DdtjFF4U5WHwZJeHN45?usp=sharing>. The documents I run prediction on use the Traditional Arabic font, available here: <https://fontzone.net/font-details/traditional-arabic>.
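To put a number on the swap, one option is to count how often each sequence occurs in a ground-truth line and in the matching OCR output. This is only a minimal sketch: the helper name count_sequence and the two demo files are hypothetical stand-ins, not part of the training setup described here.

```shell
#!/usr/bin/env bash
# Sketch: count how often a literal character sequence occurs in a text file.
# count_sequence and the demo files below are illustrative assumptions.
count_sequence() {
    local sequence="$1" file="$2"
    # grep -o prints every match on its own line; -F treats the sequence literally.
    # "|| true" keeps the pipeline going when grep finds no match.
    { grep -o -F -- "$sequence" "$file" || true; } | wc -l | tr -d ' '
}

# Hypothetical demo files standing in for a ground-truth line and its OCR output.
demo_dir="$(mktemp -d)"
printf '%s\n' 'لا' > "$demo_dir/truth.txt"
printf '%s\n' 'ال' > "$demo_dir/ocr.txt"

echo "occurrences in ground truth: $(count_sequence 'لا' "$demo_dir/truth.txt")"
echo "occurrences in OCR output:   $(count_sequence 'لا' "$demo_dir/ocr.txt")"
```

Running the same count over real .gt.txt files and the corresponding recognized text would show whether the sequence is missing from the training data or only lost at prediction time.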
I used the shell command below to generate the .tif and .box files from the .gt.txt files:

    # Render each ground-truth file 006601..006798 with the Traditional Arabic font.
    for i in $(seq -f "%06g" 006601 006798)
    do
        echo "$i"
        text2image --xsize 3600 --ysize 300 --text "$i.gt.txt" \
            --outputbase "/home/user/Desktop/$i" --font 'Traditional Arabic' \
            --fonts_dir /home/user/.local/share/fonts/
    done

Input image: [image: firstName.jpg]

On Sunday, July 12, 2020 at 2:00:40 PM UTC+3, shree wrote:
>
> What character are you trying to add?
> Please share the training data to try and replicate the issue.
>
> On Sun, Jul 12, 2020, 15:35 Eliyaz L <[email protected]> wrote:
>
>> Hi,
>>
>> My use case is Arabic documents. The pretrained ara.traineddata is good
>> but not perfect, so I want to fine-tune ara.traineddata and, if the
>> results are not satisfactory, train my own custom model.
>>
>> Please advise on the following:
>>
>> 1. In my Arabic text, the problem is one character that is always
>>    predicted wrongly. Do I need to add the document font (Traditional
>>    Arabic) and train? If so, please provide the procedure or a link for
>>    adding one font to the pretrained ara.traineddata.
>> 2. Whether fine-tuning or training from scratch, how many gt.txt files
>>    do I need, and how many characters should each file contain? Also,
>>    roughly how many iterations, if you know?
>> 3. For numbers, the prediction is completely wrong on Arabic numbers.
>>    Do I need to start from scratch or fine-tune? In either case, how do
>>    I prepare the datasets?
>> 4. How do I decide max_iterations? Is there a ratio between dataset
>>    size and iterations?
>>
>> *Below are my trials:*
>>
>> *For Arabic numbers:*
>>
>> -> I tried to custom train on Arabic numbers only.
>> -> I wrote a script to write 100,000 numbers into multiple gt.txt files,
>>    with hundreds of characters in each gt.txt file.
>> -> Then one script converts the text to images (text2image), which
>>    should look more like scanned images.
>> -> Parameters used, in the order below:
>>
>> text2image --text test.gt.txt --outputbase /home/user/output --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' --degrade_image false --rotate_image --exposure 2 --resolution 300
>>
>> 1. How large a dataset do I need to prepare for Arabic numbers? For now
>>    it is required only for 2 specific fonts, which I already have.
>> 2. Will the dataset contain duplicates if I follow this procedure? If
>>    yes, is there any way to avoid that?
>> 3. Is it better to create more gt.txt files with fewer characters in
>>    each (e.g. 50,000 gt files with 10 numbers each) or fewer gt.txt
>>    files with more characters (e.g. 1000 gt files with 500 numbers
>>    each)?
>>
>> If possible, please guide me through the procedure for dataset
>> preparation.
>>
>> For testing, I tried 50,000 English numbers, one number per gt.txt file
>> (e.g. "2500" written to 2500.gt.txt), with 20,000 iterations, but it
>> fails.
>>
>> *For Arabic text:*
>>
>> -> Prepared around 23k gt.txt files, each containing one sentence.
>>
>> -> Generated .box and small .tif files for all gt.txt files using 1 font
>>    (Traditional Arabic).
>>
>> -> Used the tesstrain repository and trained for 20,000 iterations.
>>
>> -> After training, generated foo.traineddata with a 0.03 error rate.
>>
>> -> Ran prediction on the real data. It works perfectly for the
>>    particular character on which the pretrained model (ara.traineddata)
>>    fails, but in overall accuracy the pretrained model (ara.traineddata)
>>    performs better, except for that one character.
>>
>> *Summary:*
>>
>>    - How do I fix one character in the pretrained model
>>      (ara.traineddata)? If that is not possible, how do I custom train
>>      from scratch, or is there a way to annotate real images and
>>      prepare a dataset? Please suggest the best practice.
>>    - How do I prepare an Arabic number dataset and train on it?
>>      If custom training on numbers is not possible, can Arabic numbers
>>      be added to the pretrained model (ara.traineddata)?
>>
>> GitHub link used for custom training on Arabic text and numbers:
>> https://github.com/tesseract-ocr/tesstrain
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com
>> .
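The number dataset described in the quoted message (random numbers written into many small .gt.txt files) could be generated along these lines. This is a sketch under stated assumptions, not the script used in the thread: the output directory name, the file count, the numbers per file, and the helper to_arabic_digits are all hypothetical, and it assumes the target text uses Arabic-Indic digits (٠ to ٩) with one line of text per .gt.txt file.

```shell
#!/usr/bin/env bash
# Sketch: write random Arabic-Indic numbers into numbered .gt.txt files.
# out_dir, file_count and numbers_per_file are illustrative assumptions.
set -euo pipefail

out_dir="${1:-numbers-ground-truth}"
file_count="${2:-5}"
numbers_per_file="${3:-10}"

# Arabic-Indic digits, indexed by their Western value (0-9).
arabic_digits=(٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)

# Convert a string of ASCII digits to Arabic-Indic digits, one digit at a time.
to_arabic_digits() {
    local input="$1" output="" i
    for (( i = 0; i < ${#input}; i++ )); do
        output+="${arabic_digits[${input:i:1}]}"
    done
    printf '%s' "$output"
}

mkdir -p "$out_dir"
for (( f = 1; f <= file_count; f++ )); do
    line=""
    for (( n = 0; n < numbers_per_file; n++ )); do
        # Combine two RANDOM draws so values cover 0..99999, not just 0..32767.
        number=$(( (RANDOM * 32768 + RANDOM) % 100000 ))
        line+="$(to_arabic_digits "$number") "
    done
    # One space-separated line per file; strip the trailing space.
    printf '%s\n' "${line% }" > "$out_dir/$(printf '%06d' "$f").gt.txt"
done
```

Because the numbers are drawn at random, the same value can appear more than once across files; if duplicates matter, the generated lines would need to be de-duplicated in a separate step. Each resulting file can then be passed to text2image in the same way as the loop earlier in this message.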

