What character are you trying to add? Please share the training data so that we can try to replicate the issue.
On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2eli...@gmail.com> wrote:

> Hi,
>
> My use case is Arabic documents. The pre-trained ara.traineddata is good but not perfect, so I wish to fine-tune ara.traineddata; if the results are not satisfactory, I will have to train my own custom model.
>
> Please advise on the following:
>
> 1. For my use case with Arabic text, the problem is one character that is always predicted wrong. Do I need to add the document font (Traditional Arabic) and train? If so, please provide the procedure or a link for adding one font to the pre-trained ara.traineddata.
> 2. Whether fine-tuning or training from scratch, how many gt.txt files do I need, and how many characters should each file contain? And approximately how many iterations, if you know?
> 3. For numbers, the prediction is totally wrong on Arabic numbers. Do I need to start from scratch or fine-tune? In either case, how do I prepare the datasets?
> 4. How do I decide max_iterations? Is there a ratio between dataset size and iterations?
>
> *Below are my trials:*
>
> *For Arabic numbers:*
>
> -> I tried to custom-train on Arabic numbers only.
> -> I wrote a script to write 100,000 numbers into multiple gt.txt files, with hundreds of characters in each gt.txt file.
> -> Then one script to convert text to image (text2image), which should look more like a scanned image.
> -> Parameters used, in the order below:
>
> text2image --text test.gt.txt --outputbase /home/user/output --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' --degrade_image false --rotate_image --exposure 2 --resolution 300
>
> 1. How large a dataset do I need to prepare for Arabic numbers? As of now it is required only for 2 specific fonts, which I already have.
> 2. Will the dataset contain duplicates if I follow this procedure? If yes, is there any way to avoid it?
> 3. Is it better to create more gt.txt files with fewer characters each (e.g. 50,000 gt files with 10 numbers in each) or fewer gt.txt files with more characters (e.g. 1,000 gt files with 500 numbers in each)?
>
> If possible, please guide me through the procedure for dataset preparation.
>
> For testing I tried 50,000 English numbers, with each number in its own gt.txt file (e.g. "2500" written to 2500.gt.txt), with 20,000 iterations, but it fails.
>
> *For Arabic text:*
>
> -> Prepared around 23k gt.txt files, each containing one sentence.
> -> Generated .box and small .tif files for all gt.txt files using 1 font (Traditional Arabic).
> -> Used the tesstrain repository and trained for 20,000 iterations.
> -> After training, generated foo.traineddata with a 0.03 error rate.
> -> Ran prediction on the real data. It works perfectly for the particular character on which the pre-trained model (ara.traineddata) fails, but in overall accuracy the pre-trained model performs better, except for that one character.
>
> *Summary:*
>
> - How can I fix one character in the pre-trained model (ara.traineddata)? If that is not possible, how do I custom-train from scratch, or is there a way to annotate real images and prepare a dataset? Please suggest the best practice.
> - How do I prepare an Arabic number dataset and train on it? If custom training on numbers is not possible, can Arabic numbers be added to the pre-trained model (ara.traineddata)?
>
> GitHub link used for custom training of Arabic text and numbers:
> https://github.com/tesseract-ocr/tesstrain
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com.
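
On the question quoted above about duplicate entries in the generated number dataset: a minimal sketch of one way to produce distinct number strings and write them out as ground-truth files. Everything here is an assumption for illustration, not the script from the original post: the helper names `unique_numbers` and `write_ground_truth`, the `num_00000.gt.txt` naming, and the reading of "Arabic numbers" as Arabic-Indic digits (U+0660 to U+0669). It also assumes each .gt.txt file holds a single text line, with `per_line` controlling how many numbers go on that line.

```python
import random
from pathlib import Path

# Translation table from ASCII digits to Arabic-Indic digits (U+0660..U+0669).
ARABIC_INDIC = {ord(str(d)): chr(0x0660 + d) for d in range(10)}


def unique_numbers(count, max_digits=8, seed=0):
    """Return `count` distinct digit strings, rendered in Arabic-Indic digits."""
    # Number of distinct digit strings of length 1..max_digits.
    capacity = sum(10 ** k for k in range(1, max_digits + 1))
    if count > capacity:
        raise ValueError("count exceeds the number of possible strings")
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    seen = set()               # the set is what rules out duplicates
    while len(seen) < count:
        length = rng.randint(1, max_digits)
        seen.add("".join(rng.choice("0123456789") for _ in range(length)))
    return [s.translate(ARABIC_INDIC) for s in sorted(seen)]


def write_ground_truth(numbers, out_dir, per_line=10):
    """Write one single-line .gt.txt file per group of `per_line` numbers."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(numbers), per_line):
        # Hypothetical naming scheme; any unique base name would do.
        path = out / f"num_{i // per_line:05d}.gt.txt"
        line = " ".join(numbers[i:i + per_line])
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_ground_truth(unique_numbers(100000), "ground-truth", per_line=10)` would give 10,000 files of ten numbers each. Each file could then be passed to text2image with the command already shown in the quoted message. Whether short or long lines train better is not something this sketch answers.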