Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

Shree Devi Kumar Sun, 12 Jul 2020 04:01:15 -0700

What character are you trying to add?
Please share the training data to try and replicate the issue.



On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2eli...@gmail.com> wrote:

> Hi,
>
>
> My use case is Arabic documents. The pretrained ara.traineddata is
> good but not perfect, so I want to fine-tune it; if the
> results are not satisfactory, I will have to train my own custom model.
>
>
> Please advise on the following:
>
>    1. In my Arabic text, one character is always predicted
>    incorrectly. Do I need to add the document font
>    (Traditional Arabic) and train? If so, please provide the procedure or
>    a link for adding one font to the pretrained ara.traineddata.
>    2. Whether fine-tuning or training from scratch, how many gt.txt files
>    do I need, and how many characters should each file contain? Also,
>    roughly how many iterations, if you know?
>    3. For numbers, the predictions on Arabic numbers are completely wrong.
>    Do I need to start from scratch or fine-tune? In either case, how
>    should I prepare the datasets?
>    4. How do I decide max_iterations? Is there a ratio between dataset
>    size and iterations?
>
>
> *Below are my **trials**:*
>
>
> *For Arabic Numbers:*
>
>
> -> I tried custom training on Arabic numbers only.
> -> I wrote a script that writes 100,000 numbers across multiple gt.txt
> files, with hundreds of characters in each gt.txt file.
> -> Then a script converts the text to images (text2image), which should
> look like scanned images.
> -> The parameters used, in the order below:
>
> text2image --text test.gt.txt --outputbase /home/user/output --fonts_dir
> /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' --degrade_image
> false --rotate_image --exposure 2 --resolution 300
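
The quoted text2image call renders one file; with many gt.txt files the same call can be built once per file. A minimal sketch, assuming each file is named like 2500.gt.txt and that the outputs should share that base name. The names text2image_command, render_all and extra_args are hypothetical, and only flags that appear in the quoted command are used; the two boolean flags can be passed through extra_args exactly as they are written above.

```python
import subprocess
from pathlib import Path

GT_SUFFIX = ".gt.txt"


def text2image_command(gt_file, fonts_dir, font, exposure=2, resolution=300,
                       extra_args=()):
    """Build the text2image argument list for one ground-truth file."""
    gt_file = str(gt_file)
    if not gt_file.endswith(GT_SUFFIX):
        raise ValueError(f"expected a {GT_SUFFIX} file, got {gt_file!r}")
    # Drop ".gt.txt" so the outputs share the base name of the text file.
    outputbase = gt_file[: -len(GT_SUFFIX)]
    return [
        "text2image",
        "--text", gt_file,
        "--outputbase", outputbase,
        "--fonts_dir", str(fonts_dir),
        "--font", font,
        "--exposure", str(exposure),
        "--resolution", str(resolution),
        *extra_args,  # any further flags, passed through unchanged
    ]


def render_all(gt_dir, fonts_dir, font):
    """Run text2image once for every *.gt.txt file in gt_dir."""
    for gt_file in sorted(Path(gt_dir).glob("*" + GT_SUFFIX)):
        cmd = text2image_command(gt_file, fonts_dir, font)
        subprocess.run(cmd, check=True)  # raises if text2image exits non-zero
```

Calling render_all once per font would cover the two fonts mentioned in the message.
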
>
>    1. How much data do I need to prepare for Arabic numbers? For now it is
>    required only for 2 specific fonts, which I already have.
>    2. Will the dataset contain duplicates if I follow this procedure? If
>    yes, is there any way to avoid them?
>    3. Is it better to create more gt.txt files with fewer characters
>    in each (e.g. 50,000 gt files with 10 numbers each) or fewer gt.txt
>    files with more characters (e.g. 1000 gt files with 500 numbers
>    each)?
>
> If possible, please guide me through the procedure for dataset preparation.
>
> For testing I tried 50,000 English numbers, with one number per gt.txt file
> (e.g. "2500" written to 2500.gt.txt), with 20,000 iterations, but it
> failed.
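
On the duplicate question: drawing the numbers without replacement avoids repeats by construction. A minimal sketch, assuming "Arabic numbers" means the Arabic-Indic digits U+0660 to U+0669 and that each gt.txt file holds a single line of text (tesstrain expects single-line transcriptions). All function names and the file-name prefix are hypothetical.

```python
import random
from pathlib import Path


def to_arabic_indic(number):
    """Map a non-negative integer to Arabic-Indic digits (U+0660..U+0669)."""
    return "".join(chr(0x0660 + int(d)) for d in str(number))


def make_number_lines(num_files, numbers_per_line, max_value, seed=0):
    """Return num_files lines of space-separated numbers with no repeats."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no number appears twice.
    numbers = rng.sample(range(max_value), num_files * numbers_per_line)
    lines = []
    for i in range(num_files):
        chunk = numbers[i * numbers_per_line:(i + 1) * numbers_per_line]
        lines.append(" ".join(to_arabic_indic(n) for n in chunk))
    return lines


def write_gt_files(lines, out_dir, prefix="num"):
    """Write one single-line .gt.txt file per entry in lines."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, line in enumerate(lines):
        path = out / f"{prefix}_{i:06d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
```

Changing num_files and numbers_per_line makes it easy to try both layouts from question 3 with the same pool of numbers.
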
>
>
> *For Arabic Text:*
>
>
> -> Prepared around 23k gt.txt files, each containing one sentence.
>
> -> Generated .box and small .tif files for all gt.txt files using 1 font
> (Traditional Arabic).
>
> -> Used the tesstrain repository and trained for 20,000 iterations.
>
> -> After training, generated foo.traineddata with a 0.03 error rate.
>
> -> Ran prediction on the real data. It works perfectly for the
> particular character on which the pretrained model (ara.traineddata) fails,
> but in overall accuracy the pretrained model (ara.traineddata) performs
> better, except for that one character.
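
One way to relate max_iterations to dataset size is to express it in passes over the data. A rough sketch, under the assumption that one training iteration processes one line image; that assumption is not stated anywhere in this thread, and the helper names are hypothetical.

```python
def approx_epochs(max_iterations, num_training_lines):
    """Approximate number of passes over the training lines.

    Assumes one iteration consumes one training line.
    """
    if num_training_lines <= 0:
        raise ValueError("num_training_lines must be positive")
    return max_iterations / num_training_lines


def iterations_for_epochs(epochs, num_training_lines):
    """Iterations needed for the given number of passes, same assumption."""
    return epochs * num_training_lines


# Figures from the quoted message: about 23k lines, 20,000 iterations.
print(round(approx_epochs(20_000, 23_000), 2))
```

Under that assumption, 20,000 iterations on about 23k lines is less than one full pass over the data.
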
>
>
>
> *Summary:*
>
>
>
>    - How can I fix one character in the pretrained
>    model (ara.traineddata)? If that is not possible, how do I custom
>    train from scratch, or is there a way to annotate real images and
>    prepare a dataset? Please suggest the best practice.
>    - How do I prepare an Arabic number dataset and train on it? If custom
>    training on numbers is not possible, can Arabic numbers be added to the
>    pretrained model (ara.traineddata)?
>
>
>
> GitHub link used for custom training Arabic text and numbers:
> https://github.com/tesseract-ocr/tesstrain
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com
> .
>

