Re: [tesseract-ocr] Fine tuning existing model

Lorenzo Bolzani Mon, 02 Jul 2018 03:24:28 -0700

Hi Shree,
I replaced the line:

 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"


with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"

(I write this in case someone else is following this thread).

And now I have a brand new fine-tuned model with only the characters I
need. Nice :)

For the training I'm using actual crops from the documents I need to OCR,
painstakingly hand-labeled.

I'm still trying to figure out the right number of iterations. I've seen
that there is a train/eval split; I've set it to 80/20.
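
For anyone who wants to reproduce such a split by hand, here is a minimal
sketch. It is not the code from the makefile discussed in this thread: the
function name split_list and the file names all-lstmf and list.train are
assumptions for illustration, and only data/list.eval appears in this message.

```shell
#!/usr/bin/env bash
# Sketch: split a list of training files into train and eval lists.
# split_list is a hypothetical helper, not part of any Tesseract tooling.
# Usage: split_list ALL_LIST TRAIN_LIST EVAL_LIST [TRAIN_PERCENT]
split_list() {
  local all="$1" train_list="$2" eval_list="$3" percent="${4:-80}"
  local total n_train
  total=$(wc -l < "$all")
  # Integer arithmetic: the train share is rounded down.
  n_train=$(( total * percent / 100 ))
  # The first n_train lines go to training, the remainder to evaluation.
  # Line order is preserved, so shuffle the input first if it is sorted.
  head -n "$n_train" "$all" > "$train_list"
  tail -n "$(( total - n_train ))" "$all" > "$eval_list"
}

# Example call (paths are assumptions):
# split_list data/all-lstmf data/list.train data/list.eval 80
```

With 10 entries and an 80% train share, this puts the first 8 entries in the
train list and the last 2 in the eval list.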

I trained for 300/600/1000/5000/7500/10000 iterations and checked each model with:

lstmeval --model export/$1.traineddata --eval_listfile data/list.eval 2>&1 | grep iteration

and I see that the eval error keeps going down, with a big drop from
1.17 to 0.5 between 7500 and 10000 iterations. My characters are very noisy
and irregular, and my lines are very short, 1 to 4 words at most. Maybe this
is why I need more iterations.
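
To compare several exported models in one go, the lstmeval command above can
be wrapped in a loop. This is only a sketch: the function name eval_models,
the LSTMEVAL override variable, and the model names in the example call are
assumptions, while the lstmeval flags and paths are the ones used above.

```shell
#!/usr/bin/env bash
# Sketch: run the same lstmeval check on several exported models.
# LSTMEVAL may be overridden to point at a specific binary (assumption);
# by default the lstmeval found on PATH is used.
LSTMEVAL="${LSTMEVAL:-lstmeval}"

# eval_models is a hypothetical helper. Each argument is a model name,
# expected to exist as export/NAME.traineddata.
eval_models() {
  local name
  for name in "$@"; do
    printf '%s: ' "$name"
    # Same command as in this message; keep only the summary line.
    "$LSTMEVAL" --model "export/${name}.traineddata" \
                --eval_listfile data/list.eval 2>&1 \
      | grep iteration || echo "(no iteration line found)"
  done
}

# Example call (model names are assumptions):
# eval_models my_7500 my_10000
```

Each output line is prefixed with the model name, so the error rates of
different checkpoints can be read side by side.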

I'm fine-tuning from Italian, the language of my documents. I'll try eng
too to see if it works better. Now that the pipeline is in place it's easy
to try different options.


Thank you for your help so far.


Bye

Lorenzo


2018-06-30 6:18 GMT+02:00 Shree Devi Kumar <shreesh...@gmail.com>:

> > The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files.
> > Now I can run your script directly.
>
> Oh, I remember now. I had changed that for ease in renaming files for some
> reason.
>
> > In this way can I train a model that, for example, only recognize
> > uppercase characters, or numbers, simply by providing only uppercase
> > training data? Or is there something else to configure?
>
> You could try fine-tuning from English. Remove the line regarding the merge
> of unicharsets from my makefile (use the command from the original script).
> 300 iterations should be enough as you are not adding any characters. Try to
> have a training text which resembles the kind of words that you expect to
> OCR.

