Dear Lorenzo,

thank you for your input; it is very much appreciated. I will go through 
your suggestions, because I have some questions and points to clarify.

> This thread about the font size is where I got the 30/40px indication:
>
>
> https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ
>
> For my trainings (fine tuning) I used 48px (with 2px of white border, so 
> text was about 44), maybe the size does not matter much if you do fine 
> tuning, but I never did a precise comparison. Maybe 48 is even better. The 
> white border probably was not important.
>
> One thing to keep in mind is that IMO there is not THE correct way to 
> train because different fonts or different types of images (contrast, 
> noise, etc.) may work best with different parameters. So you need to 
> experiment a little with these if you want optimal results.
>
> This leads to the most important part: am I done training? Without this 
> you are just wasting time.
>

I don't exactly get what you wanted to point out, but the link to the 
source of the picture specification helps, and I will try to digest it too.


> What I describe in this post is not completely correct due to the way ocrd 
> works (I should discuss this on GitHub to see if it should be fixed or not).
>
>
> https://groups.google.com/forum/#!msg/tesseract-ocr/COJ4IjcrL6s/C1OeE9bWBgAJ
>
> The basic idea of any machine learning training is this: split the data in 
> two parts, use one for training and use the other to check the result. The 
> idea is that if you train too much only on a few things you get 
> exceptionally good on these, but you overspecialize and get worse at all 
> the rest (this is called overfitting). So you get 99.999% accuracy on the 
> training set and 74% on the eval set and real-world data, which is what 
> really matters (real world is usually a little worse than eval).
>
> The problem I found is that ocrd recreates the files list.train and 
> list.eval every time you run it (it was not designed for incremental 
> training I think). So, if you follow my instructions, you'll mix the train 
> and eval files and this is bad. 
>
> So I modified the ocrd Makefile to create these two files explicitly at 
> the beginning of the training (and only once).
>
> This is the edit (about line 80):
>
> # Create lists of lstmf filenames for training and eval
> #lists: $(ALL_LSTMF) data/list.train data/list.eval
> lists: $(ALL_LSTMF)
>
> train-lists: data/list.train data/list.eval
>
> Now you need to call "make train-lists" only once when you start a new 
> training session with new data (not after each "iteration step").
>

Thanks for writing the train/eval split down; I had the concept, I just 
couldn't put it into proper words.
Thank you for fixing the Makefile. I will include this in my documentation 
for sure.


> Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90); if you have some 
> data (1000-10000 samples) do an 80/20 split. If you have a ton of data 
> (100k+ samples), 90/10 or even 95/5 may be fine.

 
This is super useful info. 
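For the guide, I might illustrate that split with a small sketch (this is my own Python illustration of what the RATIO_TRAIN variable controls, not the actual Makefile code):

```python
import random

def split_lstmf_list(files, ratio_train=0.90, seed=42):
    """Shuffle the .lstmf sample files once, then cut the list into a
    train part and an eval part. The eval part must not be regenerated
    during the same training session."""
    files = sorted(files)                  # deterministic starting order
    random.Random(seed).shuffle(files)     # fixed seed: reproducible split
    n_train = int(len(files) * ratio_train)
    return files[:n_train], files[n_train:]

samples = [f"data/sample_{i:04d}.lstmf" for i in range(100)]
train, eval_ = split_lstmf_list(samples)
print(len(train), len(eval_))  # 90 10
```

The fixed seed is the point: rerunning the script gives the same split, which is exactly what your Makefile fix achieves by creating list.train/list.eval only once.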


> About PSM: I did my training with PSM 6, but for one model (the most 
> complex one, out of 8) I found that using PSM 13 when doing the 
> recognition gives better results for punctuation and other special 
> characters.
> Again, I do not know how much difference the PSM param makes during 
> training. From what I understand PSM 6 does some custom 
> cleanup/preprocessing to the images, PSM 13 leaves them untouched 
> (completely?).
>

I read the same thing, that 13 (PSM.RAW_LINE) is the most efficient one for 
training, and I am somewhat sure that 13 leaves the images untouched (it 
wasn't me who researched the segmentation modes, but he says it just takes 
the "rawest" form of the line).
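To compare the two modes for the guide, I would build the calls like this (the helper name is mine; --psm is the real Tesseract 4 flag):

```python
def tesseract_cmd(image, out_base, lang="eng", psm=6):
    """Build a Tesseract 4 command line. --psm 6 assumes a single uniform
    block of text; --psm 13 ("raw line") bypasses page segmentation and,
    as far as I understand, most of the internal preprocessing."""
    return ["tesseract", image, out_base, "-l", lang, "--psm", str(psm)]

print(" ".join(tesseract_cmd("line.png", "line", psm=13)))
# tesseract line.png line -l eng --psm 13
```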

 

> About the parameters you listed in your post: I know the meaning of a few 
> of them, but I think that in general they are quite useless (or you need 
> to understand more to mess with them). What I mostly refer to is the 
> output from lstmeval. char train and word train are the recognition 
> errors; these are probably the only ones to look at as a reference (but 
> they refer to the training data, not the eval data). best char error is 
> the best so far; the training is noisy and goes up and down. delta is 
> probably the variation from the previous output, and rms is the root mean 
> square of something. In other words, you do not really need to understand 
> all of them to do the training.


Yes, they are mostly useless, but I am writing documentation, and if I say 
"include this flag or that variable", then I would like to include a 
definition for that flag or parameter. I am mostly interested in three 
questions for each of the variables and flags I pointed out:

   - What does this file look like?
   - What does it do?
   - How can I create it?

My problem with lstmeval is mostly a small confusion I just want to clear 
up. For example: char train and word train, if they are high, that means 
there are a lot of errors, right? (The same goes for best char error.)
Oh, and those outputs you said I don't need for training (like rms): I 
would still like to know what they are, even if I only get one confusing 
sentence each, because there should be a definition for everything.


> One iteration means one image, so max_iterations should be at least equal 
> to your number of images. If you have a ton of images you may see that you 
> do not need to process all of them to reach the "saturation" point where 
> extra training is useless, but normally you want to process all of them 
> even a few times (until the eval score stabilizes or gets worse for a few 
> iterations).


Thank you for writing this down, because I came to the same conclusion, and 
it is just nice to hear it from you. But my question was actually referring 
to the lstmeval output.
It prints an iteration number like this: iteration 31/100/100. Can you tell 
me what the three numbers represent?
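While waiting for the answer, this is how I currently pull the numbers out of such a line (the example line and my reading of the three counters are assumptions based on my own lstmtraining logs, so please correct me if they are wrong):

```python
import re

# Example progress line (format copied from my own lstmtraining logs):
LINE = ("At iteration 31/100/100, Mean rms=2.341%, delta=5.221%, "
        "char train=11.432%, word train=23.456%, skip ratio=0.1%, "
        "New best char error = 11.432 wrote checkpoint.")

PATTERN = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=([\d.]+)%, word train=([\d.]+)%, skip ratio=([\d.]+)%")

m = PATTERN.search(LINE)
learning_it, training_it, sample_it = (int(g) for g in m.groups()[:3])
rms, delta, char_train, word_train, skip = (float(g) for g in m.groups()[3:])

# My unverified reading of the triple: sample_it counts images fed in,
# training_it those that were actually backpropagated, and learning_it
# those that produced a significant weight change.
print(learning_it, sample_it, char_train)  # 31 100 11.432
```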


> One note: if you repeat the whole training multiple times (for example 
> trying different image sizes) you need to keep the list.train/eval files 
> aside, otherwise you compare against a different set of eval images (and 
> with a small dataset this can make a big difference).


Good note. This warning definitely belongs in the newbie guide.


> Another note: while you fine-tune (specialize) on new "font(s)" you get a 
> little worse on all the others. If you care about other fonts too, you 
> should check them with lstmeval as well.


Very good note. I am planning to make the training overview longer by 
adding a section that just talks about the mechanics of training (things 
like what the train/eval ratio should be, or how many iterations to run).
I know that there is no exact answer of the form "this is best for that". 
But while doing research I found a lot of advice that was very much tied to 
specific setups, and I will try to collect a few of these just to give a 
nice example of how you should think about your training.
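For that mechanics section, the "train until the eval score stabilizes or gets worse" rule could be shown with a tiny stopping-criterion sketch (the helper and the error history are hypothetical, just to make the idea concrete):

```python
def should_stop(eval_errors, patience=3, min_improvement=0.1):
    """Return True once the eval char error has not improved by at least
    `min_improvement` points over the last `patience` checkpoints."""
    if len(eval_errors) <= patience:
        return False
    best_before = min(eval_errors[:-patience])   # best error seen earlier
    recent_best = min(eval_errors[-patience:])   # best in the recent window
    return recent_best > best_before - min_improvement

# Hypothetical eval char-error history (%), one entry per checkpoint:
history = [18.0, 12.5, 9.1, 7.8, 7.4, 7.5, 7.6, 7.5]
print(should_stop(history))  # True: the last 3 checkpoints show no real gain
```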
----
So my further plans are simple:

   - Rework most things in the wiki (this is a general goal)
   - Add more flavour text in certain places (this will require testing the 
   guide on actual people; I have monkeys for testing my guide, but I 
   wouldn't mind if somebody on the forum tried it and gave feedback like 
   you did, Lorenzo)
   - Collect general errors and common mistakes

Once again, thank you for your input; I am eagerly awaiting your reply, 
Lorenzo.

On Thursday, 7 February 2019 at 13:26:49 UTC+1, Lorenzo Blz wrote:
>
> Hi Kristof,
> good work, I thought about it a few times. I gave a quick look, just a 
> couple of quick notes, I'll try to read it better when I get time.
>
> Bye
>
> Lorenzo
>
> On Thu, 7 Feb 2019 at 09:36 Kristóf Horváth 
> <[email protected]> wrote:
>
>> Hi, I set out to make a newbie-friendly guide, and I already have some 
>> stuff that might help people, but it is not complete yet. I would like 
>> people to read it and help out with comments where they can. I left 
>> places empty or left notes of my own; please feel free to figure out what 
>> should be there. I really hope I didn't make big mistakes, but in case I 
>> did write something stupid, please share it in the form of constructive 
>> criticism. The following things are very unclear to me (in terms of what 
>> they exactly represent):
>>
>>    - radical-stroke.txt
>>    - learning_rate
>>    - noextract_font_properties
>>    - 2 percent improvement
>>    - time=
>>    - best error was 100 @0
>>    - iteration 31/100/100
>>    - rms=
>>    - delta=
>>    - char train=
>>    - word train=
>>    - skip ratio=
>>    - best char error=
>>
>> And finally, here is the link 
>> <https://docs.google.com/document/d/1qDqbnlptcCPVIvMOHwfNws-CQat-llZLOTHC6S94Vec/edit?usp=sharing>.
>> (The Google Doc should be in English. I am writing a wiki, so formatting 
>> is based on wiki syntax; with the link you should be able to make 
>> comments.)
>> In case you are really enthusiastic about it you can contact me for write 
>> rights.
>>
>> Best Regards
>> Kristof Horvath
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/65c42f6a-3463-4290-905c-9dcc2d9caada%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>

