Thank you Shree, that helps.

On Thursday, February 7, 2019 at 17:31:24 UTC+1, shree wrote:
>
> >> iteration 31/100/100
>
> See
> https://github.com/tesseract-ocr/tesseract/blob/3a7f5e4de459f4c64f36e08b18ce1b66b1fbc876/src/lstm/lstmtrainer.cpp#L410
>
> // Appends <intro_str> iteration learning_iteration()/training_iteration()/
> // sample_iteration() to the log_msg.
> void LSTMTrainer::LogIterations(const char* intro_str, STRING* log_msg) const {
>   *log_msg += intro_str;
>   log_msg->add_str_int(" iteration ", learning_iteration());
>   log_msg->add_str_int("/", training_iteration());
>   log_msg->add_str_int("/", sample_iteration());
> }
>
> >> radical-stroke.txt
>
> See
> https://github.com/tesseract-ocr/tesseract/blob/5fdaa479da2c52526dac1281871db5c4bdaff359/src/training/lang_model_helpers.h#L49
>
> // If pass_through is true, then the recoder will be a no-op, passing the
> // unicharset codes through unchanged. Otherwise, the recoder will "compress"
> // the unicharset by encoding Hangul in Jamos, decomposing multi-unicode
> // symbols into sequences of unicodes, and encoding Han using the data in the
> // radical_table_data, which must be the content of the file:
> // langdata/radical-stroke.txt.
>
> Even though it is only used for training Han languages, tesseract gives an
> error for other languages too if the file is not found.
>
> On Thu, Feb 7, 2019 at 9:38 PM Kristóf Horváth <[email protected]> wrote:
>
>> Thanks, shree. I will check it out tomorrow, but please, can you give some
>> personal feedback?
>> Also, I left out training from scratch because it requires a serious
>> amount of sample data and a newbie won't have that, but I will definitely
>> dig into this guide.
>>
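A quick sketch of what that LogIterations() format boils down to; the comments on each counter are my reading of the lstmtrainer.h source comments, not something stated in this thread, so treat them as assumptions:

```shell
#!/bin/sh
# Sketch only: reproduces the "iteration 31/100/100" log format.
# Counter meanings below are assumptions taken from lstmtrainer.h comments.
learning_iteration=31    # iterations that yielded a non-zero delta (real learning)
training_iteration=100   # total backward training steps run so far
sample_iteration=100     # index into the training sample set
printf 'At iteration %d/%d/%d\n' \
  "$learning_iteration" "$training_iteration" "$sample_iteration"
# prints: At iteration 31/100/100
```

So in "31/100/100", the first number lags behind the other two whenever a training step produced no measurable improvement.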
>> On Thursday, February 7, 2019 at 16:43:11 UTC+1, shree wrote:
>>>
>>> You may want to see the following guide (found using a Google search):
>>>
>>> https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch
>>>
>>> On Thu, 7 Feb 2019, 19:44 Kristóf Horváth <[email protected]> wrote:
>>>
>>>> Dear Lorenzo,
>>>>
>>>> thank you for your input, it is very much appreciated. I will go through
>>>> your suggestions, because I have questions and points to clarify.
>>>>
>>>>> This thread about the font size is where I got the 30/40px indication:
>>>>>
>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ
>>>>>
>>>>> For my trainings (fine tuning) I used 48px (with a 2px white border, so
>>>>> the text was about 44px); maybe the size does not matter much if you do
>>>>> fine tuning, but I never did a precise comparison. Maybe 48 is even
>>>>> better. The white border probably was not important.
>>>>>
>>>>> One thing to keep in mind is that IMO there is not THE correct way to
>>>>> train, because different fonts or different types of images (contrast,
>>>>> noise, etc.) may work best with different parameters. So you need to
>>>>> experiment a little with these if you want optimal results.
>>>>>
>>>>> This leads to the most important part: "Am I done training?" Without
>>>>> this you are just wasting time.
>>>>
>>>> I don't exactly get what you wanted to point out, but the link to the
>>>> source of the picture specification helps and I will try to digest it too.
>>>>
>>>>> What I describe in this post is not completely correct due to the way
>>>>> ocrd works (I should discuss this on github to see if it should be
>>>>> fixed or not).
>>>>>
>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/COJ4IjcrL6s/C1OeE9bWBgAJ
>>>>>
>>>>> The basic idea of any machine learning training is this: split the data
>>>>> in two parts, use one for training and use the other to check the
>>>>> result. The idea is that if you train too much on only a few things you
>>>>> get exceptionally good on these, but you overspecialize and get worse
>>>>> at all the rest (this is called overfitting). So you get 99.999%
>>>>> accuracy on the training set and 74% on the eval set and on real world
>>>>> data, which is what really matters (real world is usually a little
>>>>> worse than eval).
>>>>>
>>>>> The problem I found is that ocrd recreates the files list.train and
>>>>> list.eval every time you run it (it was not designed for incremental
>>>>> training, I think). So, if you follow my instructions, you'll mix the
>>>>> train and eval files, and this is bad.
>>>>>
>>>>> So I modified the ocrd Makefile to create these two files explicitly at
>>>>> the beginning of the training (and only once).
>>>>>
>>>>> This is the edit (about line 80):
>>>>>
>>>>> # Create lists of lstmf filenames for training and eval
>>>>> #lists: $(ALL_LSTMF) data/list.train data/list.eval
>>>>> lists: $(ALL_LSTMF)
>>>>>
>>>>> train-lists: data/list.train data/list.eval
>>>>>
>>>>> Now you need to call "make train-lists" only once, when you start a new
>>>>> training session with new data (not after each "iteration step").
>>>>
>>>> Thanks for writing the train/eval part down; I had the concept, it's
>>>> just that I couldn't put it into proper words.
>>>> Thank you for fixing the Makefile. I will include this in my
>>>> documentation for sure.
>>>>
>>>>> Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have
>>>>> some data (1000/10000 samples) do an 80/20. If you have a ton of data
>>>>> (100k+ samples) 90/10 or even 95/5 may be fine.
>>>>
>>>> This is super useful info.
>>>>
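The one-time split that "make train-lists" is meant to produce can be sketched as a small shell script. The data/ paths, the demo file names, and the truncating division are illustrative assumptions here, not ocrd's exact recipe:

```shell
#!/bin/sh
# Sketch only: split .lstmf files into list.train / list.eval exactly once.
# Paths, demo file names, and rounding are assumptions, not ocrd's recipe.
RATIO_TRAIN=0.80                                  # 80/20 for a small data set

mkdir -p data
for i in $(seq 1 10); do : > "data/sample$i.lstmf"; done   # demo files only

ls data/*.lstmf | sort > data/all-lstmf
total=$(wc -l < data/all-lstmf)
train=$(awk -v t="$total" -v r="$RATIO_TRAIN" 'BEGIN { printf "%d", t * r }')

head -n "$train" data/all-lstmf          > data/list.train
tail -n +"$((train + 1))" data/all-lstmf > data/list.eval
```

With the 10 demo files this leaves 8 names in list.train and 2 in list.eval. The point is that the split happens once, so later incremental runs keep evaluating against the same eval set instead of silently mixing train and eval files.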
>>>>> About PSM: I did my training with PSM 6, but for one model (the most
>>>>> complex one, out of 8) I found that using PSM 13 when doing the
>>>>> recognition gives better results for punctuation and other special
>>>>> characters.
>>>>> Again, I do not know how much difference the PSM param makes during
>>>>> training. From what I understand, PSM 6 does some custom
>>>>> cleanup/preprocessing on the images, while PSM 13 leaves them untouched
>>>>> (completely?).
>>>>
>>>> I read the same thing, that 13 (PSM.RAW_LINE) is the most efficient one
>>>> for training, and I am somewhat sure that 13 leaves them untouched (it
>>>> wasn't me who researched the segmentation modes, but he says it just
>>>> takes the "rawest" form of the line).
>>>>
>>>>> About the parameters you listed in your post: I know the meaning of a
>>>>> few of them, but I think that in general they are quite useless (or you
>>>>> need to understand more to mess with them). What I mostly refer to is
>>>>> the output from lstmeval. char train and word train are the errors on
>>>>> the recognition; these are probably the only ones to look at as a
>>>>> reference (but they refer to the training data, not the eval data).
>>>>> best char error is the best so far; the training is noisy and goes up
>>>>> and down. delta is probably the variation from the previous output, and
>>>>> rms is the root mean square of something. In other words, you do not
>>>>> really need to understand all of them to do the training.
>>>>
>>>> Yes, they are mostly useless, but I'm writing documentation, and if I
>>>> say "include this flag or that variable" then I would like to include a
>>>> definition for that flag or parameter. I am mostly interested in 3
>>>> questions concerning the variables and flags I pointed out:
>>>>
>>>> - What does this file look like?
>>>> - What does it do?
>>>> - How can I create it?
>>>>
>>>> My problem with lstmeval is mostly a small confusion I just want to
>>>> clarify.
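As an aside on those log fields: a training checkpoint line can be picked apart with a couple of sed calls. The sample line below only mimics the usual lstmtraining checkpoint format, and every value in it is invented:

```shell
#!/bin/sh
# Sketch only: extract the error fields from an lstmtraining-style log line.
# The sample line mimics the usual checkpoint format; all values are invented.
line='At iteration 31/100/100, Mean rms=3.2%, delta=12.1%, char train=8.5%, word train=22.1%, skip ratio=0%, New best char error = 8.5'

char_err=$(printf '%s\n' "$line" | sed -n 's/.*char train=\([0-9.]*\)%.*/\1/p')
word_err=$(printf '%s\n' "$line" | sed -n 's/.*word train=\([0-9.]*\)%.*/\1/p')

# Both fields are error rates, so higher values mean more mistakes.
echo "char error: ${char_err}%  word error: ${word_err}%"
```

Since char train and word train are error rates on the training data, a value that trends down over checkpoints is what you want to see.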
>>>> For example: char train and word train, if they are high it means that
>>>> there are a lot of errors, right? (The same goes for best char error.)
>>>> Oh, and those outputs you said I don't need for training (like rms): I
>>>> would still like to know what they are, even if I only get one confusing
>>>> sentence each, because there should be a definition for them.
>>>>
>>>>> One iteration means one image, so max_iterations should be at least
>>>>> equal to your number of images. If you have a ton of images you may see
>>>>> that you do not need to process all of them to reach the "saturation"
>>>>> point where extra training is useless, but normally you want to process
>>>>> all of them, even a few times (until the eval score stabilizes or gets
>>>>> worse for a few iterations).
>>>>
>>>> Thank you for writing this down, because I came to the same conclusion
>>>> and it's just nice to hear it from you. But my question was actually
>>>> referring to the lstmeval output. It prints an iteration number like
>>>> this: iteration 31/100/100. So can you tell me what the 3 numbers
>>>> represent?
>>>>
>>>>> One note: if you repeat the whole training multiple times (for example
>>>>> trying different image sizes) you need to keep the list.train/eval
>>>>> files aside, otherwise you compare against a different set of eval
>>>>> images (and with a little data set this can make a big difference).
>>>>
>>>> Good note. This warning definitely belongs in the newbie guide.
>>>>
>>>>> Another note: while you fine tune (specialize) on a new "font(s)" you
>>>>> get a little worse on all the others. If you care about other fonts
>>>>> too, you should check on them with lstmeval too.
>>>>
>>>> Very good note. I am planning to make the Overview about training longer
>>>> by adding a section that just talks about the mechanics of training.
>>>> (Things like what the ratio for train/eval should be, or how many
>>>> iterations.)
>>>> I know that there is no exact answer, no "this is the best for this". I
>>>> know, but as I was doing research I found a lot of advice that held true
>>>> for specific setups, and I will try to collect a few of these just to
>>>> give a nice example of how you should think about your training.
>>>> ----
>>>> So my further plans are simple:
>>>>
>>>> - rework most things in the wiki (this is a general goal)
>>>> - add more flavour text in certain places (this will require testing the
>>>>   guide on actual people; I have monkeys for testing my guide, but I
>>>>   wouldn't mind if somebody on the forum tried it and gave feedback like
>>>>   you did, Lorenzo)
>>>> - collect general errors and common mistakes
>>>>
>>>> Once again, thank you for your input, and I am eagerly waiting for your
>>>> reply, Lorenzo.
>>>>
>>>> On Thursday, February 7, 2019 at 13:26:49 UTC+1, Lorenzo Blz wrote:
>>>>>
>>>>> Hi Kristof,
>>>>> good work, I thought about it a few times. I gave it a quick look; just
>>>>> a couple of quick notes, I'll try to read it better when I get time.
>>>>>
>>>>> Bye
>>>>>
>>>>> Lorenzo
>>>>>
>>>>> On Thu, Feb 7, 2019 at 09:36 Kristóf Horváth <[email protected]> wrote:
>>>>>
>>>>>> Hi, I set out to make a newbie-friendly guide and I already have some
>>>>>> stuff that might help people, but it's not complete yet. I would like
>>>>>> people to read it and help out with comments where they can. I left
>>>>>> places empty or left notes of my own; please feel free to figure out
>>>>>> what should be there. I really hope I didn't make big mistakes, but in
>>>>>> case I did write something stupid, please share it in the form of
>>>>>> constructive criticism.
>>>>>> The following things are very unclear to me (in terms of what they
>>>>>> exactly represent):
>>>>>>
>>>>>> - radical-stroke.txt
>>>>>> - learning_rate
>>>>>> - noextract_font_properties
>>>>>> - 2 percent improvement
>>>>>> - time=
>>>>>> - best error was 100 @0
>>>>>> - iteration 31/100/100
>>>>>> - rms=
>>>>>> - delta=
>>>>>> - char train=
>>>>>> - word train=
>>>>>> - skip ratio=
>>>>>> - best char error=
>>>>>>
>>>>>> And finally, here is the link
>>>>>> <https://docs.google.com/document/d/1qDqbnlptcCPVIvMOHwfNws-CQat-llZLOTHC6S94Vec/edit?usp=sharing>.
>>>>>> (The Google doc should be in English. I'm writing a wiki, so the
>>>>>> formatting is based on wiki syntax; with the link you should be able
>>>>>> to make comments.)
>>>>>> In case you are really enthusiastic about it, you can contact me for
>>>>>> write rights.
>>>>>>
>>>>>> Best Regards
>>>>>> Kristof Horvath
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>>> an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/65c42f6a-3463-4290-905c-9dcc2d9caada%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>
> --
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

