Thank you Shree, that helps.

On Thursday, February 7, 2019 at 17:31:24 UTC+1, shree wrote:
>
> >> iteration 31/100/100
>
> See
> https://github.com/tesseract-ocr/tesseract/blob/3a7f5e4de459f4c64f36e08b18ce1b66b1fbc876/src/lstm/lstmtrainer.cpp#L410
>
> // Appends <intro_str> iteration learning_iteration()/training_iteration()/
> // sample_iteration() to the log_msg.
> void LSTMTrainer::LogIterations(const char* intro_str, STRING* log_msg) const {
>   *log_msg += intro_str;
>   log_msg->add_str_int(" iteration ", learning_iteration());
>   log_msg->add_str_int("/", training_iteration());
>   log_msg->add_str_int("/", sample_iteration());
> }
>
> >> radical-stroke.txt
>
> See
> https://github.com/tesseract-ocr/tesseract/blob/5fdaa479da2c52526dac1281871db5c4bdaff359/src/training/lang_model_helpers.h#L49
>
> // If pass_through is true, then the recoder will be a no-op, passing the
> // unicharset codes through unchanged. Otherwise, the recoder will "compress"
> // the unicharset by encoding Hangul in Jamos, decomposing multi-unicode
> // symbols into sequences of unicodes, and encoding Han using the data in the
> // radical_table_data, which must be the content of the file:
> // langdata/radical-stroke.txt.
>
> Even though it is only used for training Han languages, tesseract gives an
> error for other languages too if the file is not found.
>
> On Thu, Feb 7, 2019 at 9:38 PM Kristóf Horváth <[email protected]> wrote:
>
>> Thanks, shree. I will check it out tomorrow, but please, can you give some
>> personal feedback?
>> Also, I left out training from scratch because it requires a serious
>> amount of sample data and a newbie won't have that, but I will definitely
>> dig into this guide.
>>
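A quick sketch of what that LogIterations() format boils down to; the comments on each counter are my reading of the lstmtrainer.h source comments, not something stated in this thread, so treat them as assumptions:

```shell
#!/bin/sh
# Sketch only: reproduces the "iteration 31/100/100" log format.
# Counter meanings below are assumptions taken from lstmtrainer.h comments.
learning_iteration=31    # iterations that yielded a non-zero delta (real learning)
training_iteration=100   # total backward training steps run so far
sample_iteration=100     # index into the training sample set
printf 'At iteration %d/%d/%d\n' \
  "$learning_iteration" "$training_iteration" "$sample_iteration"
# prints: At iteration 31/100/100
```

So in "31/100/100", the first number lags behind the other two whenever a training step produced no measurable improvement.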
>> On Thursday, February 7, 2019 at 16:43:11 UTC+1, shree wrote:
>>>
>>> You may want to see the following guide (found using a Google search):
>>>
>>> https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch
>>>
>>> On Thu, 7 Feb 2019, 19:44 Kristóf Horváth <[email protected]> wrote:
>>>
>>>> Dear Lorenzo,
>>>>
>>>> thank you for your input, it is very much appreciated. I will go through
>>>> your suggestions, because I have questions and points to clarify.
>>>>
>>>>> This thread about the font size is where I got the 30/40px indication:
>>>>>
>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ
>>>>>
>>>>> For my trainings (fine tuning) I used 48px (with a 2px white border, so
>>>>> the text was about 44px); maybe the size does not matter much if you do
>>>>> fine tuning, but I never did a precise comparison. Maybe 48 is even
>>>>> better. The white border probably was not important.
>>>>>
>>>>> One thing to keep in mind is that IMO there is not THE correct way to
>>>>> train, because different fonts or different types of images (contrast,
>>>>> noise, etc.) may work best with different parameters. So you need to
>>>>> experiment a little with these if you want optimal results.
>>>>>
>>>>> This leads to the most important part: "Am I done training?" Without
>>>>> this you are just wasting time.
>>>>
>>>> I don't exactly get what you wanted to point out, but the link to the
>>>> source of the picture specification helps and I will try to digest it too.
>>>>
>>>>> What I describe in this post is not completely correct due to the way
>>>>> ocrd works (I should discuss this on github to see if it should be
>>>>> fixed or not).
>>>>>
>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/COJ4IjcrL6s/C1OeE9bWBgAJ
>>>>>
>>>>> The basic idea of any machine learning training is this: split the data
>>>>> in two parts, use one for training and use the other to check the
>>>>> result. The idea is that if you train too much on only a few things you
>>>>> get exceptionally good on these, but you overspecialize and get worse
>>>>> at all the rest (this is called overfitting). So you get 99.999%
>>>>> accuracy on the training set and 74% on the eval set and on real world
>>>>> data, which is what really matters (real world is usually a little
>>>>> worse than eval).
>>>>>
>>>>> The problem I found is that ocrd recreates the files list.train and
>>>>> list.eval every time you run it (it was not designed for incremental
>>>>> training, I think). So, if you follow my instructions, you'll mix the
>>>>> train and eval files, and this is bad.
>>>>>
>>>>> So I modified the ocrd Makefile to create these two files explicitly at
>>>>> the beginning of the training (and only once).
>>>>>
>>>>> This is the edit (about line 80):
>>>>>
>>>>> # Create lists of lstmf filenames for training and eval
>>>>> #lists: $(ALL_LSTMF) data/list.train data/list.eval
>>>>> lists: $(ALL_LSTMF)
>>>>>
>>>>> train-lists: data/list.train data/list.eval
>>>>>
>>>>> Now you need to call "make train-lists" only once, when you start a new
>>>>> training session with new data (not after each "iteration step").
>>>>
>>>> Thanks for writing the train/eval part down; I had the concept, it's
>>>> just that I couldn't put it into proper words.
>>>> Thank you for fixing the Makefile. I will include this in my
>>>> documentation for sure.
>>>>
>>>>> Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have
>>>>> some data (1000/10000 samples) do an 80/20. If you have a ton of data
>>>>> (100k+ samples) 90/10 or even 95/5 may be fine.
>>>>
>>>> This is super useful info.
>>>>
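The one-time split that "make train-lists" is meant to produce can be sketched as a small shell script. The data/ paths, the demo file names, and the truncating division are illustrative assumptions here, not ocrd's exact recipe:

```shell
#!/bin/sh
# Sketch only: split .lstmf files into list.train / list.eval exactly once.
# Paths, demo file names, and rounding are assumptions, not ocrd's recipe.
RATIO_TRAIN=0.80                                  # 80/20 for a small data set

mkdir -p data
for i in $(seq 1 10); do : > "data/sample$i.lstmf"; done   # demo files only

ls data/*.lstmf | sort > data/all-lstmf
total=$(wc -l < data/all-lstmf)
train=$(awk -v t="$total" -v r="$RATIO_TRAIN" 'BEGIN { printf "%d", t * r }')

head -n "$train" data/all-lstmf          > data/list.train
tail -n +"$((train + 1))" data/all-lstmf > data/list.eval
```

With the 10 demo files this leaves 8 names in list.train and 2 in list.eval. The point is that the split happens once, so later incremental runs keep evaluating against the same eval set instead of silently mixing train and eval files.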
>>>>> About PSM: I did my training with PSM 6, but for one model (the most
>>>>> complex one, out of 8) I found that using PSM 13 when doing the
>>>>> recognition gives better results for punctuation and other special
>>>>> characters.
>>>>> Again, I do not know how much difference the PSM param makes during
>>>>> training. From what I understand, PSM 6 does some custom
>>>>> cleanup/preprocessing on the images, while PSM 13 leaves them untouched
>>>>> (completely?).
>>>>
>>>> I read the same thing, that 13 (PSM.RAW_LINE) is the most efficient one
>>>> for training, and I am somewhat sure that 13 leaves them untouched (it
>>>> wasn't me who researched the segmentation modes, but he says it just
>>>> takes the "rawest" form of the line).
>>>>
>>>>> About the parameters you listed in your post: I know the meaning of a
>>>>> few of them, but I think that in general they are quite useless (or you
>>>>> need to understand more to mess with them). What I mostly refer to is
>>>>> the output from lstmeval. char train and word train are the errors on
>>>>> the recognition; these are probably the only ones to look at as a
>>>>> reference (but they refer to the training data, not the eval data).
>>>>> best char error is the best so far; the training is noisy and goes up
>>>>> and down. delta is probably the variation from the previous output, and
>>>>> rms is the root mean square of something. In other words, you do not
>>>>> really need to understand all of them to do the training.
>>>>
>>>> Yes, they are mostly useless, but I'm writing documentation, and if I
>>>> say "include this flag or that variable" then I would like to include a
>>>> definition for that flag or parameter. I am mostly interested in 3
>>>> questions concerning the variables and flags I pointed out:
>>>>
>>>> - What does this file look like?
>>>> - What does it do?
>>>> - How can I create it?
>>>>
>>>> My problem with lstmeval is mostly a small confusion I just want to
>>>> clarify.
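As an aside on those log fields: a training checkpoint line can be picked apart with a couple of sed calls. The sample line below only mimics the usual lstmtraining checkpoint format, and every value in it is invented:

```shell
#!/bin/sh
# Sketch only: extract the error fields from an lstmtraining-style log line.
# The sample line mimics the usual checkpoint format; all values are invented.
line='At iteration 31/100/100, Mean rms=3.2%, delta=12.1%, char train=8.5%, word train=22.1%, skip ratio=0%, New best char error = 8.5'

char_err=$(printf '%s\n' "$line" | sed -n 's/.*char train=\([0-9.]*\)%.*/\1/p')
word_err=$(printf '%s\n' "$line" | sed -n 's/.*word train=\([0-9.]*\)%.*/\1/p')

# Both fields are error rates, so higher values mean more mistakes.
echo "char error: ${char_err}%  word error: ${word_err}%"
```

Since char train and word train are error rates on the training data, a value that trends down over checkpoints is what you want to see.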
>>>> For example: char train and word train, if they are high it means that
>>>> there are a lot of errors, right? (The same goes for best char error.)
>>>> Oh, and those outputs you said I don't need for training (like rms): I
>>>> would still like to know what they are, even if I only get one confusing
>>>> sentence each, because there should be a definition for them.
>>>>
>>>>> One iteration means one image, so max_iterations should be at least
>>>>> equal to your number of images. If you have a ton of images you may see
>>>>> that you do not need to process all of them to reach the "saturation"
>>>>> point where extra training is useless, but normally you want to process
>>>>> all of them, even a few times (until the eval score stabilizes or gets
>>>>> worse for a few iterations).
>>>>
>>>> Thank you for writing this down, because I came to the same conclusion
>>>> and it's just nice to hear it from you. But my question was actually
>>>> referring to the lstmeval output. It prints an iteration number like
>>>> this: iteration 31/100/100. So can you tell me what the 3 numbers
>>>> represent?
>>>>
>>>>> One note: if you repeat the whole training multiple times (for example
>>>>> trying different image sizes) you need to keep the list.train/eval
>>>>> files aside, otherwise you compare against a different set of eval
>>>>> images (and with a little data set this can make a big difference).
>>>>
>>>> Good note. This warning definitely belongs in the newbie guide.
>>>>
>>>>> Another note: while you fine tune (specialize) on a new "font(s)" you
>>>>> get a little worse on all the others. If you care about other fonts
>>>>> too, you should check on them with lstmeval too.
>>>>
>>>> Very good note. I am planning to make the Overview about training longer
>>>> by adding a section that just talks about the mechanics of training.
>>>> (Things like what the ratio for train/eval should be, or how many
>>>> iterations.)
>>>> I know that there is no exact answer, no "this is the best for this". I
>>>> know, but as I was doing research I found a lot of advice that held true
>>>> for specific setups, and I will try to collect a few of these just to
>>>> give a nice example of how you should think about your training.
>>>> ----
>>>> So my further plans are simple:
>>>>
>>>> - rework most things in the wiki (this is a general goal)
>>>> - add more flavour text in certain places (this will require testing the
>>>>   guide on actual people; I have monkeys for testing my guide, but I
>>>>   wouldn't mind if somebody on the forum tried it and gave feedback like
>>>>   you did, Lorenzo)
>>>> - collect general errors and common mistakes
>>>>
>>>> Once again, thank you for your input, and I am eagerly waiting for your
>>>> reply, Lorenzo.
>>>>
>>>> On Thursday, February 7, 2019 at 13:26:49 UTC+1, Lorenzo Blz wrote:
>>>>>
>>>>> Hi Kristof,
>>>>> good work, I thought about it a few times. I gave it a quick look; just
>>>>> a couple of quick notes, I'll try to read it better when I get time.
>>>>>
>>>>> Bye
>>>>>
>>>>> Lorenzo
>>>>>
>>>>> On Thu, Feb 7, 2019 at 09:36 Kristóf Horváth <[email protected]> wrote:
>>>>>
>>>>>> Hi, I set out to make a newbie-friendly guide and I already have some
>>>>>> stuff that might help people, but it's not complete yet. I would like
>>>>>> people to read it and help out with comments where they can. I left
>>>>>> places empty or left notes of my own; please feel free to figure out
>>>>>> what should be there. I really hope I didn't make big mistakes, but in
>>>>>> case I did write something stupid, please share it in the form of
>>>>>> constructive criticism.
>>>>>> The following things are very unclear to me (in terms of what they
>>>>>> exactly represent):
>>>>>>
>>>>>> - radical-stroke.txt
>>>>>> - learning_rate
>>>>>> - noextract_font_properties
>>>>>> - 2 percent improvement
>>>>>> - time=
>>>>>> - best error was 100 @0
>>>>>> - iteration 31/100/100
>>>>>> - rms=
>>>>>> - delta=
>>>>>> - char train=
>>>>>> - word train=
>>>>>> - skip ratio=
>>>>>> - best char error=
>>>>>>
>>>>>> And finally, here is the link
>>>>>> <https://docs.google.com/document/d/1qDqbnlptcCPVIvMOHwfNws-CQat-llZLOTHC6S94Vec/edit?usp=sharing>.
>>>>>> (The Google doc should be in English. I'm writing a wiki, so the
>>>>>> formatting is based on wiki syntax; with the link you should be able
>>>>>> to make comments.)
>>>>>> In case you are really enthusiastic about it, you can contact me for
>>>>>> write rights.
>>>>>>
>>>>>> Best Regards
>>>>>> Kristof Horvath
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>>> an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/65c42f6a-3463-4290-905c-9dcc2d9caada%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>
> --
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

