> My current plan for documentation is as follows:
>
> - Rewrite and simplify TrainingTesseract3 on the wiki
> - Write manpages for each tool in training/
> - Document how each training file is used, and what it contains
>
> Does that sound good to people? I'll take silence from the list to
> mean "that sounds perfect in every way, you wonderful man." ;)

Thanks, Nick. That's great.

You should probably have separate sections for training 3, 3.02, 3.03, 3.03.03, etc., since the method has changed quite a bit.

BTW, do you know whether the new training tools can be compiled on Windows, or do I need to get access to Linux somewhere to give them a try?

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Aug 6, 2014 at 8:23 PM, Nick White <nick.wh...@durham.ac.uk> wrote:

> Hi Albrecht,
>
> Sorry for not replying sooner, I've been away.
>
> > Nevertheless I read a post from Ray where he says that he receives millions of
> > emails and the last thing he likes to do is writing long texts (email responses
> > or documentation). I think this is a fatal situation, because if he is the
> > only one who really knows the code, he is predestined to write that
> > documentation. But I understood that he is not motivated to do that. He is
> > testing new classifiers rather than caring about what is already done.
>
> Ah, but others can work to figure out how the code and tools work,
> and slowly but surely piece together documentation. Also, Ray is
> good at explaining when he has the time. I agree it isn't an ideal
> situation, but think we can fix it.
>
> > I studied the code of the set_unicharset_properties tool.
> > But this is a very basic tool. It only sets the basic properties.
> > The min/max values don't get touched
>
> This is wrong, actually. The unicharset.SetPropertiesFromOther()
> function called in set_unicharset_properties copies all properties
> from any copy of the character found in the script_dir. As I
> mentioned in my previous message to this thread, set the script_dir
> to the training/langdata directory and the data from all the
> .unicharset files there will be pulled in as appropriate.
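
For anyone following along, here is a rough Python sketch of what "copies all properties from any copy of the character found in the script_dir" amounts to conceptually. It is not Tesseract's code and does not read real .unicharset files: the dict-based charset model, the function name copy_properties, the property names min_top/max_top and all the numbers are made up for illustration, and the "first reference wins" rule is my assumption.

```python
# Conceptual sketch only; this is NOT Tesseract's code or file format.
# A "charset" here is a hypothetical dict: character -> dict of properties.

def copy_properties(target, references):
    """Return a copy of `target` in which every character that also appears
    in one of `references` takes over that reference's properties.

    Assumption: the first reference containing a character wins.
    """
    result = {ch: dict(props) for ch, props in target.items()}
    filled = set()
    for ref in references:
        for ch, props in ref.items():
            if ch in result and ch not in filled:
                result[ch].update(props)
                filled.add(ch)
    return result


# Made-up values, purely for illustration.
target = {
    "a": {"min_top": 0, "max_top": 255},
    "#": {"min_top": 0, "max_top": 255},
}
latin = {
    "a": {"min_top": 60, "max_top": 80},
    "b": {"min_top": 60, "max_top": 200},
}

merged = copy_properties(target, [latin])
print(merged["a"])  # properties taken over from `latin`
print(merged["#"])  # unchanged: no reference entry for this character
```

Characters that exist only in a reference set ("b" above) are not added to the result; only entries already present in the target get their properties filled in.
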
>
> > I'm sure that there must exist a tool (that is not published) that obtains
> > them, because the han.unicharset has 23514 characters defined - all with
> > min / max values set. Or do you think that someone has edited 23514
> > characters manually ?
>
> Ultimately, yes, there must be an unpublished tool that obtains the
> metrics that exist in the training/langdata directory. I suspect it
> looks quite like the pango based proof of concept I attached to a
> previous mail on this thread (charmetrics.c).
>
> > It is not the way open source projects should work.
>
> So, you pick yourself up and jump in! That's how open source
> projects should work. Patches are welcomed :)
>
> > > Are there particular things you'd like
> > > documented, that I could start on?
> >
> > I would like to generate unicharset files automatically, but I don't know how
> > to calculate the min/max values.
>
> As I say, you can get good general figures by using the --script_dir
> option with set_unicharset_properties. I think we're clear now on
> the general definitions of all the fields.
>
> To calculate the min/max values for specific fonts where they may be
> very different, I recommend you try the charmetrics.c tool I posted,
> and compare the output to what you get without it.
>
> > If you want an idea where to start with: I think a good starting point would be
> > to explain what all these training files are good for and what they do exactly.
> > What is INTTEMP, what values does it contain exactly, how is it generated in
> > the training process and how is it used in recognition ?
> > What is PFFMTABLE good for, NORMPROTO etc.
> >
> > And then the DAWG files.
> > I still did not understand in which step of the recognition the Number DAWG is
> > used. (Did you see the weird things it contains?)
> > And what is the PUNC DAWG good for, how is it used exactly ? How should I
> > generate the values in it ?
> > What is the difference between a flat shape table and a clustered shapetable ?
>
> These are all good points, and good places to start, thank you.
>
> My current plan for documentation is as follows:
>
> - Rewrite and simplify TrainingTesseract3 on the wiki
> - Write manpages for each tool in training/
> - Document how each training file is used, and what it contains
>
> Does that sound good to people? I'll take silence from the list to
> mean "that sounds perfect in every way, you wonderful man." ;)
>
> Nick
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/20140806145323.GG7804%40manta.lan.
> For more options, visit https://groups.google.com/d/optout.

--
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduV2aUSCsuuyednh9j20McdeVM2A2SG1NtYaxLtOBT5gwA%40mail.gmail.com.