Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

Shree Devi Kumar Wed, 06 Aug 2014 08:21:38 -0700

> My current plan for documentation is as follows:
>
> - Rewrite and simplify TrainingTesseract3 on the wiki
> - Write manpages for each tool in training/
> - Document how each training file is used, and what it contains
>
> Does that sound good to people? I'll take silence from the list to
> mean "that sounds perfect in every way, you wonderful man." ;)



Thanks, Nick. That's great. You should probably have separate sections for
training 3, 3.02, 3.03, 3.03.03, etc., since the method has changed quite
a bit.

BTW, do you know if the new training tools can be compiled on Windows, or do
I need to get access to Linux somewhere to give them a try?





Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, Aug 6, 2014 at 8:23 PM, Nick White <nick.wh...@durham.ac.uk> wrote:

> Hi Albrecht,
>
> Sorry for not replying sooner, I've been away.
>
> > Nevertheless I read a post from Ray where he says that he receives
> > millions of emails and the last thing he likes to do is writing long
> > texts (email responses or documentations). I think this is a fatal
> > situation, because if he is the only one who really knows the code,
> > he is predestined to write that documentation. But I understood that
> > he is not motivated to do that. He is testing new classifiers rather
> > than caring about what is already done.
>
> Ah, but others can work to figure out how the code and tools work,
> and slowly but surely piece together documentation. Also, Ray is
> good at explaining when he has the time. I agree it isn't an ideal
> situation, but think we can fix it.
>
>
> > I studied the code of the set_unicharset_properties tool.
> > But this is a very basic tool. It only sets the basic properties.
> > The min/max values don't get touched
>
> This is wrong, actually. The unicharset.SetPropertiesFromOther()
> function called in set_unicharset_properties copies all properties
> from any copy of the character found in the script_dir. As I
> mentioned in my previous message to this thread, set the script_dir
> to the training/langdata directory and the data from all the
> .unicharset files there will be pulled in as appropriate.
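
For anyone who wants to script this step, here is a minimal Python sketch
that only assembles the command line described above. The --script_dir
option is the one mentioned in this thread; the -U (input) and -O (output)
flags and the file names are assumptions on my part, so check the tool's
own usage message before relying on them. Nothing is executed here.

```python
import shlex


def build_set_properties_cmd(input_unicharset, output_unicharset, script_dir):
    """Return an argv list for set_unicharset_properties (not executed)."""
    return [
        "set_unicharset_properties",
        "-U", input_unicharset,        # assumed flag: unicharset to read
        "-O", output_unicharset,       # assumed flag: unicharset to write
        "--script_dir=" + script_dir,  # directory holding the *.unicharset files
    ]


# Hypothetical file names; training/langdata is the directory named above.
cmd = build_set_properties_cmd("unicharset", "output_unicharset",
                               "training/langdata")
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once
the flags have been confirmed against your build of the tool.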
>
> > I'm sure that there must exist a tool
> > (that is not published) that obtains them, because the han.unicharset
> > has 23514 characters defined - all with min / max values set. Or do
> > you think that someone has edited 23514 characters manually?
>
> Ultimately, yes, there must be an unpublished tool that obtains the
> metrics that exist in the training/langdata directory. I suspect it
> looks quite like the pango based proof of concept I attached to a
> previous mail on this thread (charmetrics.c).
>
> > It is not the way open source projects should work.
>
> So, you pick yourself up and jump in! That's how open source
> projects should work. Patches are welcomed :)
>
> > > Are there particular things you'd like
> > > documented, that I could start on?
> >
> > I would like to generate unicharset files automatically, but I don't
> > know how to calculate the min/max values.
>
> As I say, you can get good general figures by using the --script_dir
> option with set_unicharset_properties. I think we're clear now on
> the general definitions of all the fields.
>
> To calculate the min/max values for specific fonts where they may be
> very different, I recommend you try the charmetrics.c tool I posted,
> and compare the output to what you get without it.
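
To make that comparison less tedious, here is a rough Python sketch that
parses the min/max field of a unicharset entry and reports which values
differ between two runs. It assumes the field is ten comma-separated
integers in the order min_bottom, max_bottom, min_top, max_top, min_width,
max_width, min_bearing, max_bearing, min_advance, max_advance; that
ordering is my understanding of the format, not something confirmed in
this thread, and the sample numbers are made up for illustration.

```python
from typing import Dict, NamedTuple, Tuple


class GlyphMetrics(NamedTuple):
    # Assumed field order; verify against your unicharset files.
    min_bottom: int
    max_bottom: int
    min_top: int
    max_top: int
    min_width: int
    max_width: int
    min_bearing: int
    max_bearing: int
    min_advance: int
    max_advance: int


def parse_metrics(field: str) -> GlyphMetrics:
    """Parse a comma-separated min/max field into named values."""
    values = [int(v) for v in field.split(",")]
    if len(values) != len(GlyphMetrics._fields):
        raise ValueError("expected 10 values, got %d" % len(values))
    return GlyphMetrics(*values)


def diff_metrics(a: GlyphMetrics, b: GlyphMetrics) -> Dict[str, Tuple[int, int]]:
    """Return {field_name: (value_in_a, value_in_b)} for fields that differ."""
    return {
        name: (x, y)
        for name, x, y in zip(GlyphMetrics._fields, a, b)
        if x != y
    }


# Hypothetical values: general figures vs. figures measured for one font.
general = parse_metrics("58,69,158,177,128,157,0,39,116,130")
font_specific = parse_metrics("58,69,158,180,128,157,0,39,116,130")
print(diff_metrics(general, font_specific))
```

An empty result means the font-specific run produced the same figures as
the general ones for that character.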
>
> > If you want an idea where to start with: I think a good starting point
> > would be to explain what all these training files are good for and what
> > they do exactly.
> > What is INTTEMP, what values does it contain exactly, how is it
> > generated in the training process and how is it used in recognition?
> > What is PFFMTABLE good for, NORMPROTO etc.
> >
> > And then the DAWG files.
> > I still did not understand in which step of the recognition the Number
> > DAWG is used. (Did you see the weird things it contains?)
> > And what is the PUNC DAWG good for, how is it used exactly? How should I
> > generate the values in it?
> > What is the difference between a flat shape table and a clustered
> > shapetable?
>
> These are all good points, and good places to start, thank you.
>
> My current plan for documentation is as follows:
>
> - Rewrite and simplify TrainingTesseract3 on the wiki
> - Write manpages for each tool in training/
> - Document how each training file is used, and what it contains
>
> Does that sound good to people? I'll take silence from the list to
> mean "that sounds perfect in every way, you wonderful man." ;)
>
> Nick
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/20140806145323.GG7804%40manta.lan
> .
> For more options, visit https://groups.google.com/d/optout.
>

