Re: Thoughts on having the training process take font files directly

Nick White Fri, 12 Oct 2012 06:41:38 -0700

Hi Tom, thanks for your thoughts.

A key reason for not training from scans is that when the character
set is large, it takes many pages of real scans to get even a few
samples of each character. I also found box editing quite
error-prone with large sets. For my Ancient Greek training, the many
combinations of diacritics that can apply to each character made the
character set a lot larger than it looked at first. Finding and
scanning 'real' pages that definitely contain every character would
be difficult. I'm sure the same is true for other scripts that make
heavy use of diacritics.
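
To give a rough sense of how quickly the set grows, here is a minimal
sketch. It is an illustration only, not the character list used for
the Ancient Greek training: the choice of vowels and combining marks
and the name diacritic_forms are assumptions made for the example.

```python
import itertools
import unicodedata

# Illustrative inventory only; not the character list from the actual training.
VOWELS = "αεηιουω"
BREATHINGS = ["", "\u0313", "\u0314"]         # none, smooth, rough
ACCENTS = ["", "\u0301", "\u0300", "\u0342"]  # none, acute, grave, circumflex
IOTA_SUBSCRIPT = ["", "\u0345"]               # none, ypogegrammeni


def diacritic_forms(vowels=VOWELS):
    """Return every vowel + breathing + accent + iota-subscript string, in NFC."""
    forms = set()
    for base, breathing, accent, iota in itertools.product(
        vowels, BREATHINGS, ACCENTS, IOTA_SUBSCRIPT
    ):
        forms.add(unicodedata.normalize("NFC", base + breathing + accent + iota))
    return forms


forms = diacritic_forms()
# Forms that Unicode encodes as a single precomposed code point.
precomposed = {f for f in forms if len(f) == 1}
print(len(VOWELS), "base vowels ->", len(forms), "candidate forms,",
      len(precomposed), "of them precomposed")
```

Many of these combinations never occur in real Greek (a circumflex on
ε, for instance), while capitals, diaeresis and length marks are left
out entirely, so the count is only indicative.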


That's why this approach makes sense to me: in practice it is too
time-consuming and error-prone to do it any other way.

>   • font 'hints' which cause the glyph to be rendered differently at different
>     resolutions
>   • kerning information which affects glyph placement relative to its
>     neighbors 

Aren't these two arguments *for* using font information? One could
encode the information for characters at a few different sizes, in
a more representative fashion than you could from half a dozen
examples of characters from a page scan.

> Remember
> also that the goal isn't to extract "ideal" character shapes, but rather
> *representative* shapes

Yes, and I certainly had this point driven home to me when I was
trying out different fonts to include in my training. At first I
just included every font on my system that had the required
characters, but results were *much* worse than when only including
fonts close to the content I was scanning. I was lucky with Ancient
Greek that there has been quite a lot of work done by others in
creating fonts that closely map to how the stuff was actually
printed over the centuries.

Nick
