Hi Tom, thanks for your thoughts. A key reason for not training from scans is that the character set can be quite large, so it would take many pages of real scans to get even a few samples of each character. Plus I found the process of box editing quite error-prone when dealing with large sets. For my Ancient Greek training, the character set was a lot larger than it looked at first, because different combinations of diacritics can apply to many characters. Finding and scanning 'real' pages which definitely contain all characters would be difficult. I'm sure the same would be true for other scripts which make use of many diacritics.
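
To give a rough sense of how quickly diacritics inflate the set, here is a small Python sketch. It is only an illustration, not part of any Tesseract training step: it walks the Unicode range that covers the Greek and Greek Extended blocks and groups the precomposed accented letters by base letter. The function name `greek_variants` and the name-based filter are assumptions made for this example, and the characters a real training set needs depend on the texts being scanned.

```python
import unicodedata
from collections import defaultdict


def greek_variants(start=0x0370, end=0x1FFF):
    """Group precomposed Greek characters (letter + diacritics) by base letter.

    Hypothetical helper for illustration only. It relies on Unicode
    character names, keeping those of the form "GREEK ... WITH ...".
    """
    groups = defaultdict(list)
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        # Skip unassigned code points, other scripts, and unaccented letters.
        if not name.startswith("GREEK") or " WITH " not in name:
            continue
        # Canonical decomposition puts the base letter first,
        # followed by any combining marks.
        base = unicodedata.normalize("NFD", ch)[0]
        groups[base].append(ch)
    return groups


if __name__ == "__main__":
    groups = greek_variants()
    for base in sorted(groups):
        print(base, len(groups[base]), "".join(groups[base]))
    print("total accented forms:", sum(len(v) for v in groups.values()))
```

Each line of output shows one base letter, how many accented forms of it Unicode encodes, and the forms themselves. Every one of those forms needs several samples in the training data.
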
That's why this approach makes sense to me: in practice it's so time-consuming and error-prone to do it any other way.

> • font 'hints' which cause the glyph to be rendered differently at different
> resolutions
> • kerning information which affects glyph placement relative to its
> neighbors

Aren't these two arguments *for* using font information? One could encode the information for characters at a few different sizes, in a more representative fashion than you could from half a dozen examples of characters from a page scan.

> Remember
> also that the goal isn't to extract "ideal" character shapes, but rather
> *representative* shapes

Yes, and I certainly had this point driven home to me when I was trying out different fonts to include in my training. At first I just included every font on my system that had the required characters, but results were *much* worse than when I included only fonts close to the content I was scanning. I was lucky with Ancient Greek that others have done quite a lot of work creating fonts that closely match how the texts were actually printed over the centuries.

Nick

