Training tesseract 3.01 with new font, for reading non dictionary strings - ideal training text?

Adam Chapam Fri, 19 Oct 2012 02:39:37 -0700

Having had good levels of success with tesseract in a couple of projects 
using the default eng.traineddata (with considerable pre-processing where 
needed), I now find myself needing to train it for a specialized font.


I can follow the training wiki and produce working traineddata files, and 
I have written a .NET app to automate creating tif/box pairs from a font 
file. (I know there are plenty of other tools out there, but I have no 
desire to boot into Linux or learn Python just for this.) However, I am 
unsure of the best text to use for training.
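
For illustration, a minimal Python sketch of writing one box-file line for 
a rendered glyph. It assumes the tesseract 3.x box layout of 
"char left bottom right top page" with the y axis measured upward from the 
bottom of the image. The function name box_line and its parameters are 
hypothetical and are not taken from the .NET app mentioned above.

```python
def box_line(char, x, y_top, width, height, image_height, page=0):
    """Format one box-file line from a glyph rectangle given in
    top-left image coordinates (as most drawing APIs report them)."""
    left = x
    right = x + width
    # Assumption: box files measure y upward from the bottom edge of the
    # image, so the top-left coordinates have to be flipped.
    top = image_height - y_top
    bottom = image_height - (y_top + height)
    return f"{char} {left} {bottom} {right} {top} {page}"


# A 30x40 glyph drawn at (10, 20) on a 100-pixel-high page.
print(box_line("A", 10, 20, 30, 40, image_height=100))
```

The flip matters because a renderer that places glyphs from the top-left 
corner would otherwise produce boxes that are mirrored vertically.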

I discovered what may be the default training text here: 
http://michaeljaylissner.com/blog/adding-new-fonts-to-tesseract-3-ocr-engine
But I have some doubts about its usefulness:

1. It contains no spaces, which surely seems like a bad idea?
2. It contains all sorts of characters I do not need. All I need is a-z 
(upper and lower case) plus 0-9.

The training wiki suggests that abcdefghijklmnopqrstuvwxyz1234567890 would 
be a terrible training text, and I presume this is because tesseract needs 
to learn baseline metrics and other such things. However, the images I 
need to work with will not contain any words, just a single string, for 
example:

ABD15657ttg2


(There is a pattern, but pattern matching is another question altogether.)
The reason I need to train tesseract is that the font is a blocky 
display type (think MS-DOS/terminal), and the default training data 
constantly interprets A as Q, along with a few other such confusions, no 
matter what pre-processing I do. I read up on unicharambigs, but as either 
letter may be present, and there will be no dictionary words for it to 
take a hint from, that option seems unavailable to me.
I also tried segmenting the image myself and processing one character at 
a time, but it still confused the same characters.

The other thing that confused me was the need to have a certain number of 
representations of each character in the training text. With scanned 
images, where there are inevitable small variances between instances of 
the same character, that makes sense. But digitally rendered tiffs will 
all be exactly the same, so what benefit is there in repeating a 
character? Is the frequency used to decide between similar characters 
later on? For example: this letter could be an O or a D. The letter D 
occurred 20 times in training, but O only appeared 7 times, so therefore 
D is the most likely outcome?

As I am creating tiff/box pairs programmatically, the amount of text 
required is not a constraint: 100 or 1000000 chars takes the same amount 
of human effort.
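
For illustration, one way such a text could be generated is sketched below 
in Python. It is only a sketch under assumptions, not a recommendation 
confirmed by the training wiki: the function name make_training_text, the 
repeat count, and the token lengths are all hypothetical choices. It 
shuffles a pool in which each of a-z, A-Z and 0-9 appears the same number 
of times, then splits the pool into space-separated pseudo-strings similar 
in shape to ABD15657ttg2.

```python
import random
import string


def make_training_text(repeats=20, min_len=4, max_len=12, seed=0):
    """Return space-separated random strings in which every character of
    a-z, A-Z and 0-9 occurs exactly `repeats` times."""
    rng = random.Random(seed)  # fixed seed so the text is reproducible
    alphabet = string.ascii_letters + string.digits  # 62 characters
    pool = list(alphabet) * repeats
    rng.shuffle(pool)

    tokens = []
    i = 0
    while i < len(pool):
        # Pick a token length; the final token may come out shorter
        # than min_len if the pool runs out.
        n = rng.randint(min_len, max_len)
        tokens.append("".join(pool[i:i + n]))
        i += n
    return " ".join(tokens)


print(make_training_text(repeats=2))
```

Because every character is drawn from the same pool, no character is 
favoured by frequency, and the spaces give word boundaries without adding 
any dictionary words.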

Of course I don't really NEED to know why this works, I just need to get it 
working, but as I'm likely to be using tesseract in future projects it 
would be better for me to learn the why, not just the how.

Anyway, my question remains this:

For training tesseract on a new font, with the purpose of reading 
non-dictionary strings, what would be a suitable training text?

Any help appreciated.

