Having had good success with Tesseract in a couple of projects using the default eng.traineddata (with considerable pre-processing where needed), I now find myself needing to train it for a specialized font.
I can follow the training wiki and produce working traineddata files, and I have written a .NET app to automate creating tif/box pairs from a font file. (I know there are plenty of other tools out there, but I have no desire to boot into Linux or learn Python just for this.) What I am unsure of is the best text to use for training.

I discovered what may be the default training text here: http://michaeljaylissner.com/blog/adding-new-fonts-to-tesseract-3-ocr-engine

But I have some doubts about its usefulness:

1. It contains no spaces, which seems like a bad idea.
2. It contains all sorts of characters I do not need. All I need is a-z (upper and lower case) plus 0-9.

The training wiki suggests that abcdefghijklmnopqrstuvwxyz1234567890 would be a terrible training text, and I presume this is because it needs to learn baseline metrics and other such things. However, the images I need to work with will not contain any words, just a single string, for example: ABD15657ttg2 (there is a pattern, but pattern matching is another question altogether).

The reason I need to train Tesseract is that the font is a blocky display type (think MS-DOS/terminal), and the default training data constantly misreads A as Q, among a few other confusions, no matter what pre-processing I do. I read up on unicharambigs, but since either letter may be present and there will be no dictionary words for it to take a hint from, that option seems unavailable to me. I also tried segmenting the image myself and processing one character at a time, but it still confused the same characters.

The other thing that confused me was the need to have a certain number of representations of each character in the training text. With scanned images, where there are inevitable small variances between instances of the same character, that makes sense. But digitally rendered TIFFs will all be exactly the same, so what benefit is there in repeating a character? Is the frequency used to decide between similar characters later on? For example: this letter could be an O or a D.
The letter D occurred 20 times in training, but O only appeared 7 times, so therefore D is the most likely outcome?

As I am creating tiff/box pairs programmatically, the amount of text required is trivial: 100 or 1000000 chars takes the same amount of human effort.

Of course I don't really NEED to know why this works, I just need to get it working, but as I'm likely to be using Tesseract in future projects, it would be better for me to learn the why, not just the how.

Anyway, my question remains this: for training Tesseract on a new font, with the purpose of reading non-dictionary strings, what would be a suitable training text?

Any help appreciated.
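
P.S. To make the question concrete, here is a minimal sketch of the kind of training text I could generate. It is written in Python purely for brevity (my actual tool is a .NET app), and the function name `make_training_text`, its parameters, and the figure of 20 occurrences per character are all my own illustrative assumptions, not anything taken from the training wiki. It builds space-separated pseudo-random tokens from a-z, A-Z and 0-9 so that every character appears at least a minimum number of times. Whether text like this is actually suitable is exactly what I am asking.

```python
import random
import string

# The 62 characters I need: A-Z, a-z and 0-9.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def make_training_text(min_per_char=20, token_len=12, tokens_per_line=5, seed=0):
    """Return space-separated pseudo-random tokens in which every character
    of ALPHABET occurs at least min_per_char times.

    All defaults are arbitrary assumptions chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the text is reproducible

    # Start with exactly min_per_char copies of each character, then shuffle
    # so that each character ends up next to varied neighbours.
    pool = list(ALPHABET * min_per_char)
    rng.shuffle(pool)

    # Pad with random characters so the pool splits evenly into tokens.
    while len(pool) % token_len:
        pool.append(rng.choice(ALPHABET))

    # Cut the pool into fixed-length tokens, similar in shape to ABD15657ttg2.
    tokens = ["".join(pool[i:i + token_len])
              for i in range(0, len(pool), token_len)]

    # Put a few tokens on each line, separated by single spaces.
    lines = [" ".join(tokens[i:i + tokens_per_line])
             for i in range(0, len(tokens), tokens_per_line)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_training_text())
```

The resulting text would then be rendered to tif/box pairs in the target font as usual. Raising `min_per_char` costs nothing in human effort, which is why I would like to understand whether repetition of identical rendered glyphs helps at all.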