Dear Andrew, I've a couple of observations on your problem.
- The "standard" English language file was trained on images of well-known computer fonts such as Arial, Times, Verdana and some Ghostscript fonts, together with their italic and bold versions. The characters in your book have strokes that are much thinner than those in the above fonts. Moreover, it seems to me that your scanner's settings made the letter strokes even thinner. These are the reasons why Tess failed to show a good recognition rate.

- Your image is gray-scale, which means it will undergo binarization prior to either training or recognition. Tess uses a pretty simple Otsu binarization procedure, and although it doesn't seem to corrupt your kind of images, it still might ruin some important character details. To make sure it doesn't, you may use the DumpPGM() method of the TessBaseAPI class to dump the binarized image to a file and inspect it. But I think it's easier in your case to set up your scanner to produce monochrome images and rescan.

- And the most important thing. As the TrainingTesseract document says, "training from real images is actually quite hard, due to the spacing requirements". This is true, but the sentence lacks a single word before "due": "particularly". I mean there are lots of details you need to take into account when training from real images. Describing all the nuances is a challenging task, so the document doesn't say much on this subject.

As for your image, a closer look at it will reveal many character imperfections. Due to scanning artefacts, many characters are split (broken) or even totally lack some thin stroke segments. In some rows (say, the top one or the bottom one) the situation is even worse: the characters in these rows look randomly carved and punched; probably they were intentionally printed pale or dithered in the paper source. In fact, from Tess's point of view, these imperfections are important parts of the character "prototype".
Including such glyphs in the training set is usually not beneficial and might also confuse Tess during recognition. So when training Tess one should adhere to the following rule: if a character's imperfection is unusual or random, the glyph should not be included in the training set. Also be aware that including a character even with a frequent imperfection may confuse Tesseract. For instance, the letter "B" is split in such a way that it lacks the three thin horizontal strokes, so it resembles "I3" in some fonts. If you add this disjoint sample to the training set, you may start getting "B" as a recognition result where the source really has "I3".

So what can you do about your images? First of all, if you have "unlimited" access to the paper originals, you can try to eliminate scanning artefacts as much as possible by tweaking your scanner's settings. Second, you may try to pre-process your images. Third, you need to prepare your .box files more carefully, i.e. make decisions on damaged box/glyph pairs and remove the unwanted ones.

Not so straightforward, but that's all I can help you with.

Regards,
Dmitry
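
P.S. In case it helps to see what an Otsu-style binarization does to thin, pale strokes, here is a minimal sketch in Python. It is not Tesseract's actual code, only the textbook form of the method: it picks the gray level that maximizes the between-class variance of the histogram. The function names otsu_threshold and binarize are made up for illustration, the input is assumed to be a flat list of 8-bit gray values (0 = black, 255 = white), and reading the scan into such a list is left out.

```python
def otsu_threshold(pixels):
    """Return a threshold t (0..255); values <= t form the dark class.

    Textbook Otsu: choose t that maximizes the between-class variance.
    Assumes `pixels` is a non-empty iterable of integers in 0..255.
    """
    hist = [0] * 256
    total = 0
    for p in pixels:
        hist[p] += 1
        total += 1

    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t = 0
    best_between = -1.0
    w0 = 0    # number of pixels with value <= t (dark class)
    sum0 = 0  # sum of gray values in the dark class

    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between = between
            best_t = t
    return best_t


def binarize(pixels, threshold):
    """Map gray values to 0 (ink) or 1 (paper) using the given threshold."""
    return [0 if p <= threshold else 1 for p in pixels]


if __name__ == "__main__":
    # Hypothetical toy data: dark ink, a few pale thin-stroke pixels, light paper.
    sample = [30] * 20 + [150] * 5 + [230] * 75
    t = otsu_threshold(sample)
    ink = binarize(sample, t).count(0)
    print("threshold:", t, "ink pixels:", ink)
```

If the pale stroke pixels in a toy example like this end up on the paper side of the threshold, that is the kind of lost detail I mean above, and it is why rescanning in monochrome with scanner settings you control may be the simpler route.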