Dear Andrew,

I've a couple of observations on your problem.

- The "standard" English language file was created from a set of
training images of common computer fonts such as Arial, Times and
Verdana, some Ghostscript fonts, and their italic and bold versions.
The characters in your book have strokes that are much thinner than
those in the above fonts. Moreover, it seems to me that your scanner's
settings made the letter strokes even thinner. These are the reasons
why Tess failed to show a good recognition rate.

- Your image is gray-scale, which means it will undergo binarization
prior to either training or recognition. Tess uses a pretty simple
Otsu binarization procedure and, although it doesn't seem to corrupt
your kind of images, it still might ruin some important character
details. To make sure it doesn't, you may use the DumpPGM() method of
the TessBaseAPI class to dump the binarized image and inspect it. But
I think it's easier in your case to set up your scanner to produce
monochrome images and rescan.

- And the most important thing. As the TrainingTesseract document
says, "training from real images is actually quite hard, due to the
spacing requirements". This is true, but the sentence lacks just a
single word before "due": "particularly". I mean there are lots of
details you need to take into account when training from real images.
Describing all the nuances is a challenging task, so the document
doesn't say much on this subject. As for your image, a closer look at
it will let you notice many character imperfections. Due to scanning
artefacts, many characters are split (broken) or even totally lack
some thin stroke segments. In some rows (say the top one or the bottom
one) the situation is even worse. You may see that characters in these
rows are randomly carved and punched; probably they are intentionally
printed pale or dithered in the paper source. From Tess's point of
view, these imperfections are important parts of the character
"prototype". Including such glyphs in the training set is usually not
beneficial and might also confuse Tess during recognition.
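
To illustrate the binarization point above, here is a rough sketch of
what a global Otsu threshold does, in plain Python. It is not
Tesseract's own code: the function names and the sample gray values
are made up for illustration, and the input is just a flat list of
8-bit gray values rather than a real image.

```python
def otsu_threshold(pixels):
    """Pick the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t = 0
    best_var = -1.0
    w0 = 0     # pixel count of the dark class so far
    sum0 = 0   # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty, nothing more to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var = var_between
            best_t = t
    return best_t


def binarize(pixels, t):
    """Map every pixel to black (0) or white (255) using threshold t."""
    return [0 if p <= t else 255 for p in pixels]


# Made-up sample: solid ink (40), a few faint stroke pixels (180),
# and paper background (230).
sample = [40] * 30 + [180] * 5 + [230] * 65
t = otsu_threshold(sample)
bw = binarize(sample, t)
```

With values like these, any stroke pixel lighter than the chosen
threshold comes out white, which is how thin or pale stroke segments
can disappear before Tess ever sees them.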

So when training Tess one should adhere to the following rule: if a
character's imperfection is unusual or random, the glyph should not be
included in the training set. Also be aware that including a character
even with a frequent imperfection may confuse Tesseract. For instance,
the letter "B" is split in such a way that it lacks its three thin
horizontal strokes, so it resembles "I3" in some fonts. If you add
this disjoint sample to the training set, you may start getting "B" as
a recognition result where the source really says "I3".

So what can you do about your images? First of all, if you have
"unlimited" access to the paper originals, you can try to eliminate
scanning artefacts as much as possible by tweaking your scanner's
settings. Second, you may try to pre-process your images. Third, you
need to prepare your .box files more carefully, i.e. make decisions
on damaged box/glyph pairs and remove the unwanted ones.
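
For the third point, a small script can at least point you at the
boxes worth looking at. The sketch below assumes the usual box file
layout of one "glyph left bottom right top" entry per line (an
optional trailing page number is ignored). The height heuristic and
the 0.5 ratio are my own assumptions, not anything Tesseract
prescribes, and the function names are hypothetical.

```python
from statistics import median


def parse_box_line(line):
    """Parse one box file line: glyph left bottom right top [page]."""
    parts = line.split()
    glyph = parts[0]
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return glyph, left, bottom, right, top


def flag_suspect_boxes(lines, min_ratio=0.5):
    """Return boxes much shorter than the typical box height.

    These are only candidates for manual review: a short box may be a
    broken fragment, but it may just as well be a period or a hyphen.
    """
    boxes = [parse_box_line(l) for l in lines if l.strip()]
    heights = [top - bottom for _, _, bottom, _, top in boxes]
    typical = median(heights)
    return [box for box, h in zip(boxes, heights) if h < min_ratio * typical]
```

You would still decide by eye which flagged entries to delete; the
script only narrows down where to look.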

It's not so straightforward, but that's all I can help you with.

Regards,
Dmitry
