This is a known issue with Tesseract. One workaround is to post-process
the OCR results: detect the size discrepancy between the two parts of
the line, then re-process each part as a separate image. In essence,
doing that prevents Tesseract from drawing bad inferences.
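
As a rough sketch of the detection step (not tested against your images):
assume you already have the word bounding boxes for one text line as
(text, left, top, width, height) tuples in reading order. That tuple
layout, the function names, the 1.6 height ratio and the 4-pixel margin
are all assumptions of mine for illustration, not anything Tesseract
defines. The code only groups the boxes and computes crop rectangles;
cropping the image and running Tesseract again on each crop is left to
whatever imaging library you use.

```python
from statistics import median


def split_by_height(boxes, ratio=1.6):
    """Group consecutive word boxes whose heights are similar.

    boxes: list of (text, left, top, width, height) in reading order
    (a hypothetical layout chosen for this sketch).
    A new group starts when a word's height differs from the median
    height of the current group by more than a factor of `ratio`.
    """
    groups = []
    for box in boxes:
        height = box[4]
        if groups:
            ref = median(b[4] for b in groups[-1])
            big, small = max(height, ref), min(height, ref)
            if small > 0 and big / small <= ratio:
                groups[-1].append(box)
                continue
        groups.append([box])
    return groups


def crop_rect(group, margin=4):
    """Return (left, top, right, bottom) enclosing a group, with padding."""
    left = min(b[1] for b in group) - margin
    top = min(b[2] for b in group) - margin
    right = max(b[1] + b[3] for b in group) + margin
    bottom = max(b[2] + b[4] for b in group) + margin
    # Clamp only the top-left corner; the image size is not known here.
    return (max(left, 0), max(top, 0), right, bottom)


# Made-up boxes for a line like "contract number: 214587",
# where the number is roughly twice as tall as the label.
line = [
    ("contract", 10, 40, 80, 20),
    ("number:", 100, 40, 70, 20),
    ("214587", 180, 20, 150, 42),
]
groups = split_by_height(line)
rects = [crop_rect(g) for g in groups]
```

Each rectangle in rects would then be cropped out and OCRed on its own.
The ratio needs tuning, since ordinary words also differ in height
depending on ascenders and descenders.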

From initial tests I have run, I think Tesseract 3.01 improves on this
issue. I will try your images and post back here with my results.


On Mar 9, 9:56 am, Søren Engel <> wrote:
> Hello fellow members,
> I am currently upgrading an old OCR module at our development team,
> which was originally written in VB using the integrated OCR
> components within Microsoft Office 2003. Since this is discontinued
> (as Microsoft has discarded the COM components from the Office
> distributions), I have been trying to locate a fitting module to
> take its spot, leading me to this site and binaries.
> My problem is as follows. I am performing OCR on a basic invoice
> coming from a fax, using the tesseract-ocr command-line app. The
> invoice contains a sentence like
> -- contract number: 214587 --
> Since the numbers may in some cases be twice the size of the other
> text, I noticed that the text is just messy and the numbers are not
> included in the output. I suspect that the OCR has an upper
> threshold on how large a character may be, compared to other
> characters in the image, before it is considered not to be text and
> thus discarded from the output.
> Since I am new to the concept of OCR and our team doesn't have
> members with a strong background in image processing, I am a bit
> unsure how to tackle this issue.
> The images I have been using for my initial tests may be found at
> I should mention that the image test.tif is at its original size and
> gives bogus results, whereas test_01.tif is scaled to about 33% of
> the original size and produces the correct result. However, it is not
> a viable solution just to scale the images, since this may lead to
> incorrect results for the rest of the document (note: not everything
> is included in these images due to privacy restrictions).
> I should also mention that the old Microsoft COM component had no
> problems handling these images.
> I hope that I can get some feedback on this matter.
> Cheers,
> Kind regards
> Søren Engel
