Trouble recognizing characters in images with different character size

Søren Engel Wed, 09 Mar 2011 07:19:14 -0800

Hello fellow members,

I am currently upgrading an old OCR module at our
development team, which was originally written in VB using the
integrated OCR components of Microsoft Office 2003. Since this is
discontinued (Microsoft has dropped the COM components from the
Office distributions), I have been trying to locate a fitting module
to take its place, which led me to this site and its binaries.


My problem is as follows. I am performing OCR with the tesseract-ocr
command-line app on a basic invoice that arrived by fax, and the
invoice contains a sentence like

-- contract number: 214587 --

Since the numbers may in some cases be twice the size of the other
text, I noticed that the text comes out garbled and the numbers are not
included in the output. I suspect that the OCR has an upper
threshold on how large a character may be, compared to the other
characters in the image, before it is no longer considered text and is
discarded from the output.

Since I am new to the concept of OCR and our team has no
members with a strong background in image processing, I am a bit
unsure how to tackle this issue.

The images I have been using for my initial tests may be found at

http://www.ge.tt/4KxYgTU
http://ge.tt/29sDgTX

I should mention that the image test.tif is at its original size and
gives bogus results, whereas test_01.tif is scaled to about 33% of
the original size and produces the correct result. However, simply
scaling the images is not a viable solution, since this may lead to
incorrect results for the rest of the document (note: not everything
is included in these images due to privacy restrictions).
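
For anyone who wants to reproduce the scaling test described above, here
is a minimal sketch. It assumes Python with the Pillow library installed
and the tesseract binary on the PATH. The helper names and the default
factor of 0.33 are illustrative assumptions only, not part of our actual
module.

```python
import subprocess


def scaled_size(width, height, factor):
    """Return (width, height) after scaling by factor, at least 1 px each."""
    return max(1, round(width * factor)), max(1, round(height * factor))


def tesseract_command(image_path, output_base):
    """Build the basic CLI call: tesseract <image> <output base name>."""
    return ["tesseract", image_path, output_base]


def ocr_scaled(src_path, dst_path, output_base, factor=0.33):
    """Scale src_path by factor, save it to dst_path, then OCR the result."""
    # Pillow is imported lazily so the two helpers above work without it.
    from PIL import Image

    with Image.open(src_path) as img:
        # Convert to greyscale first so that bilevel fax scans are
        # resampled smoothly instead of with nearest-neighbour.
        grey = img.convert("L")
        size = scaled_size(grey.width, grey.height, factor)
        grey.resize(size, Image.LANCZOS).save(dst_path)

    # tesseract writes its result to <output_base>.txt
    subprocess.run(tesseract_command(dst_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

A call such as ocr_scaled("test.tif", "test_01.tif", "test_01") would
correspond to the second test image. Trying a few different factors on
the same page shows quickly whether one scale works for both the large
numbers and the surrounding text.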

I should also mention that the old Microsoft COM component had no
problems handling these images.

I hope that I can get some feedback on this matter.

Cheers,
Kind regards
Søren Engel
