This is a known issue with Tesseract. One solution is to post-process
the OCR results, detect the size discrepancy between the two parts of
the line, and then re-process each part as a separate image. In
essence, doing that prevents Tesseract from drawing bad inferences.
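
A minimal sketch of the detection step, assuming you already have
per-word bounding boxes for one text line from a first OCR pass. The
function names, the (left, top, width, height) box layout, and the 1.5
height ratio are my own illustrative choices, not anything Tesseract
provides:

```python
from statistics import median


def split_by_height(boxes, ratio=1.5):
    """Group consecutive word boxes on one line by similar height.

    boxes: list of (left, top, width, height) tuples in reading order.
    A new group starts when a box's height differs from the median
    height of the current group by more than `ratio` (hypothetical
    threshold; tune it on your own documents).
    """
    groups = []
    for box in boxes:
        height = box[3]
        if height <= 0:
            continue  # skip degenerate boxes
        if groups:
            ref = median(b[3] for b in groups[-1])
            if max(height, ref) / min(height, ref) <= ratio:
                groups[-1].append(box)
                continue
        groups.append([box])
    return groups


def bounding_box(group):
    """Return (left, top, right, bottom) enclosing all boxes in a group."""
    left = min(b[0] for b in group)
    top = min(b[1] for b in group)
    right = max(b[0] + b[2] for b in group)
    bottom = max(b[1] + b[3] for b in group)
    return (left, top, right, bottom)


# Made-up boxes: two small words followed by one word twice as tall.
line = [(10, 40, 120, 20), (140, 40, 110, 21), (270, 20, 200, 42)]
regions = [bounding_box(g) for g in split_by_height(line)]
```

Each region could then be cropped out of the original image and run
through Tesseract on its own, so the large digits are not judged
against the smaller text beside them.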

From initial tests I have run, I think Tesseract 3.01 improves on this
issue. I will try your images and post back here with my findings.

Patrick

On Mar 9, 9:56 am, Søren Engel <soren.en...@gmail.com> wrote:
> Hello fellow members,
>
> I am currently working on upgrading an old OCR module at our
> development team, which was originally written in VB using the
> integrated OCR components within Microsoft Office 2003. Since this is
> discontinued (as Microsoft has discarded the COM components from the
> Office distributions), I have been trying to locate a fitting module
> to take its spot, leading me to this site and binaries.
>
> My problem now stands as follows: while performing OCR on a basic
> invoice coming from a fax using the tesseract-ocr command-line app,
> where the invoice contains a sentence like
>
> -- contract number: 214587 --
>
> I noticed, since the numbers may in some cases be twice the size of
> the other text, that the text is just messy and the numbers are not
> included in the output. I somehow suspect that the OCR has an upper
> threshold limit on how large a character may be, compared to other
> characters in the image, before it is considered not to be text and
> thus discarded from the output.
>
> Since I am new to the concept of OCR and our team doesn't have
> members with a strong background in image processing, I am a bit
> unsure how to tackle this issue.
>
> The images I have been using for my initial tests may be found at
>
> http://www.ge.tt/4KxYgTU
> http://ge.tt/29sDgTX
>
> I should mention that the image test.tif is at its original size and
> gives bogus results, whereas the test_01.tif is scaled about 33% of
> the original size and produces the correct result. However, it is not
> a viable solution just to scale the images, since this may lead to
> incorrect results for the rest of the document (note: not everything
> is included in these images due to privacy restrictions).
>
> I should also mention that the old Microsoft COM component had no
> problems handling these images.
>
> I hope that I can get some feedback on this matter.
>
> Cheers,
> Kind regards
> Søren Engel

