Hello fellow members,

I am currently upgrading an old OCR module at our development team. It was originally written in VB, using the OCR components integrated in Microsoft Office 2003. Since these are discontinued (Microsoft has dropped the COM components from the Office distributions), I have been trying to locate a fitting module to take their place, which led me to this site and its binaries.
My problem is as follows. I am running OCR on a basic invoice, received by fax, using the tesseract-ocr command-line app. The invoice contains a line like "contract number: 214587". Because the numbers may in some cases be twice the size of the other text, the recognized text is messy and the numbers are missing from the output.

I suspect that the OCR has an upper threshold on how large a character may be, compared to the other characters in the image, before it is no longer considered text and is discarded from the output. Since I am new to the concept of OCR, and our team has no members with a strong background in image processing, I am unsure how to tackle this issue.

The images I have been using for my initial tests can be found at:

http://www.ge.tt/4KxYgTU
http://ge.tt/29sDgTX

The image test.tif is at its original size and gives bogus results, whereas test_01.tif is scaled to about 33% of the original size and produces the correct result. However, simply scaling the images is not a viable solution, since it may lead to incorrect results for the rest of the document (note: not everything is included in these images due to privacy restrictions).

I should also mention that the old Microsoft COM component had no problems handling these images.

I hope that I can get some feedback on this matter.

Kind regards,
Søren Engel
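
For anyone wanting to reproduce the comparison between the two images, a minimal sketch is given below. It is not the setup used in the post, only an illustration under some assumptions: the `tesseract` binary is on the PATH, it is invoked in its basic `tesseract <image> <outputbase>` form (text is written to `<outputbase>.txt`), and both test.tif and test_01.tif sit in the working directory. The helper names `tesseract_command`, `run_ocr` and `compare` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base):
    """Build the basic CLI call: tesseract <image> <outputbase>."""
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path, output_base):
    """Run tesseract on one image and return the recognized text.

    Returns None if the tesseract binary cannot be found on the PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    # tesseract appends ".txt" to the output base name.
    out_file = Path(str(output_base) + ".txt")
    return out_file.read_text(encoding="utf-8", errors="replace")


def compare(image_paths):
    """OCR each existing image and map file name -> recognized text."""
    results = {}
    for path in image_paths:
        path = Path(path)
        if not path.exists():
            continue  # skip images that are not present locally
        # e.g. test.tif -> output base "test" -> test.txt
        results[path.name] = run_ocr(path, path.with_suffix(""))
    return results


# Example (assumes both files were downloaded from the links above):
# for name, text in compare(["test.tif", "test_01.tif"]).items():
#     print(name, repr(text))
```

Printing the two outputs side by side makes it easy to check whether the line with the contract number survives in the full-size image or only in the scaled one.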