I am in the same situation. Here is what I have experienced. It
helps to remove non-text elements from the image, such as underlining,
graphics, boxes, lines, and shading. I have heard that grayscale and
black-and-white images work better than color. If you follow the
training document and make a box file from your input image, then view
it with bbtesseract, you can see more clearly where tesseract is going
astray. You might need to threshold the images of some web pages. Some
of my other posts cover related topics.
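
In case it helps, here is a rough sketch of the grayscale and
thresholding steps in plain Python. It is only an illustration, not
what tesseract does internally: picking the threshold with Otsu's
method is my own choice, the function names (to_gray, otsu_threshold,
binarize) are made up for this example, and the image is assumed to be
a list of rows of (r, g, b) tuples that you have already loaded with
whatever image library you use.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_rows]


def otsu_threshold(gray_rows):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    # Histogram of the 256 possible gray levels.
    hist = [0] * 256
    for row in gray_rows:
        for v in row:
            hist[v] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels this dark yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray_rows, t):
    """Map pixels at or below t to 0 (black) and all others to 255 (white)."""
    return [[0 if v <= t else 255 for v in row] for row in gray_rows]


# Tiny made-up example: dark "text" pixels (40) on a light background (200).
gray = [[200, 200, 40],
        [200, 40, 200]]
bw = binarize(gray, otsu_threshold(gray))
```

A single global threshold like this assumes dark text on a fairly
even, lighter background. Web pages with gradients or colored panels
may need a different threshold for each region.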

Can you post a summary of what kinds of resizing you have tried? Are
there any that work better than cubic in some cases?

On Nov 27, 2:54 am, philip <philip14...@gmail.com> wrote:
> Hi,
>
> I am doing text recognition of small fonts. Typically at the size you
> see on web-pages.
>
> I found that if I resize the image to enlarge it by three or even up
> to five times the size (I use cubic interpolation resize in Gimp), this
> improves the recognition of text by this program greatly.
>
> Is there any other image pre-processing I could do to improve
> recognition rates?
>
> Thanks, Philip


