I am in the same situation, and here is what I have found so far. It helps to remove non-text elements from the image, such as underlining, graphics, boxes, lines, and shading. I have heard that grayscale and black-and-white images work better than color. If you follow the training document and make a box file from your input image, then view it with bbtesseract, you can see more clearly where tesseract is going astray. You might need to do some thresholding on some web pages. Some of my other posts cover related topics.
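To make the grayscale and thresholding steps concrete, here is a rough sketch in plain Python. It is only an illustration under my own assumptions: the function names are made up for this example, the image is just a list of rows of (r, g, b) tuples, and the threshold is picked with Otsu's method, which is one common choice and not something I have confirmed to be best for tesseract. In practice you would probably do the same thing with an image editor or an imaging library.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray levels.

    Uses the common ITU-R BT.601 luma weights (an assumption of this sketch).
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def otsu_threshold(gray_rows):
    """Pick the gray level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray_rows:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(gray_rows, threshold):
    """Map pixels at or below the threshold to 0 (black), all others to 255 (white)."""
    return [[0 if value <= threshold else 255 for value in row] for row in gray_rows]


if __name__ == "__main__":
    # Tiny made-up image: dark "text" pixels on a light bluish background.
    dark, light = (30, 30, 30), (200, 220, 240)
    image = [[dark, light], [light, dark]]
    gray = to_grayscale(image)
    print(binarize(gray, otsu_threshold(gray)))
```

The idea is that colored or shaded backgrounds collapse to plain white before the image goes to tesseract, leaving only the text in black.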
Can you post a summary of what kinds of resizing you have tried? Are there any that work better than cubic in some cases?

On Nov 27, 2:54 am, philip <philip14...@gmail.com> wrote:
> Hi,
>
> I am doing text recognition of small fonts. Typically at the size you
> see on web-pages.
>
> I found that if I resize the image to enlarge it by three or even up
> to five times the size, I use cubic interpolation resize in Gimp, this
> improves the recognition of text by this program greatly.
>
> Is there any other image pre-processing I could do to improve
> recognition rates?
>
> Thanks, Philip