Not sure if I'm being helpful, but it sounds like either your input image is noisy or the thresholding algorithm incorrectly separated foreground from background. If it's the former, reducing noise in the original image should help. If it's the latter, you probably need to choose a thresholding algorithm better suited to your input image.
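For the noise case, here is a rough sketch of one way to drop tiny specks before the image ever reaches Tesseract. It is plain Python, not a Tesseract setting: it assumes the image is already binarized into a list of rows with 1 for ink and 0 for background, and the function name and the min_height parameter are made up for illustration. It erases every 8-connected blob whose bounding box is shorter than min_height pixels.

```python
from collections import deque


def remove_small_components(image, min_height=6):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    every 8-connected component shorter than min_height pixels is erased.

    Hypothetical helper for illustration; not part of Tesseract.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    cleaned = [row[:] for row in image]          # leave the input untouched
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue

            # Breadth-first search collects one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            top = bottom = r
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                top = min(top, y)
                bottom = max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))

            # Erase the component if its bounding box is too short.
            if bottom - top + 1 < min_height:
                for y, x in pixels:
                    cleaned[y][x] = 0

    return cleaned
```

Keep in mind that a filter like this works per blob, not per text row, so it would also remove legitimate small marks such as periods or the dots on an "i". A threshold somewhere between the speck size and the smallest punctuation you care about is probably the safest starting point.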
That said, I don't know how to suppress small rows efficiently.

Andrei

On Jan 17, 11:55 am, patrickq <[email protected]> wrote:
> I am scanning images with large, clear text but on a grainy background
> and although I get the text fine, I also get myriads of irrelevant
> letters with a size of 3 or 5 pixels (way below a size at which
> anything could be recognized accurately). I could eliminate them based
> on size post-OCR but meanwhile Tesseract spent minutes recognizing
> these characters. Could someone please point me to the right variable
> (s) to tell Tesseract to not attempt recognition (and ideally not
> return boxes at the layout analysis phase) below a certain size?
>
> I assume that the variable in question regards the min expected height
> of a row (rather than of individual characters) since a dot ('.') for
> example can be quite small even within a row with normal sized
> letters.
>
> Thanks!

